* Persistent Reservation API V3
@ 2015-08-26 16:03 Christoph Hellwig
2015-08-26 16:03 ` [PATCH 1/5] block: cleanup blkdev_ioctl Christoph Hellwig
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2015-08-26 16:03 UTC (permalink / raw)
This series adds support for a simplified Persistent Reservation API
to the block layer. The intent is that both in-kernel and userspace
consumers can use the API instead of having to hand-craft SCSI or NVMe
commands through the various pass-through interfaces. It also adds
DM support, as getting reservations through dm-multipath is a major
pain with the current scheme.

NVMe support currently isn't included as I don't have a multihost
NVMe setup to test on, but Keith offered to test it and I'll have
a patch for it shortly.

The ioctl API is documented in Documentation/block/pr.txt, but to
fully understand the concept you'll have to read up on the SPC spec.
PRs are so complicated that trying to rephrase them into different
terminology is just going to create confusion.

Note that Mike wants to include the DM patches through the DM
tree, so they are only included for reference.

I also have a set of simple test tools available at:

	git://git.infradead.org/users/hch/pr-tests.git

Changes since V2:
 - added an ignore flag to the reserve operation as well, and redid
   the ioctl API to have general flags fields
 - rebased on top of the latest block layer tree updates
Changes since V1:
 - rename DM ->ioctl to ->prepare_ioctl
 - rename dm_get_ioctl_table to dm_get_live_table_for_ioctl
 - merge two DM patches into one
 - various spelling fixes
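For readers who want a feel for what a userspace consumer of the API might look like, here is a minimal sketch. It assumes the uapi header <linux/pr.h> as it was later merged into mainline (struct pr_registration, struct pr_reservation, IOC_PR_REGISTER, IOC_PR_RESERVE, PR_WRITE_EXCLUSIVE); the names and layout in this V3 posting may differ. The helper pr_register_and_reserve() is made up for illustration, and the handling of positive return values is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/pr.h>

/*
 * Hypothetical helper: register `key` on the block device open at `fd`,
 * then take a Write Exclusive reservation with that key.
 *
 * Returns 0 on success, a negative errno value if the ioctl itself
 * failed, or the positive value the ioctl returned (assumed here to
 * be a device status).
 */
int pr_register_and_reserve(int fd, uint64_t key)
{
	struct pr_registration reg;
	struct pr_reservation rsv;
	int ret;

	memset(&reg, 0, sizeof(reg));
	reg.old_key = 0;	/* no previous registration on this path */
	reg.new_key = key;
	ret = ioctl(fd, IOC_PR_REGISTER, &reg);
	if (ret < 0)
		return -errno;
	if (ret > 0)
		return ret;

	memset(&rsv, 0, sizeof(rsv));
	rsv.key = key;
	rsv.type = PR_WRITE_EXCLUSIVE;
	ret = ioctl(fd, IOC_PR_RESERVE, &rsv);
	if (ret < 0)
		return -errno;
	return ret;
}
```

Called with a descriptor for a block device that supports reservations, for example pr_register_and_reserve(fd, 0xabcd), a return of 0 would mean the key is registered and the reservation is held. No SCSI or NVMe command has to be built by hand.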
* [PATCH 1/5] block: cleanup blkdev_ioctl
2015-08-26 16:03 Persistent Reservation API V3 Christoph Hellwig
@ 2015-08-26 16:03 ` Christoph Hellwig
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2015-08-26 16:03 UTC (permalink / raw)
Split out helpers for all non-trivial ioctls to make this function simpler,
and also start passing around a pointer version of the argument, as that's
what most ioctl handlers actually need.

Signed-off-by: Christoph Hellwig <hch at lst.de>
---
 block/ioctl.c | 227 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 127 insertions(+), 100 deletions(-)

diff --git a/block/ioctl.c b/block/ioctl.c
index 8061eba..df62b47 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -193,10 +193,20 @@ int blkdev_reread_part(struct block_device *bdev)
 }
 EXPORT_SYMBOL(blkdev_reread_part);
 
-static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
-			     uint64_t len, int secure)
+static int blk_ioctl_discard(struct block_device *bdev, fmode_t mode,
+		unsigned long arg, unsigned long flags)
 {
-	unsigned long flags = 0;
+	uint64_t range[2];
+	uint64_t start, len;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
+		return -EFAULT;
+
+	start = range[0];
+	len = range[1];
 
 	if (start & 511)
 		return -EINVAL;
@@ -207,14 +217,24 @@ static int blk_ioctl_discard(struct block_device *bdev, uint64_t start,
 
 	if (start + len > (i_size_read(bdev->bd_inode) >> 9))
 		return -EINVAL;
-	if (secure)
-		flags |= BLKDEV_DISCARD_SECURE;
 	return blkdev_issue_discard(bdev, start, len, GFP_KERNEL, flags);
 }
 
-static int blk_ioctl_zeroout(struct block_device *bdev, uint64_t start,
-			     uint64_t len)
+static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
+		unsigned long arg)
 {
+	uint64_t range[2];
+	uint64_t start, len;
+
+	if (!(mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (copy_from_user(range, (void __user *)arg, sizeof(range)))
+		return -EFAULT;
+
+	start = range[0];
+	len = range[1];
+
 	if (start & 511)
 		return -EINVAL;
 	if (len & 511)
@@ -295,89 +315,115 @@ static inline int is_unrecognized_ioctl(int ret)
 		ret == -ENOIOCTLCMD;
 }
 
-/*
- * always keep this in sync with compat_blkdev_ioctl()
- */
-int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
-			unsigned long arg)
+static int blkdev_flushbuf(struct block_device *bdev, fmode_t mode,
+		unsigned cmd, unsigned long arg)
 {
-	struct gendisk *disk = bdev->bd_disk;
-	struct backing_dev_info *bdi;
-	loff_t size;
-	int ret, n;
-	unsigned int max_sectors;
+	int ret;
 
-	switch(cmd) {
-	case BLKFLSBUF:
-		if (!capable(CAP_SYS_ADMIN))
-			return -EACCES;
-
-		ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
-		if (!is_unrecognized_ioctl(ret))
-			return ret;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
 
-		fsync_bdev(bdev);
-		invalidate_bdev(bdev);
-		return 0;
+	ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
+	if (!is_unrecognized_ioctl(ret))
+		return ret;
 
-	case BLKROSET:
-		ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
-		if (!is_unrecognized_ioctl(ret))
-			return ret;
-		if (!capable(CAP_SYS_ADMIN))
-			return -EACCES;
-		if (get_user(n, (int __user *)(arg)))
-			return -EFAULT;
-		set_device_ro(bdev, n);
-		return 0;
+	fsync_bdev(bdev);
+	invalidate_bdev(bdev);
+	return 0;
+}
 
-	case BLKDISCARD:
-	case BLKSECDISCARD: {
-		uint64_t range[2];
+static int blkdev_roset(struct block_device *bdev, fmode_t mode,
+		unsigned cmd, unsigned long arg)
+{
+	int ret, n;
 
-		if (!(mode & FMODE_WRITE))
-			return -EBADF;
+	ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
+	if (!is_unrecognized_ioctl(ret))
+		return ret;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+	if (get_user(n, (int __user *)arg))
+		return -EFAULT;
+	set_device_ro(bdev, n);
+	return 0;
+}
 
-		if (copy_from_user(range, (void __user *)arg, sizeof(range)))
-			return -EFAULT;
+static int blkdev_getgeo(struct block_device *bdev,
+		struct hd_geometry __user *argp)
+{
+	struct gendisk *disk = bdev->bd_disk;
+	struct hd_geometry geo;
+	int ret;
 
-		return blk_ioctl_discard(bdev, range[0], range[1],
-					 cmd == BLKSECDISCARD);
-	}
-	case BLKZEROOUT: {
-		uint64_t range[2];
+	if (!argp)
+		return -EINVAL;
+	if (!disk->fops->getgeo)
+		return -ENOTTY;
+
+	/*
+	 * We need to set the startsect first, the driver may
+	 * want to override it.
+	 */
+	memset(&geo, 0, sizeof(geo));
+	geo.start = get_start_sect(bdev);
+	ret = disk->fops->getgeo(bdev, &geo);
+	if (ret)
+		return ret;
+	if (copy_to_user(argp, &geo, sizeof(geo)))
+		return -EFAULT;
+	return 0;
+}
 
-		if (!(mode & FMODE_WRITE))
-			return -EBADF;
+/* set the logical block size */
+static int blkdev_bszset(struct block_device *bdev, fmode_t mode,
+		int __user *argp)
+{
+	int ret, n;
 
-		if (copy_from_user(range, (void __user *)arg, sizeof(range)))
-			return -EFAULT;
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+	if (!argp)
+		return -EINVAL;
+	if (get_user(n, argp))
+		return -EFAULT;
 
-		return blk_ioctl_zeroout(bdev, range[0], range[1]);
+	if (!(mode & FMODE_EXCL)) {
+		bdgrab(bdev);
+		if (blkdev_get(bdev, mode | FMODE_EXCL, &bdev) < 0)
+			return -EBUSY;
 	}
 
-	case HDIO_GETGEO: {
-		struct hd_geometry geo;
+	ret = set_blocksize(bdev, n);
+	if (!(mode & FMODE_EXCL))
+		blkdev_put(bdev, mode | FMODE_EXCL);
+	return ret;
+}
 
-		if (!arg)
-			return -EINVAL;
-		if (!disk->fops->getgeo)
-			return -ENOTTY;
-
-		/*
-		 * We need to set the startsect first, the driver may
-		 * want to override it.
-		 */
-		memset(&geo, 0, sizeof(geo));
-		geo.start = get_start_sect(bdev);
-		ret = disk->fops->getgeo(bdev, &geo);
-		if (ret)
-			return ret;
-		if (copy_to_user((struct hd_geometry __user *)arg, &geo,
-				 sizeof(geo)))
-			return -EFAULT;
-		return 0;
-	}
+/*
+ * always keep this in sync with compat_blkdev_ioctl()
+ */
+int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
+		unsigned long arg)
+{
+	struct backing_dev_info *bdi;
+	void __user *argp = (void __user *)arg;
+	loff_t size;
+	unsigned int max_sectors;
+
+	switch (cmd) {
+	case BLKFLSBUF:
+		return blkdev_flushbuf(bdev, mode, cmd, arg);
+	case BLKROSET:
+		return blkdev_roset(bdev, mode, cmd, arg);
+	case BLKDISCARD:
+		return blk_ioctl_discard(bdev, mode, arg, 0);
+	case BLKSECDISCARD:
+		return blk_ioctl_discard(bdev, mode, arg,
+				BLKDEV_DISCARD_SECURE);
+	case BLKZEROOUT:
+		return blk_ioctl_zeroout(bdev, mode, arg);
+	case HDIO_GETGEO:
+		return blkdev_getgeo(bdev, argp);
 	case BLKRAGET:
 	case BLKFRAGET:
 		if (!arg)
@@ -414,28 +460,11 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 		bdi->ra_pages = (arg * 512) / PAGE_CACHE_SIZE;
 		return 0;
 	case BLKBSZSET:
-		/* set the logical block size */
-		if (!capable(CAP_SYS_ADMIN))
-			return -EACCES;
-		if (!arg)
-			return -EINVAL;
-		if (get_user(n, (int __user *) arg))
-			return -EFAULT;
-		if (!(mode & FMODE_EXCL)) {
-			bdgrab(bdev);
-			if (blkdev_get(bdev, mode | FMODE_EXCL, &bdev) < 0)
-				return -EBUSY;
-		}
-		ret = set_blocksize(bdev, n);
-		if (!(mode & FMODE_EXCL))
-			blkdev_put(bdev, mode | FMODE_EXCL);
-		return ret;
+		return blkdev_bszset(bdev, mode, argp);
 	case BLKPG:
-		ret = blkpg_ioctl(bdev, (struct blkpg_ioctl_arg __user *) arg);
-		break;
+		return blkpg_ioctl(bdev, argp);
 	case BLKRRPART:
-		ret = blkdev_reread_part(bdev);
-		break;
+		return blkdev_reread_part(bdev);
 	case BLKGETSIZE:
 		size = i_size_read(bdev->bd_inode);
 		if ((size >> 9) > ~0UL)
@@ -447,11 +476,9 @@ int blkdev_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd,
 	case BLKTRACESTOP:
 	case BLKTRACESETUP:
 	case BLKTRACETEARDOWN:
-		ret = blk_trace_ioctl(bdev, cmd, (char __user *) arg);
-		break;
+		return blk_trace_ioctl(bdev, cmd, argp);
 	default:
-		ret = __blkdev_driver_ioctl(bdev, mode, cmd, arg);
+		return __blkdev_driver_ioctl(bdev, mode, cmd, arg);
 	}
-	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_ioctl);
-- 
1.9.1
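The argument checks that blk_ioctl_discard() performs after this patch can be modelled in userspace. The sketch below is not kernel code. check_discard_range() and its dev_bytes parameter (standing in for i_size_read(bdev->bd_inode)) are hypothetical, and the byte-to-sector shift is an assumption about the lines that fall between the two hunks shown in the diff.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the range checks in blk_ioctl_discard().
 * range[0] is the start offset and range[1] the length, both in bytes.
 * On success the range is stored in 512-byte sectors and 0 is returned;
 * otherwise -EINVAL is returned and the outputs are left untouched.
 */
int check_discard_range(const uint64_t range[2], uint64_t dev_bytes,
			uint64_t *start_sect, uint64_t *nr_sects)
{
	uint64_t start = range[0];
	uint64_t len = range[1];

	/* both values must be multiples of the 512-byte sector size */
	if (start & 511)
		return -EINVAL;
	if (len & 511)
		return -EINVAL;

	/* assumed: the lines not shown in the diff convert bytes to sectors */
	start >>= 9;
	len >>= 9;

	/* the range must not run past the end of the device */
	if (start + len > (dev_bytes >> 9))
		return -EINVAL;

	*start_sect = start;
	*nr_sects = len;
	return 0;
}
```

Because the helper now receives the raw user argument and the open mode, the FMODE_WRITE test and the copy_from_user() call sit next to these checks instead of being repeated in each case of the switch.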
* Persistent Reservation API V3
@ 2015-08-26 16:06 Christoph Hellwig
2015-08-26 16:10 ` Christoph Hellwig
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2015-08-26 16:06 UTC (permalink / raw)
* Persistent Reservation API V3
2015-08-26 16:06 Persistent Reservation API V3 Christoph Hellwig
@ 2015-08-26 16:10 ` Christoph Hellwig
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2015-08-26 16:10 UTC (permalink / raw)
Meh, looks like the train wifi is too bad to send out a whole patch
series. I'll resend once I've arrived..
* Persistent Reservation API V3
@ 2015-08-26 16:56 Christoph Hellwig
2015-08-29 1:33 ` Jeremy Linton
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2015-08-26 16:56 UTC (permalink / raw)
* Persistent Reservation API V3
2015-08-26 16:56 Christoph Hellwig
@ 2015-08-29 1:33 ` Jeremy Linton
2015-08-29 13:52 ` Christoph Hellwig
0 siblings, 1 reply; 7+ messages in thread
From: Jeremy Linton @ 2015-08-29 1:33 UTC (permalink / raw)
Hello,

So, looking at this, I don't see how it supports the algorithm I've been
using for years. For that algorithm to successfully migrate PRs across
multiple paths on a single machine without affecting other possible users
(who may legitimately have PR'ed the same device) I need PR_IN SA 1,
READ RESERVATIONS to ensure the current node owns the reservation before
attempting to preempt it on another path. This can also ensure that the
device hasn't been reserved with a legacy reservation.

So, this leads me to two more general questions. The first is why isn't
the PR API simply exported to filesystems as a general reserve/release so
that the PR happens during mount/dismount. Then DM and friends can be set
up to transparently migrate or share the reservation, rather than
depending on userspace to handle these operations...

Also, it seems to me the use of CLEAR is extremely dangerous in any
environment where actual arbitration or sharing of the resource is taking
place.

thanks,
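The check-before-preempt step described in the message above can be illustrated with a small in-memory model. This is only a guess at the shape of the algorithm, which the message does not spell out. struct pr_state, enum migrate_result and migrate_reservation() are hypothetical names, and no real device or PR command is involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what a READ RESERVATION would report. */
struct pr_state {
	bool reserved;		/* a persistent reservation is held */
	uint64_t holder_key;	/* key of the current holder */
};

enum migrate_result {
	MIGRATE_OK,
	MIGRATE_NOT_RESERVED,
	MIGRATE_FOREIGN_HOLDER,
};

/*
 * Move the reservation from old_key to new_key, but only after
 * confirming that old_key is the current holder.  The assignment
 * stands in for a preempt issued on the new path.
 */
enum migrate_result migrate_reservation(struct pr_state *dev,
					uint64_t old_key, uint64_t new_key)
{
	if (!dev->reserved)
		return MIGRATE_NOT_RESERVED;
	if (dev->holder_key != old_key)
		return MIGRATE_FOREIGN_HOLDER;	/* someone else owns it */

	dev->holder_key = new_key;
	return MIGRATE_OK;
}
```

Without a way to read the current holder, the second check cannot be made, which is why a write-only API risks preempting a reservation that another user legitimately holds.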
* Persistent Reservation API V3
2015-08-29 1:33 ` Jeremy Linton
@ 2015-08-29 13:52 ` Christoph Hellwig
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2015-08-29 13:52 UTC (permalink / raw)
On Fri, Aug 28, 2015 at 08:33:24PM -0500, Jeremy Linton wrote:
> Hello,
> So, looking at this, I don't see how it supports the algorithm I've been using
> for years. For that algorithm to successfully migrate PRs across multiple paths
> on a single machine without affecting other possible users (who may legitimately
> have PR'ed the same device) I need PR_IN SA 1, READ RESERVATIONS to ensure the
> current node owns the reservation before attempting to preempt it on another
> path. This can also ensure that the device hasn't been reserved with a legacy
> reservation.

Do you have any code describing this in more detail? We could add the
read side as well if there is strong interest.

> So, this leads me to two more general questions. The first is why isn't the PR
> API simply exported to filesystems as a general reserve/release so that the PR
> happens during mount/dismount. Then DM and friends can be set up to transparently
> migrate or share the reservation, rather than depending on userspace to handle
> these operations...

The API can be used by file systems, and my upcoming NFS SCSI layout
support was the main reason to write this.

> Also, it seems to me the use of CLEAR is extremely dangerous in any environment
> where actual arbitration or sharing of the resource is taking place.

Yes, but having it as a specific API isn't any less dangerous than having
it issued using SG_IO. Reservations really only make sense if you assume
every user of an LU is actually cooperating in some way and not actively
hostile.