* [PATCH v6 01/11] documentation: Block Device Filtering Mechanism
2023-11-24 16:04 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
@ 2023-11-24 16:04 ` Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 02/11] block: " Sergei Shtepa
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:04 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, viro, brauner,
gregkh, arnd, christian.koenig, yi.l.liu, jirislaby, stfrench,
jpanis, jgg, contact, dchinner, jack, linux, min15.li, dlemoal,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The document contains:
* A description of the purpose of the mechanism
* A little historical background on how the Linux kernel has handled
I/O units
* A brief description of the design
* A reference to the interface description
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
Documentation/block/blkfilter.rst | 66 +++++++++++++++++++++++++++++++
Documentation/block/index.rst | 1 +
MAINTAINERS | 6 +++
3 files changed, 73 insertions(+)
create mode 100644 Documentation/block/blkfilter.rst
diff --git a/Documentation/block/blkfilter.rst b/Documentation/block/blkfilter.rst
new file mode 100644
index 000000000000..4e148e78f3d4
--- /dev/null
+++ b/Documentation/block/blkfilter.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Block Device Filtering Mechanism
+================================
+
+The block device filtering mechanism provides the ability to attach block
+device filters. Block device filters allow performing additional processing
+for I/O units.
+
+Introduction
+============
+
+The idea of handling I/O units on block devices is not new. Back in the
+2.6 kernel, there was an undocumented possibility of handling I/O units
+by substituting the make_request_fn() function, which belonged to the
+request_queue structure. But none of the in-tree kernel modules used this
+feature, and it was eliminated in the 5.10 kernel.
+
+The block device filtering mechanism restores the ability to handle I/O units.
+It is possible to safely attach a filter to a block device "on the fly" without
+changing the structure of the block device's stack.
+
+Only one filter can be attached to a block device at a time, because there is
+only one filter implementation in the kernel so far.
+See Documentation/block/blksnap.rst.
+
+Design
+======
+
+The block device filtering mechanism provides registration and unregistration
+for filter operations. The struct blkfilter_operations contains a pointer to
+the callback functions for the filter. After registering the filter operations,
+the filter can be managed using block device ioctls BLKFILTER_ATTACH,
+BLKFILTER_DETACH and BLKFILTER_CTL.
+
+When the filter is attached, the callback function is called for each I/O unit
+for a block device, providing I/O unit filtering. Depending on the result of
+filtering the I/O unit, it can either be passed for subsequent processing by
+the block layer, or skipped.
+
+The filter can be implemented as a loadable module. In this case, the filter
+module cannot be unloaded while the filter is attached to at least one of the
+block devices.
+
+Interface description
+=====================
+
+The ioctl BLKFILTER_ATTACH allows user-space programs to attach a block device
+filter to a block device. The ioctl BLKFILTER_DETACH allows user-space programs
+to detach it. Both ioctls use &struct blkfilter_name. The ioctl BLKFILTER_CTL
+allows user-space programs to send a filter-specific command. It uses &struct
+blkfilter_ctl.
+
+.. kernel-doc:: include/uapi/linux/blk-filter.h
+
+To register in the system, the filter uses the &struct blkfilter_operations,
+which contains callback functions, unique filter name and module owner. When
+attaching a filter to a block device, the filter creates a &struct blkfilter.
+The pointer to the &struct blkfilter allows the filter to determine for which
+block device the callback functions are being called.
+
+.. kernel-doc:: include/linux/blk-filter.h
+
+.. kernel-doc:: block/blk-filter.c
+ :export:
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index 9fea696f9daa..e9712f72cd6d 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -10,6 +10,7 @@ Block
bfq-iosched
biovecs
blk-mq
+ blkfilter
cmdline-partition
data-integrity
deadline-iosched
diff --git a/MAINTAINERS b/MAINTAINERS
index 97f51d5ec1cf..c20cbec81b58 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3584,6 +3584,12 @@ M: Jan-Simon Moeller <jansimon.moeller@gmx.de>
S: Maintained
F: drivers/leds/leds-blinkm.c
+BLOCK DEVICE FILTERING MECHANISM
+M: Sergei Shtepa <sergei.shtepa@veeam.com>
+L: linux-block@vger.kernel.org
+S: Supported
+F: Documentation/block/blkfilter.rst
+
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
L: linux-block@vger.kernel.org
--
2.20.1
* [PATCH v6 02/11] block: Block Device Filtering Mechanism
2023-11-24 16:04 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 01/11] documentation: Block Device Filtering Mechanism Sergei Shtepa
@ 2023-11-24 16:04 ` Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 03/11] documentation: Block Devices Snapshots Module Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 04/11] blksnap: header file of the module interface Sergei Shtepa
3 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:04 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, viro, brauner,
gregkh, arnd, christian.koenig, yi.l.liu, jirislaby, stfrench,
jpanis, jgg, contact, dchinner, jack, linux, min15.li, dlemoal,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa, Donald Buczek, Fabio Fantoni
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The block device filtering mechanism is an API that allows attaching
block device filters. Block device filters allow performing additional
processing for I/O units.
The idea of handling I/O units on block devices is not new. Back in the
2.6 kernel, there was an undocumented possibility of handling I/O units
by substituting the make_request_fn() function, which belonged to the
request_queue structure. But none of the in-tree kernel modules used
this feature, and it was eliminated in the 5.10 kernel.
The block device filtering mechanism restores the ability to handle I/O
units. It is possible to safely attach a filter to a block device "on the
fly" without changing the structure of the block device's stack.
Co-developed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Donald Buczek <buczek@molgen.mpg.de>
Tested-by: Fabio Fantoni <fantonifabio@tiscali.it>
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
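Not part of the patch: a minimal user-space sketch of how the attach ioctl
added here might be driven. The structures and ioctl numbers are copied from
include/uapi/linux/blk-filter.h and include/uapi/linux/fs.h as proposed in
this patch, because those definitions are not available on systems without
the series. The helper names blkfilter_fill_name() and
blkfilter_attach_by_name() are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#define BLKFILTER_NAME_LENGTH 32

/* Local mirrors of the UAPI structures proposed in this patch. */
struct blkfilter_name {
	uint8_t name[BLKFILTER_NAME_LENGTH];
};

struct blkfilter_ctl {
	uint8_t name[BLKFILTER_NAME_LENGTH];
	uint32_t cmd;
	uint32_t optlen;
	uint64_t opt;
};

/* The layout has no implicit padding: 32 + 4 + 4 + 8 bytes. */
static_assert(sizeof(struct blkfilter_name) == 32, "unexpected name size");
static_assert(sizeof(struct blkfilter_ctl) == 48, "unexpected ctl size");

#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)
#define BLKFILTER_CTL    _IOWR(0x12, 142, struct blkfilter_ctl)

/*
 * Hypothetical helper: copy a filter name into the fixed-size field.
 * The name is kept NUL-terminated, which is stricter than the kernel's
 * strncmp()-based comparison requires.
 */
int blkfilter_fill_name(struct blkfilter_name *dst, const char *name)
{
	size_t len = strlen(name);

	if (len >= BLKFILTER_NAME_LENGTH)
		return -ENAMETOOLONG;
	memset(dst, 0, sizeof(*dst));
	memcpy(dst->name, name, len);
	return 0;
}

/* Hypothetical helper: attach the named filter to an open block device. */
int blkfilter_attach_by_name(int fd, const char *name)
{
	struct blkfilter_name arg;
	int ret = blkfilter_fill_name(&arg, name);

	if (ret)
		return ret;
	if (ioctl(fd, BLKFILTER_ATTACH, &arg) < 0)
		return -errno;
	return 0;
}
```

With a descriptor opened on a block device, blkfilter_attach_by_name(fd,
"blksnap") would be expected to return 0 on success, -ENOENT if no filter of
that name is registered, and -EALREADY or -EBUSY if a filter is already
attached, matching blkfilter_ioctl_attach() in this patch.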
MAINTAINERS | 3 +
block/Makefile | 3 +-
block/bdev.c | 2 +
block/blk-core.c | 35 ++++-
block/blk-filter.c | 238 ++++++++++++++++++++++++++++++++
block/blk.h | 11 ++
block/genhd.c | 10 ++
block/ioctl.c | 7 +
block/partitions/core.c | 9 ++
include/linux/blk-filter.h | 51 +++++++
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 1 +
include/linux/sched.h | 1 +
include/uapi/linux/blk-filter.h | 35 +++++
include/uapi/linux/fs.h | 3 +
15 files changed, 408 insertions(+), 2 deletions(-)
create mode 100644 block/blk-filter.c
create mode 100644 include/linux/blk-filter.h
create mode 100644 include/uapi/linux/blk-filter.h
diff --git a/MAINTAINERS b/MAINTAINERS
index c20cbec81b58..ef90cd0fec9c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3589,6 +3589,9 @@ M: Sergei Shtepa <sergei.shtepa@veeam.com>
L: linux-block@vger.kernel.org
S: Supported
F: Documentation/block/blkfilter.rst
+F: block/blk-filter.c
+F: include/linux/blk-filter.h
+F: include/uapi/linux/blk-filter.h
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..041c54eb0240 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,8 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
- disk-events.o blk-ia-ranges.o early-lookup.o
+ disk-events.o blk-ia-ranges.o early-lookup.o \
+ blk-filter.o
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
diff --git a/block/bdev.c b/block/bdev.c
index e4cfb7adb645..6039d99b3a75 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -412,6 +412,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
return NULL;
}
bdev->bd_disk = disk;
+ bdev->bd_filter = NULL;
return bdev;
}
@@ -1018,6 +1019,7 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise)
}
invalidate_bdev(bdev);
+ blkfilter_detach(bdev);
}
/*
* New drivers should not use this directly. There are some drivers however
diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..1de74240892a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -18,6 +18,7 @@
#include <linux/blkdev.h>
#include <linux/blk-pm.h>
#include <linux/blk-integrity.h>
+#include <linux/blk-filter.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
@@ -592,12 +593,34 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
static void __submit_bio(struct bio *bio)
{
+ struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ bool skip_bio = false;
+
+ if (unlikely(bio_queue_enter(bio)))
+ return;
+
+ if (bio->bi_bdev->bd_filter &&
+ bio->bi_bdev->bd_filter != current->blk_filter) {
+ struct blkfilter *prev = current->blk_filter;
+
+ current->blk_filter = bio->bi_bdev->bd_filter;
+ skip_bio = bio->bi_bdev->bd_filter->ops->submit_bio(bio);
+ current->blk_filter = prev;
+ }
+
+ blk_queue_exit(q);
+ if (skip_bio)
+ return;
+
if (unlikely(!blk_crypto_bio_prep(&bio)))
return;
if (!bio->bi_bdev->bd_has_submit_bio) {
blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
+ return;
+ }
+
+ if (likely(bio_queue_enter(bio) == 0)) {
struct gendisk *disk = bio->bi_bdev->bd_disk;
disk->fops->submit_bio(bio);
@@ -681,6 +704,15 @@ static void __submit_bio_noacct_mq(struct bio *bio)
current->bio_list = NULL;
}
+/**
+ * submit_bio_noacct_nocheck - re-submit a bio to the block device layer for I/O
+ * from block device filter.
+ * @bio: The bio describing the location in memory and on the device.
+ *
+ * This is a version of submit_bio() that shall only be used for I/O that is
+ * resubmitted to lower level by block device filters. All file systems and
+ * other upper level users of the block layer should use submit_bio() instead.
+ */
void submit_bio_noacct_nocheck(struct bio *bio)
{
blk_cgroup_bio_start(bio);
@@ -708,6 +740,7 @@ void submit_bio_noacct_nocheck(struct bio *bio)
else
__submit_bio_noacct(bio);
}
+EXPORT_SYMBOL_GPL(submit_bio_noacct_nocheck);
/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
diff --git a/block/blk-filter.c b/block/blk-filter.c
new file mode 100644
index 000000000000..8e2550bed0c5
--- /dev/null
+++ b/block/blk-filter.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#include <linux/blk-filter.h>
+#include <linux/blk-mq.h>
+#include <linux/module.h>
+
+#include "blk.h"
+
+static LIST_HEAD(blkfilters);
+static DEFINE_SPINLOCK(blkfilters_lock);
+
+static inline struct blkfilter_operations *__blkfilter_find(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ list_for_each_entry(ops, &blkfilters, link)
+ if (strncmp(ops->name, name, BLKFILTER_NAME_LENGTH) == 0)
+ return ops;
+
+ return NULL;
+}
+
+static inline struct blkfilter_operations *blkfilter_find_get(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ spin_lock(&blkfilters_lock);
+ ops = __blkfilter_find(name);
+ if (ops && !try_module_get(ops->owner))
+ ops = NULL;
+ spin_unlock(&blkfilters_lock);
+
+ return ops;
+}
+
+static inline void blkfilter_put(const struct blkfilter_operations *ops)
+{
+ module_put(ops->owner);
+}
+
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ struct blkfilter_operations *ops;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ ops = blkfilter_find_get(name.name);
+ if (!ops)
+ return -ENOENT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ ret = freeze_bdev(bdev);
+ if (ret)
+ goto out_mutex_unlock;
+ blk_mq_freeze_queue(bdev->bd_queue);
+
+ if (bdev->bd_filter) {
+ if (bdev->bd_filter->ops == ops)
+ ret = -EALREADY;
+ else
+ ret = -EBUSY;
+ goto out_unfreeze;
+ }
+
+ flt = ops->attach(bdev);
+ if (IS_ERR(flt)) {
+ ret = PTR_ERR(flt);
+ goto out_unfreeze;
+ }
+
+ flt->ops = ops;
+ bdev->bd_filter = flt;
+
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ thaw_bdev(bdev);
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ if (ret)
+ blkfilter_put(ops);
+ return ret;
+}
+
+static void __blkfilter_detach(struct block_device *bdev)
+{
+ struct blkfilter *flt = bdev->bd_filter;
+ const struct blkfilter_operations *ops = flt->ops;
+
+ bdev->bd_filter = NULL;
+ ops->detach(flt);
+ blkfilter_put(ops);
+}
+
+void blkfilter_detach(struct block_device *bdev)
+{
+ if (bdev->bd_filter) {
+ blk_mq_freeze_queue(bdev->bd_queue);
+ __blkfilter_detach(bdev);
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ }
+}
+
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ int ret = 0;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (!bdev->bd_filter) {
+ ret = -ENOENT;
+ goto out_unfreeze;
+ }
+ if (strncmp(bdev->bd_filter->ops->name, name.name,
+ BLKFILTER_NAME_LENGTH)) {
+ ret = -EINVAL;
+ goto out_unfreeze;
+ }
+
+ __blkfilter_detach(bdev);
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return ret;
+}
+
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp)
+{
+ struct blkfilter_ctl ctl;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&ctl, argp, sizeof(ctl)))
+ return -EFAULT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ ret = blk_queue_enter(bdev_get_queue(bdev), 0);
+ if (ret)
+ goto out_mutex_unlock;
+
+ flt = bdev->bd_filter;
+ if (!flt || strncmp(flt->ops->name, ctl.name, BLKFILTER_NAME_LENGTH)) {
+ ret = -ENOENT;
+ goto out_queue_exit;
+ }
+
+ if (!flt->ops->ctl) {
+ ret = -ENOTTY;
+ goto out_queue_exit;
+ }
+
+ ret = flt->ops->ctl(flt, ctl.cmd, u64_to_user_ptr(ctl.opt),
+ &ctl.optlen);
+out_queue_exit:
+ blk_queue_exit(bdev_get_queue(bdev));
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return ret;
+}
+
+ssize_t blkfilter_show(struct block_device *bdev, char *buf)
+{
+ ssize_t ret = 0;
+
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (bdev->bd_filter)
+ ret = sprintf(buf, "%s\n", bdev->bd_filter->ops->name);
+ else
+ ret = sprintf(buf, "\n");
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+
+ return ret;
+}
+
+/**
+ * blkfilter_register() - Register block device filter operations
+ * @ops: The operations to register.
+ *
+ * Return:
+ * 0 if succeeded,
+ * -EBUSY if a block device filter with the same name is already
+ * registered.
+ */
+int blkfilter_register(struct blkfilter_operations *ops)
+{
+ struct blkfilter_operations *found;
+ int ret = 0;
+
+ spin_lock(&blkfilters_lock);
+ found = __blkfilter_find(ops->name);
+ if (found)
+ ret = -EBUSY;
+ else
+ list_add_tail(&ops->link, &blkfilters);
+ spin_unlock(&blkfilters_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blkfilter_register);
+
+/**
+ * blkfilter_unregister() - Unregister block device filter operations
+ * @ops: The operations to unregister.
+ *
+ * Important: before unloading, it is necessary to detach the filter from all
+ * block devices.
+ *
+ */
+void blkfilter_unregister(struct blkfilter_operations *ops)
+{
+ spin_lock(&blkfilters_lock);
+ list_del(&ops->link);
+ spin_unlock(&blkfilters_lock);
+}
+EXPORT_SYMBOL_GPL(blkfilter_unregister);
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..1f104f4865c3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,8 @@
#include <xen/xen.h>
#include "blk-crypto-internal.h"
+struct blkfilter_ctl;
+struct blkfilter_name;
struct elevator_type;
/* Max future timer expiry for timeouts */
@@ -474,6 +476,15 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
extern const struct address_space_operations def_blk_aops;
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp);
+void blkfilter_detach(struct block_device *bdev);
+ssize_t blkfilter_show(struct block_device *bdev, char *buf);
+
int disk_register_independent_access_ranges(struct gendisk *disk);
void disk_unregister_independent_access_ranges(struct gendisk *disk);
diff --git a/block/genhd.c b/block/genhd.c
index c9d06f72c587..ba744e3fd581 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -26,6 +26,7 @@
#include <linux/badblocks.h>
#include <linux/part_stat.h>
#include <linux/blktrace_api.h>
+#include <linux/blk-filter.h>
#include "blk-throttle.h"
#include "blk.h"
@@ -654,6 +655,7 @@ void del_gendisk(struct gendisk *disk)
mutex_lock(&disk->open_mutex);
xa_for_each(&disk->part_tbl, idx, part)
remove_inode_hash(part->bd_inode);
+ blkfilter_detach(disk->part0);
mutex_unlock(&disk->open_mutex);
/*
@@ -1044,6 +1046,12 @@ static ssize_t diskseq_show(struct device *dev,
return sprintf(buf, "%llu\n", disk->diskseq);
}
+static ssize_t disk_filter_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
static DEVICE_ATTR(range, 0444, disk_range_show, NULL);
static DEVICE_ATTR(ext_range, 0444, disk_ext_range_show, NULL);
static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL);
@@ -1057,6 +1065,7 @@ static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store);
static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL);
+static DEVICE_ATTR(filter, 0444, disk_filter_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
ssize_t part_fail_show(struct device *dev,
@@ -1103,6 +1112,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_events_async.attr,
&dev_attr_events_poll_msecs.attr,
&dev_attr_diskseq.attr,
+ &dev_attr_filter.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/block/ioctl.c b/block/ioctl.c
index 4160f4e6bd5b..1b11303e213b 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -2,6 +2,7 @@
#include <linux/capability.h>
#include <linux/compat.h>
#include <linux/blkdev.h>
+#include <linux/blk-filter.h>
#include <linux/export.h>
#include <linux/gfp.h>
#include <linux/blkpg.h>
@@ -572,6 +573,12 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
return blkdev_pr_preempt(bdev, mode, argp, true);
case IOC_PR_CLEAR:
return blkdev_pr_clear(bdev, mode, argp);
+ case BLKFILTER_ATTACH:
+ return blkfilter_ioctl_attach(bdev, argp);
+ case BLKFILTER_DETACH:
+ return blkfilter_ioctl_detach(bdev, argp);
+ case BLKFILTER_CTL:
+ return blkfilter_ioctl_ctl(bdev, argp);
default:
return -ENOIOCTLCMD;
}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index f47ffcfdfcec..19c69dc23d2c 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -10,6 +10,7 @@
#include <linux/ctype.h>
#include <linux/vmalloc.h>
#include <linux/raid/detect.h>
+#include <linux/blk-filter.h>
#include "check.h"
static int (*const check_part[])(struct parsed_partitions *) = {
@@ -200,6 +201,12 @@ static ssize_t part_discard_alignment_show(struct device *dev,
return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
}
+static ssize_t part_filter_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
static DEVICE_ATTR(start, 0444, part_start_show, NULL);
static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -208,6 +215,7 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(filter, 0444, part_filter_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -222,6 +230,7 @@ static struct attribute *part_attrs[] = {
&dev_attr_discard_alignment.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+ &dev_attr_filter.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
new file mode 100644
index 000000000000..0afdb40f3bab
--- /dev/null
+++ b/include/linux/blk-filter.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _LINUX_BLK_FILTER_H
+#define _LINUX_BLK_FILTER_H
+
+#include <uapi/linux/blk-filter.h>
+
+struct bio;
+struct block_device;
+struct blkfilter_operations;
+
+/**
+ * struct blkfilter - Block device filter.
+ *
+ * @ops: Block device filter operations.
+ *
+ * For each filtered block device, the filter creates a data structure
+ * associated with this device. Its contents are specific to the filter, but
+ * it must contain a &struct blkfilter, whose @ops is set on attach.
+ */
+struct blkfilter {
+ const struct blkfilter_operations *ops;
+};
+
+/**
+ * struct blkfilter_operations - Block device filter operations.
+ *
+ * @link: Entry in the global list of filter drivers
+ * (must not be accessed by the driver).
+ * @owner: Module implementing the filter driver.
+ * @name: Name of the filter driver.
+ * @attach: Attach the filter driver to the block device.
+ * @detach: Detach the filter driver from the block device.
+ * @ctl: Send a control command to the filter driver.
+ * @submit_bio: Handle bio submissions to the filter driver.
+ */
+struct blkfilter_operations {
+ struct list_head link;
+ struct module *owner;
+ const char *name;
+ struct blkfilter *(*attach)(struct block_device *bdev);
+ void (*detach)(struct blkfilter *flt);
+ int (*ctl)(struct blkfilter *flt, const unsigned int cmd,
+ __u8 __user *buf, __u32 *plen);
+ bool (*submit_bio)(struct bio *bio);
+};
+
+int blkfilter_register(struct blkfilter_operations *ops);
+void blkfilter_unregister(struct blkfilter_operations *ops);
+
+#endif /* _LINUX_BLK_FILTER_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d5c5e59ddbd2..490865292fde 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -74,6 +74,7 @@ struct block_device {
* path
*/
struct device bd_device;
+ struct blkfilter *bd_filter;
} __randomize_layout;
#define bdev_whole(_bdev) \
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..6a0754007d1d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -834,6 +834,7 @@ void blk_request_module(dev_t devt);
extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
+void submit_bio_noacct_nocheck(struct bio *bio);
void submit_bio_noacct(struct bio *bio);
struct bio *bio_split_to_limits(struct bio *bio);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..e7c3cd490a80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1190,6 +1190,7 @@ struct task_struct {
/* Stack plugging: */
struct blk_plug *plug;
+ struct blkfilter *blk_filter;
/* VM state: */
struct reclaim_state *reclaim_state;
diff --git a/include/uapi/linux/blk-filter.h b/include/uapi/linux/blk-filter.h
new file mode 100644
index 000000000000..18885dc1b717
--- /dev/null
+++ b/include/uapi/linux/blk-filter.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _UAPI_LINUX_BLK_FILTER_H
+#define _UAPI_LINUX_BLK_FILTER_H
+
+#include <linux/types.h>
+
+#define BLKFILTER_NAME_LENGTH 32
+
+/**
+ * struct blkfilter_name - parameter for BLKFILTER_ATTACH and BLKFILTER_DETACH
+ * ioctl.
+ *
+ * @name: Name of block device filter.
+ */
+struct blkfilter_name {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+};
+
+/**
+ * struct blkfilter_ctl - parameter for BLKFILTER_CTL ioctl
+ *
+ * @name: Name of block device filter.
+ * @cmd: The filter-specific operation code of the command.
+ * @optlen: Size of data at @opt.
+ * @opt: Userspace buffer with options.
+ */
+struct blkfilter_ctl {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+ __u32 cmd;
+ __u32 optlen;
+ __u64 opt;
+};
+
+#endif /* _UAPI_LINUX_BLK_FILTER_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..f96809cd2f50 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -189,6 +189,9 @@ struct fsxattr {
* A jump here: 130-136 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
*/
+#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
+#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)
+#define BLKFILTER_CTL _IOWR(0x12, 142, struct blkfilter_ctl)
#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
#define FIBMAP _IO(0x00,1) /* bmap access */
--
2.20.1
* [PATCH v6 03/11] documentation: Block Devices Snapshots Module
2023-11-24 16:04 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 01/11] documentation: Block Device Filtering Mechanism Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 02/11] block: " Sergei Shtepa
@ 2023-11-24 16:04 ` Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 04/11] blksnap: header file of the module interface Sergei Shtepa
3 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:04 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, viro, brauner,
gregkh, arnd, christian.koenig, yi.l.liu, jirislaby, stfrench,
jpanis, jgg, contact, dchinner, jack, linux, min15.li, dlemoal,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa, Bagas Sanjaya, Fabio Fantoni
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The document contains:
* A description of the purpose of the mechanism
* A description of its features
* A description of its algorithms
* Recommendations on using the module from user space
* A reference to the module interface description
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Fabio Fantoni <fantonifabio@tiscali.it>
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
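Not part of the patch: a small user-space sketch of how the tracking block
size selection described under "Change tracking" in the document could work.
It is an assumption based on that description alone, not the module's code.
The function names tracking_block_shift() and tracking_block_count() and the
parameter values in the note after the code are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of tracking blocks needed to cover the device, rounded up. */
uint64_t tracking_block_count(uint64_t dev_bytes, unsigned int shift)
{
	uint64_t mask = (1ULL << shift) - 1;

	return (dev_bytes >> shift) + ((dev_bytes & mask) != 0);
}

/*
 * Hypothetical sketch: start from the minimum block size and double it
 * until the block count fits into max_count or the block size reaches
 * the maximum. Block sizes are powers of two, expressed as shifts.
 */
unsigned int tracking_block_shift(uint64_t dev_bytes, unsigned int min_shift,
				  unsigned int max_shift, uint64_t max_count)
{
	unsigned int shift = min_shift;

	assert(min_shift <= max_shift && max_shift < 64);

	while (shift < max_shift &&
	       tracking_block_count(dev_bytes, shift) > max_count)
		shift++;

	/* If max_shift was reached, the count may still exceed max_count. */
	return shift;
}
```

Under these assumed values (minimum shift 16, maximum shift 26, at most
2097152 blocks), a 1 TiB device would get 512 KiB tracking blocks, and with
one byte per block its change map would occupy 2 MiB. A 1 PiB device would
hit the maximum shift and end up with more blocks than the configured
maximum count, which is the case the document describes.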
Documentation/block/blksnap.rst | 352 ++++++++++++++++++++++++++++++++
Documentation/block/index.rst | 1 +
MAINTAINERS | 6 +
3 files changed, 359 insertions(+)
create mode 100644 Documentation/block/blksnap.rst
diff --git a/Documentation/block/blksnap.rst b/Documentation/block/blksnap.rst
new file mode 100644
index 000000000000..ef6010e46858
--- /dev/null
+++ b/Documentation/block/blksnap.rst
@@ -0,0 +1,352 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+Block Devices Snapshots Module (blksnap)
+========================================
+
+Introduction
+============
+
+At first glance, there is no novelty in the idea of creating snapshots for
+block devices. The Linux kernel already has mechanisms for creating snapshots.
+Device Mapper includes dm-snap, which allows creating snapshots of block
+devices. BTRFS supports snapshots at the filesystem level. However, both of
+these options have flaws that prevent their use as a universal tool for
+creating backups.
+
+The main properties that a backup tool should have are:
+
+- Simplicity and universality of use
+- Reliability
+- Minimal consumption of system resources during backup
+- Minimal time required for recovery or replication of the entire system
+
+Taking the above properties into account, the blksnap module features:
+
+- Change tracker
+- Snapshots at the block device level
+- Dynamic allocation of space for storing differences
+- Snapshot overflow resistance
+- Coherent snapshot of multiple block devices
+
+Features
+========
+
+Change tracker
+--------------
+
+The change tracker determines which blocks were changed during the time
+between the last snapshot created and any of the previous snapshots.
+With a map of changes, it is enough to copy only the changed blocks, with no
+need to reread the entire block device. The change tracker makes it possible
+to implement the logic of both incremental and differential backups.
+Incremental backup is critical for large file repositories whose size can be
+hundreds of terabytes and whose full backup can take more than a day.
+On such servers, the use of backup tools without a change tracker becomes
+practically impossible.
+
+Snapshot at the block device level
+----------------------------------
+
+A snapshot at the block device level simplifies the backup algorithm and
+reduces consumption of system resources. It also allows reading the disk
+space linearly and directly, which achieves maximum reading speed with
+minimal use of processor time. At the same time, snapshots can be created
+for any block device, regardless of the file system located on it. The
+exceptions are BTRFS, ZFS and cluster file systems.
+
+Dynamic allocation of storage space for differences
+---------------------------------------------------
+
+To store differences, the module does not require pre-reserved space on
+a filesystem. The space for storing differences can be allocated in a file in
+any filesystem. In addition, the size of the difference storage can be
+increased after the snapshot is created, but only for a filesystem that
+supports fallocate. A shared difference storage for all images of snapshot
+block devices optimizes the use of storage space. However, there is one
+limitation. A snapshot cannot be taken from a block device on which the
+difference storage is located.
+
+Snapshot overflow resistance
+----------------------------
+
+To create images of snapshots of block devices, the module stores blocks
+of the original block device that have been changed since the snapshot
+was taken. To do this, the module handles write requests and reads blocks
+that need to be overwritten. This algorithm guarantees safety of the data
+of the original block device in the event of an overflow of the snapshot,
+and even in the case of unpredictable critical errors. If a problem occurs
+during backup, the difference storage is released, the snapshot is closed,
+no backup is created, but the server continues to work.
+
+Coherent snapshot of multiple block devices
+-------------------------------------------
+
+A snapshot is created simultaneously for all block devices for which a backup
+is being created, ensuring their coherent state.
+
+
+Algorithms
+==========
+
+Overview
+--------
+
+The blksnap module is a block-level filter. It handles all write I/O units.
+The filter is attached to the block device when the snapshot is created
+for the first time. The change tracker marks all overwritten blocks.
+Information about the history of changes on the block device is available
+while holding the snapshot. The module reads the blocks that need to be
+overwritten and stores them in the difference storage. When reading from
+a snapshot image, reading is performed either from the original device or
+from the difference storage.
+
+Change tracking
+---------------
+
+A change tracker map is created for each block device. One byte of this map
+corresponds to one block. The block size is set by the
+``tracking_block_minimum_shift`` and ``tracking_block_maximum_count``
+module parameters. The ``tracking_block_minimum_shift`` parameter limits
+the minimum block size for tracking, while ``tracking_block_maximum_count``
+defines the maximum allowed number of blocks. The size of the change tracker
+block is determined from the size of the block device when a tracking device
+is added, that is, when the snapshot is taken for the first time.
+The block size must be a power of two. The ``tracking_block_maximum_shift``
+module parameter limits the maximum block size for tracking. If the
+block size reaches this limit, the number of blocks will exceed the
+``tracking_block_maximum_count`` parameter.
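
The interaction of the three parameters above can be sketched in user-space C. This is only an illustration under stated assumptions: the helper names `pick_block_shift()` and `block_count()` are hypothetical, the numeric values in the usage note are not the module defaults, and the real module may compute the size differently.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of blocks needed to cover the device
 * when one block is (1 << shift) bytes, rounding up.
 */
static uint64_t block_count(uint64_t capacity_bytes, unsigned int shift)
{
    return (capacity_bytes + (1ULL << shift) - 1) >> shift;
}

/*
 * Hypothetical helper: start from the minimum shift and double the block
 * size until the block count fits into max_count or the shift reaches
 * max_shift. The result is always a power-of-two block size.
 */
static unsigned int pick_block_shift(uint64_t capacity_bytes,
                                     unsigned int min_shift,
                                     unsigned int max_shift,
                                     uint64_t max_count)
{
    unsigned int shift = min_shift;

    while (shift < max_shift &&
           block_count(capacity_bytes, shift) > max_count)
        shift++;
    return shift;
}
```

For a 1 TiB device with a minimum shift of 16 and a maximum count of 2097152, this sketch settles on a shift of 19 (512 KiB blocks). If the maximum shift were 18, the loop would stop there and the block count would exceed the maximum count, which matches the behaviour described in the paragraph above.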
+
+Each byte of the change map stores a number from 0 to 255. This is the
+number of the snapshot since whose creation the block has been changed.
+Each time a snapshot is created, the number of the current
+snapshot is increased by one. This number is written to the cell of the
+change map when writing to the block. Thus, knowing the number of one of
+the previous snapshots and the number of the last snapshot, one can determine
+from the change map which blocks have been changed. When the number of the
+current snapshot reaches the maximum value allowed for the map, 255, the
+change map is reset to zero when the next snapshot is created,
+and the number of the current snapshot is set to 1. The change
+tracker is reset, and a new UUID is generated: a unique identifier of the
+snapshot generation. The generation identifier makes it possible to detect
+that a change tracking reset has been performed.
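
A minimal sketch of how a backup tool might interpret the change map, written as plain user-space C. The function names are hypothetical, and the strict comparison (a cell greater than the previous snapshot number means "changed") is an assumption about the numbering convention rather than something this document states.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GENERATION_ID_SIZE 16 /* a UUID is 16 bytes */

/*
 * Hypothetical helper: if the generation identifier differs from the one
 * saved with the previous backup, the tracker was reset and the change
 * map cannot be compared against the old snapshot number.
 */
static int tracking_was_reset(const uint8_t prev_gen[GENERATION_ID_SIZE],
                              const uint8_t cur_gen[GENERATION_ID_SIZE])
{
    return memcmp(prev_gen, cur_gen, GENERATION_ID_SIZE) != 0;
}

/*
 * Hypothetical helper: counts the blocks changed after the snapshot with
 * number prev_number. Assumption: a cell value greater than prev_number
 * marks a block written since that snapshot was taken.
 */
static size_t count_changed_blocks(const uint8_t *cbt_map, size_t block_count,
                                   uint8_t prev_number)
{
    size_t changed = 0;

    for (size_t i = 0; i < block_count; i++)
        if (cbt_map[i] > prev_number)
            changed++;
    return changed;
}
```

A tool would first compare generation identifiers. Only if they match is it safe to use the saved snapshot number for an incremental pass; otherwise every block has to be treated as changed.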
+
+The change map has two copies. One copy is active and tracks the current
+changes on the block device. The second copy is available for reading
+while the snapshot is being held, and contains the history up to the moment
+the snapshot is taken. The copies are synchronized at the moment of snapshot
+creation. After the snapshot is released, the second copy of the map is not
+needed, but it is not freed, so that memory does not have to be allocated
+for it again the next time a snapshot is created.
+
+Copy on write
+-------------
+
+Data is copied in blocks, or rather in chunks. The term "chunk" is used to
+avoid confusion with change tracker blocks and I/O blocks. In addition,
+the "chunk" in the blksnap module means about the same as the "chunk" in
+the dm-snap module.
+
+The size of the chunk is determined by the ``chunk_minimum_shift`` and
+``chunk_maximum_count`` module parameters. The ``chunk_minimum_shift``
+parameter limits the minimum size of the chunk, while ``chunk_maximum_count``
+defines the maximum allowed number of chunks. The size of the chunk is
+determined from the size of the block device at the time of taking the
+snapshot. The size of the chunk must be a power of two. The module parameter
+``chunk_maximum_shift`` limits the maximum chunk size. If the chunk
+size reaches this limit, the number of chunks will exceed the
+``chunk_maximum_count`` parameter.
+
+One chunk is described by the ``struct chunk`` structure. A map of these
+structures is created for each block device. The structure contains all the
+information needed to copy the chunk's data from the original block device to
+the difference storage. This information also describes the snapshot image.
+The structure contains a semaphore, which synchronizes the threads
+accessing the chunk.
+
+The block layer in Linux has a peculiarity. If a read I/O unit is submitted
+and a write I/O unit is submitted after it, the write may be performed first,
+and only then the read. Therefore, the copy-on-write algorithm is executed
+synchronously: when a write request is handled, the execution of this I/O unit
+is delayed until the chunks to be overwritten are read from the original device
+for later storing to the difference storage. But if, when handling a write I/O
+unit, it turns out that the written range of sectors has already been prepared for
+storing to the difference storage, then the I/O unit is simply passed.
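
The decision described above can be modelled in a few lines of user-space C. This is a simplified sketch, not the module's code: `enum chunk_state` and `cow_prepare_write()` are hypothetical names, the state is reduced to two values, and the actual read from the original device is only indicated by a comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-chunk state; the real struct chunk holds much more. */
enum chunk_state {
    CHUNK_NOT_COPIED = 0, /* original data not yet preserved */
    CHUNK_COPIED,         /* original data already queued or stored */
};

/*
 * Called before a write of `length` bytes at byte `offset` is passed on.
 * Marks every touched chunk that still needs copying and returns how many
 * whole chunks had to be copied. A return value of 0 means the write can
 * simply be passed through.
 */
static size_t cow_prepare_write(enum chunk_state *chunks, size_t nr_chunks,
                                unsigned int chunk_shift,
                                uint64_t offset, uint64_t length)
{
    size_t copied = 0;

    if (length == 0)
        return 0;

    uint64_t first = offset >> chunk_shift;
    uint64_t last = (offset + length - 1) >> chunk_shift;

    for (uint64_t i = first; i <= last && i < nr_chunks; i++) {
        if (chunks[i] == CHUNK_NOT_COPIED) {
            /* The real module reads the whole chunk from the original
             * device here, before the write is allowed to proceed. */
            chunks[i] = CHUNK_COPIED;
            copied++;
        }
    }
    return copied;
}
```

With 4 KiB chunks, a 200-byte write at offset 4000 crosses a chunk boundary and causes two whole chunks to be copied; repeating the same write copies nothing. The sketch also shows why small random writes can fill the difference storage quickly: the unit of copying is always a whole chunk.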
+
+This algorithm makes it possible to efficiently back up even systems
+with round-robin databases (RRD). Such databases can be overwritten several
+times during the system backup. Of course, the value of a backup of RRD
+monitoring system data can be questioned. However, it is often necessary to
+back up the entire enterprise infrastructure in order to restore or replicate
+it entirely in case of problems.
+
+The algorithm also has a flaw. When at least one sector is overwritten,
+an entire chunk is copied. Thus, the difference storage can fill up rapidly
+when data is written to a block device in small portions in random order.
+This can happen when data on the filesystem is strongly fragmented.
+But it must be borne in mind that with such data
+fragmentation, performance of systems usually degrades greatly. So, this
+problem does not occur on real servers, although it can easily be created
+by artificial tests.
+
+Difference storage
+------------------
+
+The difference storage can be a block device or a file on a
+filesystem. Using a block device gives slightly higher performance,
+but in this case, the block device is used by the kernel module exclusively.
+Usually the disk space is partitioned so that no free space is left
+for backup purposes. Using a file makes it possible to place the difference
+storage on a filesystem.
+
+The difference storage can be expanded while the snapshot is being held,
+but only if the filesystem supports fallocate(). If the free space in the
+difference storage drops below half of the value of the module parameter
+``diff_storage_minimum``, then the kernel module can expand the difference
+storage file within the specified limit. This limit is set when creating a
+snapshot.
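
The expansion rule can be written down as a small sketch. The helper names are hypothetical and all sizes are in sectors. The assumption that the storage grows by one ``diff_storage_minimum`` portion at a time comes from the comment on IOCTL_BLKSNAP_SNAPSHOT_CREATE in the module header, not from this section.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when free space has dropped below half of
 * diff_storage_minimum and the storage has not yet reached its limit.
 */
static bool diff_storage_needs_expansion(uint64_t capacity_sect,
                                         uint64_t filled_sect,
                                         uint64_t minimum_sect,
                                         uint64_t limit_sect)
{
    uint64_t free_sect =
        filled_sect < capacity_sect ? capacity_sect - filled_sect : 0;

    return free_sect < minimum_sect / 2 && capacity_sect < limit_sect;
}

/*
 * Hypothetical helper: the capacity after growing by one portion of
 * minimum_sect sectors, clamped to the limit set at snapshot creation.
 */
static uint64_t diff_storage_next_capacity(uint64_t capacity_sect,
                                           uint64_t minimum_sect,
                                           uint64_t limit_sect)
{
    uint64_t next = capacity_sect + minimum_sect;

    return next > limit_sect ? limit_sect : next;
}
```

Once the capacity equals the limit, no further expansion is possible in this model, and the snapshot overflows when the remaining free space is used up.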
+
+If free space in the difference storage runs out, an event about the snapshot
+overflow is generated for user space. Such a snapshot is considered
+corrupted, and read I/O units to snapshot images will be terminated with an
+error code. The difference storage holds only the outdated data required for
+snapshot images, so when the snapshot overflows, the backup process is
+interrupted, but the system remains operational without data loss.
+
+The difference storage has a limitation. The device on which the difference
+storage is located cannot be added to the snapshot. In this case, the difference
+storage can be located in virtual memory, which consists of RAM and a swap
+partition (or file). To do this, it is enough to use a file in /dev/shm, or a
+new tmpfs filesystem can be created for this purpose. Obviously, this variant
+can be useful if the system has a lot of RAM or a large swap. The good news is
+that the modern Linux kernel allows the size of the swap file to be increased
+"on the fly" without changing the system configuration.
+
+A regular file or a block device file for the difference storage must be opened
+with the O_EXCL flag. If an unnamed file with the O_TMPFILE flag is created,
+then such a file will be automatically released when the snapshot is destroyed.
+In addition, the use of an unnamed temporary file ensures that no one can open
+this file and read its contents.
+
+Performing I/O for a snapshot image
+-----------------------------------
+
+To read snapshot data, block devices of snapshot images are created when
+the snapshot is taken. The snapshot image block devices support the write
+operation. This makes it possible to perform additional data preparation on
+the filesystem before creating a backup.
+
+To process the I/O unit, clones of the I/O unit are created, which redirect
+the I/O unit either to the original block device or to the difference storage.
+When processing of cloned I/O units is completed, the original I/O unit is
+marked as completed too.
+
+An I/O unit can be partially processed without accessing block devices if
+the I/O unit refers to a chunk that is in the queue for storing to the
+difference storage. In this case, the data is read from or written to a buffer
+in memory.
+
+If, when processing a write I/O unit, it turns out that the data of the
+referenced chunk has not yet been stored to the difference storage or has not
+even been read from the original device, then an I/O unit to read data from the
+original device is initiated first. After the read from the original device
+completes, the data from the write I/O unit partially overwrites
+the buffer of the chunk in memory, and the chunk is scheduled to be saved to the
+difference storage.
+
+How to use
+==========
+
+Depending on the needs and the selected license, you can choose different
+options for managing the module:
+
+- Using ioctl directly
+- Using a static C++ library
+- Using the blksnap console tool
+
+Using BLKFILTER_CTL for a block device
+--------------------------------------
+
+BLKFILTER_CTL sends a filter-specific command to the filter on a block
+device and returns the result of its execution. The module provides the
+``include/uapi/linux/blksnap.h`` header file with a description of the commands
+and their data structures.
+
+1. ``blkfilter_ctl_blksnap_cbtinfo`` gets information from the
+   change tracker.
+2. ``blkfilter_ctl_blksnap_cbtmap`` reads the change tracker table. If a write
+   operation was performed for the snapshot, then the change tracker takes this
+   into account. Therefore, it is necessary to receive tracker data after write
+   operations have been completed.
+3. ``blkfilter_ctl_blksnap_cbtdirty`` marks blocks as changed in the change
+   tracker table. This is necessary if post-processing that changes the backup
+   blocks is performed after the backup is created.
+4. ``blkfilter_ctl_blksnap_snapshotadd`` adds a block device to the snapshot.
+5. ``blkfilter_ctl_blksnap_snapshotinfo`` gets the name of the snapshot
+   image block device and the presence of an error.
+
+Using ioctl
+-----------
+
+The BLKFILTER_CTL ioctl alone is not enough to fully manage
+the blksnap module. A control file ``blksnap-control`` is created to manage
+snapshots. The control commands are also described in the file
+``include/uapi/linux/blksnap.h``.
+
+1. ``blksnap_ioctl_version`` gets the version number.
+2. ``blksnap_ioctl_snapshot_create`` initiates the snapshot creation process.
+3. ``blksnap_ioctl_snapshot_take`` creates block devices of block device
+   snapshot images.
+4. ``blksnap_ioctl_snapshot_collect`` collects all created snapshots.
+5. ``blksnap_ioctl_snapshot_wait_event`` tracks the status of
+   snapshots and receives events about the requirement to expand the difference
+   storage or about snapshot overflow.
+6. ``blksnap_ioctl_snapshot_destroy`` releases the snapshot.
+
+Static C++ library
+------------------
+
+The [#userspace_libs]_ library was created primarily to simplify creation of
+tests in C++, and it is also a good example of using the module interface.
+When creating applications, direct use of control calls is preferable.
+However, the library can be used in an application with a GPL-2+ license,
+or a library with an LGPL-2+ license can be created, with which even a
+proprietary application can be dynamically linked.
+
+blksnap console tool
+--------------------
+
+The blksnap [#userspace_tools]_ console tool controls the module from
+the command line. The tool contains detailed built-in help. To get the list of
+commands with usage descriptions, run ``blksnap --help``. The ``blksnap
+<command name> --help`` command prints detailed information about the
+parameters of each command. This option may be convenient when creating
+proprietary software, as it avoids compiling against the open source code.
+At the same time, the blksnap tool can be used for creating backup scripts.
+For example, rsync can be called to synchronize files on the filesystem of
+the mounted snapshot image and files in the archive on a filesystem that
+supports compression.
+
+Tests
+-----
+
+A set of tests was created for regression testing [#userspace_tests]_.
+Tests with simple algorithms that use the ``blksnap`` console tool to
+control the module are written in Bash. More complex testing algorithms
+are implemented in C++.
+
+References
+==========
+
+.. [#userspace_libs] https://github.com/veeam/blksnap/tree/stable-v2.0/lib
+
+.. [#userspace_tools] https://github.com/veeam/blksnap/tree/stable-v2.0/tools
+
+.. [#userspace_tests] https://github.com/veeam/blksnap/tree/stable-v2.0/tests
+
+Module interface description
+============================
+
+.. kernel-doc:: include/uapi/linux/blksnap.h
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index e9712f72cd6d..696ff150c6b7 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -11,6 +11,7 @@ Block
biovecs
blk-mq
blkfilter
+ blksnap
cmdline-partition
data-integrity
deadline-iosched
diff --git a/MAINTAINERS b/MAINTAINERS
index ef90cd0fec9c..9c81e4c83139 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3593,6 +3593,12 @@ F: block/blk-filter.c
F: include/linux/blk-filter.h
F: include/uapi/linux/blk-filter.h
+BLOCK DEVICE SNAPSHOTS MODULE
+M: Sergei Shtepa <sergei.shtepa@veeam.com>
+L: linux-block@vger.kernel.org
+S: Supported
+F: Documentation/block/blksnap.rst
+
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
L: linux-block@vger.kernel.org
--
2.20.1
* [PATCH v6 04/11] blksnap: header file of the module interface
2023-11-24 16:04 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
` (2 preceding siblings ...)
2023-11-24 16:04 ` [PATCH v6 03/11] documentation: Block Devices Snapshots Module Sergei Shtepa
@ 2023-11-24 16:04 ` Sergei Shtepa
3 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:04 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, viro, brauner,
gregkh, arnd, christian.koenig, yi.l.liu, jirislaby, stfrench,
jpanis, jgg, contact, dchinner, jack, linux, min15.li, dlemoal,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa, Donald Buczek
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The header file contains a set of declarations, structures and control
requests (ioctl) that allow the module to be managed from user space.
Co-developed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 1 +
include/uapi/linux/blksnap.h | 388 ++++++++++++++++++
3 files changed, 390 insertions(+)
create mode 100644 include/uapi/linux/blksnap.h
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 4ea5b837399a..81acae1b1859 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -203,6 +203,7 @@ Code Seq# Include File Comments
'V' C0 linux/ivtvfb.h conflict!
'V' C0 linux/ivtv.h conflict!
'V' C0 media/si4713.h conflict!
+'V' 00-1F uapi/linux/blksnap.h conflict!
'W' 00-1F linux/watchdog.h conflict!
'W' 00-1F linux/wanrouter.h conflict! (pre 3.9)
'W' 00-3F sound/asound.h conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index 9c81e4c83139..9770c4d4b15d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3598,6 +3598,7 @@ M: Sergei Shtepa <sergei.shtepa@veeam.com>
L: linux-block@vger.kernel.org
S: Supported
F: Documentation/block/blksnap.rst
+F: include/uapi/linux/blksnap.h
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
diff --git a/include/uapi/linux/blksnap.h b/include/uapi/linux/blksnap.h
new file mode 100644
index 000000000000..be1474f2025c
--- /dev/null
+++ b/include/uapi/linux/blksnap.h
@@ -0,0 +1,388 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _UAPI_LINUX_BLKSNAP_H
+#define _UAPI_LINUX_BLKSNAP_H
+
+#include <linux/types.h>
+
+#define BLKSNAP_CTL "blksnap-control"
+#define BLKSNAP_IMAGE_NAME "blksnap-image"
+#define BLKSNAP 'V'
+
+/**
+ * DOC: Block device filter interface.
+ *
+ * Control commands that are transmitted through the block device filter
+ * interface.
+ */
+
+/**
+ * enum blkfilter_ctl_blksnap - List of commands for BLKFILTER_CTL ioctl
+ *
+ * @blkfilter_ctl_blksnap_cbtinfo:
+ * Get CBT information.
+ * The result of executing the command is a &struct blksnap_cbtinfo.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_cbtmap:
+ * Read the CBT map.
+ * The option passes the &struct blksnap_cbtmap.
+ * The size of the table can be quite large. Thus, the table is read in
+ * a loop, in each cycle of which the next offset is set to
+ * &blksnap_cbtmap.offset.
+ * Return a count of bytes read if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_cbtdirty:
+ * Set dirty blocks in the CBT map.
+ * The option passes the &struct blksnap_cbtdirty.
+ * There are cases when some blocks need to be marked as changed.
+ * This command makes that possible.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_snapshotadd:
+ * Add device to snapshot.
+ * The option passes the &struct blksnap_snapshotadd.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_snapshotinfo:
+ * Get information about snapshot.
+ * The result of executing the command is a &struct blksnap_snapshotinfo.
+ * Return 0 if succeeded, negative errno otherwise.
+ */
+enum blkfilter_ctl_blksnap {
+ blkfilter_ctl_blksnap_cbtinfo,
+ blkfilter_ctl_blksnap_cbtmap,
+ blkfilter_ctl_blksnap_cbtdirty,
+ blkfilter_ctl_blksnap_snapshotadd,
+ blkfilter_ctl_blksnap_snapshotinfo,
+};
+
+#ifndef UUID_SIZE
+#define UUID_SIZE 16
+#endif
+
+/**
+ * struct blksnap_uuid - Unique 16-byte identifier.
+ *
+ * @b:
+ * An array of 16 bytes.
+ */
+struct blksnap_uuid {
+ __u8 b[UUID_SIZE];
+};
+
+/**
+ * struct blksnap_cbtinfo - Result for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtinfo.
+ *
+ * @device_capacity:
+ * Device capacity in bytes.
+ * @block_size:
+ * Block size in bytes.
+ * @block_count:
+ * Number of blocks.
+ * @generation_id:
+ * Unique identifier of change tracking generation.
+ * @changes_number:
+ * Current changes number.
+ */
+struct blksnap_cbtinfo {
+ __u64 device_capacity;
+ __u32 block_size;
+ __u32 block_count;
+ struct blksnap_uuid generation_id;
+ __u8 changes_number;
+};
+
+/**
+ * struct blksnap_cbtmap - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtmap.
+ *
+ * @offset:
+ * Offset from the beginning of the CBT map in bytes.
+ * @length:
+ * Size of @buffer in bytes.
+ * @buffer:
+ * Pointer to the buffer for output.
+ */
+struct blksnap_cbtmap {
+ __u32 offset;
+ __u32 length;
+ __u64 buffer;
+};
+
+/**
+ * struct blksnap_sectors - Description of the block device region.
+ *
+ * @offset:
+ * Offset from the beginning of the disk in sectors.
+ * @count:
+ * Count of sectors.
+ */
+struct blksnap_sectors {
+ __u64 offset;
+ __u64 count;
+};
+
+/**
+ * struct blksnap_cbtdirty - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtdirty.
+ *
+ * @count:
+ * Count of elements in the @dirty_sectors.
+ * @dirty_sectors:
+ * Pointer to the array of &struct blksnap_sectors.
+ */
+struct blksnap_cbtdirty {
+ __u32 count;
+ __u64 dirty_sectors;
+};
+
+/**
+ * struct blksnap_snapshotadd - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotadd.
+ *
+ * @id:
+ * ID of the snapshot to which the block device should be added.
+ */
+struct blksnap_snapshotadd {
+ struct blksnap_uuid id;
+};
+
+#define IMAGE_DISK_NAME_LEN 32
+
+/**
+ * struct blksnap_snapshotinfo - Result for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotinfo.
+ *
+ * @error_code:
+ * Zero if there were no errors while holding the snapshot.
+ * The error code -ENOSPC means that while holding the snapshot, a snapshot
+ * overflow situation has occurred. Other error codes mean other reasons
+ * for failure.
+ * The error code is reset when the device is added to a new snapshot.
+ * @image:
+ * If the snapshot was taken, it stores the block device name of the
+ * image, or empty string otherwise.
+ */
+struct blksnap_snapshotinfo {
+ __s32 error_code;
+ __u8 image[IMAGE_DISK_NAME_LEN];
+};
+
+/**
+ * DOC: Interface for managing snapshots
+ *
+ * Control commands that are transmitted through the blksnap module interface.
+ */
+enum blksnap_ioctl {
+ blksnap_ioctl_version,
+ blksnap_ioctl_snapshot_create,
+ blksnap_ioctl_snapshot_destroy,
+ blksnap_ioctl_snapshot_take,
+ blksnap_ioctl_snapshot_collect,
+ blksnap_ioctl_snapshot_wait_event,
+};
+
+/**
+ * struct blksnap_version - Module version.
+ *
+ * @major:
+ * Version major part.
+ * @minor:
+ * Version minor part.
+ * @revision:
+ * Revision number.
+ * @build:
+ * Build number. Should be zero.
+ */
+struct blksnap_version {
+ __u16 major;
+ __u16 minor;
+ __u16 revision;
+ __u16 build;
+};
+
+/**
+ * define IOCTL_BLKSNAP_VERSION - Get module version.
+ *
+ * The version may increase when the API changes. But linking the user space
+ * behavior to the version code does not seem to be a good idea.
+ * To ensure backward compatibility, API changes should be made by adding new
+ * ioctl without changing the behavior of existing ones. The version should be
+ * used for logs.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_VERSION \
+ _IOR(BLKSNAP, blksnap_ioctl_version, struct blksnap_version)
+
+/**
+ * struct blksnap_snapshot_create - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_CREATE control.
+ *
+ * @diff_storage_limit_sect:
+ * The maximum allowed difference storage size in sectors.
+ * @diff_storage_fd:
+ * The difference storage file descriptor.
+ * @id:
+ * Generated new snapshot ID.
+ */
+struct blksnap_snapshot_create {
+ __u64 diff_storage_limit_sect;
+ __u32 diff_storage_fd;
+ struct blksnap_uuid id;
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_CREATE - Create snapshot.
+ *
+ * Creates a snapshot structure and initializes the difference storage.
+ * A snapshot is created for several block devices at once. Several snapshots
+ * can be created at the same time, but with the condition that one block
+ * device can only be included in one snapshot.
+ *
+ * The difference storage can be dynamically increased as it fills up.
+ * The file is increased in portions, the size of which is determined by the
+ * module parameter &diff_storage_minimum. Each time the amount of free space
+ * in the difference storage is reduced to the half of &diff_storage_minimum,
+ * the file is expanded by a portion, until it reaches the allowable limit
+ * &diff_storage_limit_sect.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_CREATE \
+ _IOWR(BLKSNAP, blksnap_ioctl_snapshot_create, \
+ struct blksnap_snapshot_create)
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_DESTROY - Release and destroy the snapshot.
+ *
+ * Destroys the snapshot with the ID passed in &struct blksnap_uuid. This leads
+ * to the deletion of all block device images of the snapshot. The difference
+ * storage is released, but the change tracker keeps tracking.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_DESTROY \
+ _IOW(BLKSNAP, blksnap_ioctl_snapshot_destroy, \
+ struct blksnap_uuid)
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_TAKE - Take snapshot.
+ *
+ * Creates snapshot images of block devices and switches change trackers tables.
+ * The snapshot must be created before this call, and the areas of block
+ * devices should be added to the difference storage.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_TAKE \
+ _IOW(BLKSNAP, blksnap_ioctl_snapshot_take, \
+ struct blksnap_uuid)
+
+/**
+ * struct blksnap_snapshot_collect - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_COLLECT control.
+ *
+ * @count:
+ * Size of &blksnap_snapshot_collect.ids in the number of 16-byte UUID.
+ * @ids:
+ * Pointer to the array of struct blksnap_uuid for output.
+ */
+struct blksnap_snapshot_collect {
+ __u32 count;
+ __u64 ids;
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_COLLECT - Get collection of created snapshots.
+ *
+ * Multiple snapshots can be created at the same time. This allows one
+ * system to create backups of different data on independent schedules.
+ *
+ * If &blksnap_snapshot_collect.count is less than required to store the
+ * &blksnap_snapshot_collect.ids, the array is not filled, and the ioctl
+ * returns the required count for &blksnap_snapshot_collect.ids.
+ *
+ * So, it is recommended to call the ioctl twice. The first call passes a null
+ * pointer in &blksnap_snapshot_collect.ids and a zero value in
+ * &blksnap_snapshot_collect.count. It sets the required array size in
+ * &blksnap_snapshot_collect.count. The second call passes a pointer in
+ * &blksnap_snapshot_collect.ids to an array of the required size and
+ * gets the collection of active snapshots.
+ *
+ * Return: 0 if succeeded, -ENODATA if there is not enough space in the array
+ * to store collection of active snapshots, or negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_COLLECT \
+ _IOR(BLKSNAP, blksnap_ioctl_snapshot_collect, \
+ struct blksnap_snapshot_collect)
+
+/**
+ * enum blksnap_event_codes - Variants of event codes.
+ *
+ * @blksnap_event_code_corrupted:
+ * Snapshot image is corrupted event.
+ * If a chunk could not be allocated when trying to save data to the
+ * difference storage, this event is generated. However, this does not mean
+ * that the backup process was interrupted with an error. If the snapshot
+ * image has been read to the end by this time, the backup process is
+ * considered successful.
+ */
+enum blksnap_event_codes {
+ blksnap_event_code_corrupted,
+};
+
+/**
+ * struct blksnap_snapshot_event - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT control.
+ *
+ * @id:
+ * Snapshot ID.
+ * @timeout_ms:
+ * Timeout for waiting in milliseconds.
+ * @time_label:
+ * Timestamp of the received event.
+ * @code:
+ * Code of the received event &enum blksnap_event_codes.
+ * @data:
+ * The received event body.
+ */
+struct blksnap_snapshot_event {
+ struct blksnap_uuid id;
+ __u32 timeout_ms;
+ __u32 code;
+ __s64 time_label;
+ __u8 data[4096 - 32];
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT - Wait and get the event from the
+ * snapshot.
+ *
+ * While holding the snapshot, the kernel module can transmit information about
+ * changes in its state in the form of events to the user level.
+ * It is very important to receive these events as quickly as possible, so the
+ * user's thread is in the state of interruptible sleep.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT \
+ _IOR(BLKSNAP, blksnap_ioctl_snapshot_wait_event, \
+ struct blksnap_snapshot_event)
+
+/**
+ * struct blksnap_event_corrupted - Data for the
+ * &blksnap_event_code_corrupted event.
+ *
+ * @dev_id_mj:
+ * Major part of original device ID.
+ * @dev_id_mn:
+ * Minor part of original device ID.
+ * @err_code:
+ * Error code.
+ */
+struct blksnap_event_corrupted {
+ __u32 dev_id_mj;
+ __u32 dev_id_mn;
+ __s32 err_code;
+};
+
+#endif /* _UAPI_LINUX_BLKSNAP_H */
--
2.20.1
* [PATCH v6 02/11] block: Block Device Filtering Mechanism
2023-11-24 16:38 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
@ 2023-11-24 16:38 ` Sergei Shtepa
0 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:38 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, mgorman, vschneid,
viro, brauner, linux-block, linux-doc, linux-kernel,
linux-fsdevel, Sergei Shtepa, Donald Buczek, Fabio Fantoni
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The block device filtering mechanism is an API that allows block device
filters to be attached. Block device filters allow additional processing
to be performed for I/O units.
The idea of handling I/O units on block devices is not new. Back in the
2.6 kernel, there was an undocumented possibility of handling I/O units
by substituting the make_request_fn() function, which belonged to the
request_queue structure. But none of the in-tree kernel modules used
this feature, and it was eliminated in the 5.10 kernel.
The block device filtering mechanism restores the ability to handle I/O
units. It is possible to safely attach a filter to a block device "on the
fly" without changing the structure of the block device stack.
Co-developed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Donald Buczek <buczek@molgen.mpg.de>
Tested-by: Fabio Fantoni <fantonifabio@tiscali.it>
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
MAINTAINERS | 3 +
block/Makefile | 3 +-
block/bdev.c | 2 +
block/blk-core.c | 35 ++++-
block/blk-filter.c | 238 ++++++++++++++++++++++++++++++++
block/blk.h | 11 ++
block/genhd.c | 10 ++
block/ioctl.c | 7 +
block/partitions/core.c | 9 ++
include/linux/blk-filter.h | 51 +++++++
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 1 +
include/linux/sched.h | 1 +
include/uapi/linux/blk-filter.h | 35 +++++
include/uapi/linux/fs.h | 3 +
15 files changed, 408 insertions(+), 2 deletions(-)
create mode 100644 block/blk-filter.c
create mode 100644 include/linux/blk-filter.h
create mode 100644 include/uapi/linux/blk-filter.h
diff --git a/MAINTAINERS b/MAINTAINERS
index c20cbec81b58..ef90cd0fec9c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3589,6 +3589,9 @@ M: Sergei Shtepa <sergei.shtepa@veeam.com>
L: linux-block@vger.kernel.org
S: Supported
F: Documentation/block/blkfilter.rst
+F: block/blk-filter.c
+F: include/linux/blk-filter.h
+F: include/uapi/linux/blk-filter.h
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..041c54eb0240 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,8 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
- disk-events.o blk-ia-ranges.o early-lookup.o
+ disk-events.o blk-ia-ranges.o early-lookup.o \
+ blk-filter.o
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
diff --git a/block/bdev.c b/block/bdev.c
index e4cfb7adb645..6039d99b3a75 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -412,6 +412,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
return NULL;
}
bdev->bd_disk = disk;
+ bdev->bd_filter = NULL;
return bdev;
}
@@ -1018,6 +1019,7 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise)
}
invalidate_bdev(bdev);
+ blkfilter_detach(bdev);
}
/*
* New drivers should not use this directly. There are some drivers however
diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..1de74240892a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -18,6 +18,7 @@
#include <linux/blkdev.h>
#include <linux/blk-pm.h>
#include <linux/blk-integrity.h>
+#include <linux/blk-filter.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
@@ -592,12 +593,34 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
static void __submit_bio(struct bio *bio)
{
+ struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ bool skip_bio = false;
+
+ if (unlikely(bio_queue_enter(bio)))
+ return;
+
+ if (bio->bi_bdev->bd_filter &&
+ bio->bi_bdev->bd_filter != current->blk_filter) {
+ struct blkfilter *prev = current->blk_filter;
+
+ current->blk_filter = bio->bi_bdev->bd_filter;
+ skip_bio = bio->bi_bdev->bd_filter->ops->submit_bio(bio);
+ current->blk_filter = prev;
+ }
+
+ blk_queue_exit(q);
+ if (skip_bio)
+ return;
+
if (unlikely(!blk_crypto_bio_prep(&bio)))
return;
if (!bio->bi_bdev->bd_has_submit_bio) {
blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
+ return;
+ }
+
+ if (likely(bio_queue_enter(bio) == 0)) {
struct gendisk *disk = bio->bi_bdev->bd_disk;
disk->fops->submit_bio(bio);
@@ -681,6 +704,15 @@ static void __submit_bio_noacct_mq(struct bio *bio)
current->bio_list = NULL;
}
+/**
+ * submit_bio_noacct_nocheck - re-submit a bio to the block device layer for I/O
+ * from block device filter.
+ * @bio: The bio describing the location in memory and on the device.
+ *
+ * This is a version of submit_bio() that shall only be used for I/O that is
+ * resubmitted to lower level by block device filters. All file systems and
+ * other upper level users of the block layer should use submit_bio() instead.
+ */
void submit_bio_noacct_nocheck(struct bio *bio)
{
blk_cgroup_bio_start(bio);
@@ -708,6 +740,7 @@ void submit_bio_noacct_nocheck(struct bio *bio)
else
__submit_bio_noacct(bio);
}
+EXPORT_SYMBOL_GPL(submit_bio_noacct_nocheck);
/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
diff --git a/block/blk-filter.c b/block/blk-filter.c
new file mode 100644
index 000000000000..8e2550bed0c5
--- /dev/null
+++ b/block/blk-filter.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#include <linux/blk-filter.h>
+#include <linux/blk-mq.h>
+#include <linux/module.h>
+
+#include "blk.h"
+
+static LIST_HEAD(blkfilters);
+static DEFINE_SPINLOCK(blkfilters_lock);
+
+static inline struct blkfilter_operations *__blkfilter_find(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ list_for_each_entry(ops, &blkfilters, link)
+ if (strncmp(ops->name, name, BLKFILTER_NAME_LENGTH) == 0)
+ return ops;
+
+ return NULL;
+}
+
+static inline struct blkfilter_operations *blkfilter_find_get(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ spin_lock(&blkfilters_lock);
+ ops = __blkfilter_find(name);
+ if (ops && !try_module_get(ops->owner))
+ ops = NULL;
+ spin_unlock(&blkfilters_lock);
+
+ return ops;
+}
+
+static inline void blkfilter_put(const struct blkfilter_operations *ops)
+{
+ module_put(ops->owner);
+}
+
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ struct blkfilter_operations *ops;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ ops = blkfilter_find_get(name.name);
+ if (!ops)
+ return -ENOENT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ ret = freeze_bdev(bdev);
+ if (ret)
+ goto out_mutex_unlock;
+ blk_mq_freeze_queue(bdev->bd_queue);
+
+ if (bdev->bd_filter) {
+ if (bdev->bd_filter->ops == ops)
+ ret = -EALREADY;
+ else
+ ret = -EBUSY;
+ goto out_unfreeze;
+ }
+
+ flt = ops->attach(bdev);
+ if (IS_ERR(flt)) {
+ ret = PTR_ERR(flt);
+ goto out_unfreeze;
+ }
+
+ flt->ops = ops;
+ bdev->bd_filter = flt;
+
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ thaw_bdev(bdev);
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ if (ret)
+ blkfilter_put(ops);
+ return ret;
+}
+
+static void __blkfilter_detach(struct block_device *bdev)
+{
+ struct blkfilter *flt = bdev->bd_filter;
+ const struct blkfilter_operations *ops = flt->ops;
+
+ bdev->bd_filter = NULL;
+ ops->detach(flt);
+ blkfilter_put(ops);
+}
+
+void blkfilter_detach(struct block_device *bdev)
+{
+ if (bdev->bd_filter) {
+ blk_mq_freeze_queue(bdev->bd_queue);
+ __blkfilter_detach(bdev);
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ }
+}
+
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ int ret = 0;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (!bdev->bd_filter) {
+ ret = -ENOENT;
+ goto out_unfreeze;
+ }
+ if (strncmp(bdev->bd_filter->ops->name, name.name,
+ BLKFILTER_NAME_LENGTH)) {
+ ret = -EINVAL;
+ goto out_unfreeze;
+ }
+
+ __blkfilter_detach(bdev);
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return ret;
+}
+
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp)
+{
+ struct blkfilter_ctl ctl;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&ctl, argp, sizeof(ctl)))
+ return -EFAULT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ ret = blk_queue_enter(bdev_get_queue(bdev), 0);
+ if (ret)
+ goto out_mutex_unlock;
+
+ flt = bdev->bd_filter;
+ if (!flt || strncmp(flt->ops->name, ctl.name, BLKFILTER_NAME_LENGTH)) {
+ ret = -ENOENT;
+ goto out_queue_exit;
+ }
+
+ if (!flt->ops->ctl) {
+ ret = -ENOTTY;
+ goto out_queue_exit;
+ }
+
+ ret = flt->ops->ctl(flt, ctl.cmd, u64_to_user_ptr(ctl.opt),
+ &ctl.optlen);
+out_queue_exit:
+ blk_queue_exit(bdev_get_queue(bdev));
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return ret;
+}
+
+ssize_t blkfilter_show(struct block_device *bdev, char *buf)
+{
+ ssize_t ret = 0;
+
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (bdev->bd_filter)
+ ret = sprintf(buf, "%s\n", bdev->bd_filter->ops->name);
+ else
+ ret = sprintf(buf, "\n");
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+
+ return ret;
+}
+
+/**
+ * blkfilter_register() - Register block device filter operations
+ * @ops: The operations to register.
+ *
+ * Return:
+ * 0 if succeeded,
+ * -EBUSY if a block device filter with the same name is already
+ * registered.
+ */
+int blkfilter_register(struct blkfilter_operations *ops)
+{
+ struct blkfilter_operations *found;
+ int ret = 0;
+
+ spin_lock(&blkfilters_lock);
+ found = __blkfilter_find(ops->name);
+ if (found)
+ ret = -EBUSY;
+ else
+ list_add_tail(&ops->link, &blkfilters);
+ spin_unlock(&blkfilters_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blkfilter_register);
+
+/**
+ * blkfilter_unregister() - Unregister block device filter operations
+ * @ops: The operations to unregister.
+ *
+ * Important: before unloading, it is necessary to detach the filter from all
+ * block devices.
+ *
+ */
+void blkfilter_unregister(struct blkfilter_operations *ops)
+{
+ spin_lock(&blkfilters_lock);
+ list_del(&ops->link);
+ spin_unlock(&blkfilters_lock);
+}
+EXPORT_SYMBOL_GPL(blkfilter_unregister);
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..1f104f4865c3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,8 @@
#include <xen/xen.h>
#include "blk-crypto-internal.h"
+struct blkfilter_ctl;
+struct blkfilter_name;
struct elevator_type;
/* Max future timer expiry for timeouts */
@@ -474,6 +476,15 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
extern const struct address_space_operations def_blk_aops;
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp);
+void blkfilter_detach(struct block_device *bdev);
+ssize_t blkfilter_show(struct block_device *bdev, char *buf);
+
int disk_register_independent_access_ranges(struct gendisk *disk);
void disk_unregister_independent_access_ranges(struct gendisk *disk);
diff --git a/block/genhd.c b/block/genhd.c
index c9d06f72c587..ba744e3fd581 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -26,6 +26,7 @@
#include <linux/badblocks.h>
#include <linux/part_stat.h>
#include <linux/blktrace_api.h>
+#include <linux/blk-filter.h>
#include "blk-throttle.h"
#include "blk.h"
@@ -654,6 +655,7 @@ void del_gendisk(struct gendisk *disk)
mutex_lock(&disk->open_mutex);
xa_for_each(&disk->part_tbl, idx, part)
remove_inode_hash(part->bd_inode);
+ blkfilter_detach(disk->part0);
mutex_unlock(&disk->open_mutex);
/*
@@ -1044,6 +1046,12 @@ static ssize_t diskseq_show(struct device *dev,
return sprintf(buf, "%llu\n", disk->diskseq);
}
+static ssize_t disk_filter_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
static DEVICE_ATTR(range, 0444, disk_range_show, NULL);
static DEVICE_ATTR(ext_range, 0444, disk_ext_range_show, NULL);
static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL);
@@ -1057,6 +1065,7 @@ static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store);
static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL);
+static DEVICE_ATTR(filter, 0444, disk_filter_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
ssize_t part_fail_show(struct device *dev,
@@ -1103,6 +1112,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_events_async.attr,
&dev_attr_events_poll_msecs.attr,
&dev_attr_diskseq.attr,
+ &dev_attr_filter.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/block/ioctl.c b/block/ioctl.c
index 4160f4e6bd5b..1b11303e213b 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -2,6 +2,7 @@
#include <linux/capability.h>
#include <linux/compat.h>
#include <linux/blkdev.h>
+#include <linux/blk-filter.h>
#include <linux/export.h>
#include <linux/gfp.h>
#include <linux/blkpg.h>
@@ -572,6 +573,12 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
return blkdev_pr_preempt(bdev, mode, argp, true);
case IOC_PR_CLEAR:
return blkdev_pr_clear(bdev, mode, argp);
+ case BLKFILTER_ATTACH:
+ return blkfilter_ioctl_attach(bdev, argp);
+ case BLKFILTER_DETACH:
+ return blkfilter_ioctl_detach(bdev, argp);
+ case BLKFILTER_CTL:
+ return blkfilter_ioctl_ctl(bdev, argp);
default:
return -ENOIOCTLCMD;
}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index f47ffcfdfcec..19c69dc23d2c 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -10,6 +10,7 @@
#include <linux/ctype.h>
#include <linux/vmalloc.h>
#include <linux/raid/detect.h>
+#include <linux/blk-filter.h>
#include "check.h"
static int (*const check_part[])(struct parsed_partitions *) = {
@@ -200,6 +201,12 @@ static ssize_t part_discard_alignment_show(struct device *dev,
return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
}
+static ssize_t part_filter_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
static DEVICE_ATTR(start, 0444, part_start_show, NULL);
static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -208,6 +215,7 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(filter, 0444, part_filter_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -222,6 +230,7 @@ static struct attribute *part_attrs[] = {
&dev_attr_discard_alignment.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+ &dev_attr_filter.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
new file mode 100644
index 000000000000..0afdb40f3bab
--- /dev/null
+++ b/include/linux/blk-filter.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _LINUX_BLK_FILTER_H
+#define _LINUX_BLK_FILTER_H
+
+#include <uapi/linux/blk-filter.h>
+
+struct bio;
+struct block_device;
+struct blkfilter_operations;
+
+/**
+ * struct blkfilter - Block device filter.
+ *
+ * @ops: Block device filter operations.
+ *
+ * For each filtered block device, the filter creates a data structure
+ * associated with this device. Its contents are filter-specific, but it must
+ * contain this struct blkfilter; the block layer sets @ops on attach.
+ */
+struct blkfilter {
+ const struct blkfilter_operations *ops;
+};
+
+/**
+ * struct blkfilter_operations - Block device filter operations.
+ *
+ * @link: Entry in the global list of filter drivers
+ * (must not be accessed by the driver).
+ * @owner: Module implementing the filter driver.
+ * @name: Name of the filter driver.
+ * @attach: Attach the filter driver to the block device.
+ * @detach: Detach the filter driver from the block device.
+ * @ctl: Send a control command to the filter driver.
+ * @submit_bio: Handle bio submissions to the filter driver.
+ */
+struct blkfilter_operations {
+ struct list_head link;
+ struct module *owner;
+ const char *name;
+ struct blkfilter *(*attach)(struct block_device *bdev);
+ void (*detach)(struct blkfilter *flt);
+ int (*ctl)(struct blkfilter *flt, const unsigned int cmd,
+ __u8 __user *buf, __u32 *plen);
+ bool (*submit_bio)(struct bio *bio);
+};
+
+int blkfilter_register(struct blkfilter_operations *ops);
+void blkfilter_unregister(struct blkfilter_operations *ops);
+
+#endif /* _LINUX_BLK_FILTER_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d5c5e59ddbd2..490865292fde 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -74,6 +74,7 @@ struct block_device {
* path
*/
struct device bd_device;
+ struct blkfilter *bd_filter;
} __randomize_layout;
#define bdev_whole(_bdev) \
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..6a0754007d1d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -834,6 +834,7 @@ void blk_request_module(dev_t devt);
extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
+void submit_bio_noacct_nocheck(struct bio *bio);
void submit_bio_noacct(struct bio *bio);
struct bio *bio_split_to_limits(struct bio *bio);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..e7c3cd490a80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1190,6 +1190,7 @@ struct task_struct {
/* Stack plugging: */
struct blk_plug *plug;
+ struct blkfilter *blk_filter;
/* VM state: */
struct reclaim_state *reclaim_state;
diff --git a/include/uapi/linux/blk-filter.h b/include/uapi/linux/blk-filter.h
new file mode 100644
index 000000000000..18885dc1b717
--- /dev/null
+++ b/include/uapi/linux/blk-filter.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _UAPI_LINUX_BLK_FILTER_H
+#define _UAPI_LINUX_BLK_FILTER_H
+
+#include <linux/types.h>
+
+#define BLKFILTER_NAME_LENGTH 32
+
+/**
+ * struct blkfilter_name - parameter for BLKFILTER_ATTACH and BLKFILTER_DETACH
+ * ioctl.
+ *
+ * @name: Name of block device filter.
+ */
+struct blkfilter_name {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+};
+
+/**
+ * struct blkfilter_ctl - parameter for BLKFILTER_CTL ioctl
+ *
+ * @name: Name of block device filter.
+ * @cmd: The filter-specific operation code of the command.
+ * @optlen: Size of data at @opt.
+ * @opt: Userspace buffer with options.
+ */
+struct blkfilter_ctl {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+ __u32 cmd;
+ __u32 optlen;
+ __u64 opt;
+};
+
+#endif /* _UAPI_LINUX_BLK_FILTER_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..f96809cd2f50 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -189,6 +189,9 @@ struct fsxattr {
* A jump here: 130-136 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
*/
+#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
+#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)
+#define BLKFILTER_CTL _IOWR(0x12, 142, struct blkfilter_ctl)
#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
#define FIBMAP _IO(0x00,1) /* bmap access */
--
2.20.1
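The check against current->blk_filter in the __submit_bio() hunk of this
patch can be modelled in userspace. The sketch below is only an
illustration under simplifying assumptions: the structures are toy
stand-ins, current_filter models current->blk_filter, and
model_submit_bio(), resubmitting_filter() and passthrough_filter() are
hypothetical names that do not exist in the kernel. The model recurses
directly and ignores the kernel's bio_list queueing, queue reference
counting and blk-crypto handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct bio;

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct blkfilter {
	bool (*submit_bio)(struct bio *bio);	/* kernel: flt->ops->submit_bio */
	int seen;				/* bios handled by the callback */
};

struct block_device {
	struct blkfilter *bd_filter;
	int delivered;				/* bios that reached the "driver" */
};

struct bio {
	struct block_device *bi_bdev;
};

/* Models current->blk_filter: the filter whose callback is running now. */
struct blkfilter *current_filter;

/* Hypothetical model of the filter dispatch added to __submit_bio(). */
void model_submit_bio(struct bio *bio)
{
	assert(bio != NULL && bio->bi_bdev != NULL);

	struct blkfilter *flt = bio->bi_bdev->bd_filter;
	bool skip_bio = false;

	/* Call the filter unless we are already inside its callback. */
	if (flt && flt != current_filter) {
		struct blkfilter *prev = current_filter;

		current_filter = flt;
		skip_bio = flt->submit_bio(bio);
		current_filter = prev;
	}

	if (skip_bio)
		return;		/* the filter took ownership of the bio */

	bio->bi_bdev->delivered++;	/* stands in for the driver submission */
}

/* Example filter: counts the bio, resubmits it itself, then claims it. */
bool resubmitting_filter(struct bio *bio)
{
	bio->bi_bdev->bd_filter->seen++;
	model_submit_bio(bio);	/* re-entry: the guard bypasses this filter */
	return true;
}

/* Example filter: counts the bio and lets the caller submit it. */
bool passthrough_filter(struct bio *bio)
{
	bio->bi_bdev->bd_filter->seen++;
	return false;
}
```

In this model a bio resubmitted from inside the callback is seen by the
filter once and delivered once, and current_filter is restored to its
previous value when the callback returns.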
* [PATCH v6 02/11] block: Block Device Filtering Mechanism
2023-11-24 16:59 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
@ 2023-11-24 16:59 ` Sergei Shtepa
2023-12-07 7:44 ` Christoph Hellwig
0 siblings, 1 reply; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:59 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, viro, brauner, linux-block, linux-doc,
linux-kernel, linux-fsdevel, Sergei Shtepa, Donald Buczek,
Fabio Fantoni
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The block device filtering mechanism is an API for attaching block
device filters. Block device filters perform additional processing of
I/O units.
The idea of handling I/O units on block devices is not new. Back in the
2.6 kernel, there was an undocumented possibility of handling I/O units
by substituting the make_request_fn() function, which belonged to the
request_queue structure. But none of the in-tree kernel modules used
this feature, and it was eliminated in the 5.10 kernel.
The block device filtering mechanism restores the ability to handle I/O
units. A filter can be safely attached to a block device "on the fly"
without changing the structure of the block device stack.
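As a rough illustration of attaching "on the fly", a userspace caller
might fill in struct blkfilter_name and issue BLKFILTER_ATTACH as
sketched below. This is a sketch under assumptions, not part of the
patch: the structure and ioctl numbers are local copies of the uapi
definitions this patch proposes (no installed header is assumed to
provide them), and blkfilter_name_init() and blkfilter_attach_fd() are
hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>

#define BLKFILTER_NAME_LENGTH 32

/* Local copy of the structure proposed in include/uapi/linux/blk-filter.h
 * (the patch declares the field as __u8). */
struct blkfilter_name {
	unsigned char name[BLKFILTER_NAME_LENGTH];
};

/* Local copies of the ioctl numbers proposed in include/uapi/linux/fs.h. */
#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)

/*
 * Hypothetical helper: zero-fill the fixed-size field and copy the name.
 * Returns 0 on success, or -1 with errno set to ENAMETOOLONG when the
 * name plus its terminating NUL does not fit.
 */
int blkfilter_name_init(struct blkfilter_name *arg, const char *name)
{
	assert(arg != NULL && name != NULL);

	size_t len = strlen(name);

	if (len >= BLKFILTER_NAME_LENGTH) {
		errno = ENAMETOOLONG;
		return -1;
	}
	memset(arg, 0, sizeof(*arg));
	memcpy(arg->name, name, len);
	return 0;
}

/*
 * Hypothetical helper: ask the kernel to attach the named filter to the
 * block device behind fd. Returns the ioctl() result (-1 on error).
 */
int blkfilter_attach_fd(int fd, const char *name)
{
	struct blkfilter_name arg;

	if (blkfilter_name_init(&arg, name))
		return -1;
	return ioctl(fd, BLKFILTER_ATTACH, &arg);
}
```

A caller would open the block device node and pass the descriptor
together with the name under which the filter module registered itself.
Going by blkfilter_ioctl_attach() in this patch, the call fails with
ENOENT if no such filter is registered, EALREADY if the same filter is
already attached, and EBUSY if a different one is.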
Co-developed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Donald Buczek <buczek@molgen.mpg.de>
Tested-by: Fabio Fantoni <fantonifabio@tiscali.it>
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
MAINTAINERS | 3 +
block/Makefile | 3 +-
block/bdev.c | 2 +
block/blk-core.c | 35 ++++-
block/blk-filter.c | 238 ++++++++++++++++++++++++++++++++
block/blk.h | 11 ++
block/genhd.c | 10 ++
block/ioctl.c | 7 +
block/partitions/core.c | 9 ++
include/linux/blk-filter.h | 51 +++++++
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 1 +
include/linux/sched.h | 1 +
include/uapi/linux/blk-filter.h | 35 +++++
include/uapi/linux/fs.h | 3 +
15 files changed, 408 insertions(+), 2 deletions(-)
create mode 100644 block/blk-filter.c
create mode 100644 include/linux/blk-filter.h
create mode 100644 include/uapi/linux/blk-filter.h
diff --git a/MAINTAINERS b/MAINTAINERS
index c20cbec81b58..ef90cd0fec9c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3589,6 +3589,9 @@ M: Sergei Shtepa <sergei.shtepa@veeam.com>
L: linux-block@vger.kernel.org
S: Supported
F: Documentation/block/blkfilter.rst
+F: block/blk-filter.c
+F: include/linux/blk-filter.h
+F: include/uapi/linux/blk-filter.h
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..041c54eb0240 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,8 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
- disk-events.o blk-ia-ranges.o early-lookup.o
+ disk-events.o blk-ia-ranges.o early-lookup.o \
+ blk-filter.o
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
diff --git a/block/bdev.c b/block/bdev.c
index e4cfb7adb645..6039d99b3a75 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -412,6 +412,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
return NULL;
}
bdev->bd_disk = disk;
+ bdev->bd_filter = NULL;
return bdev;
}
@@ -1018,6 +1019,7 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise)
}
invalidate_bdev(bdev);
+ blkfilter_detach(bdev);
}
/*
* New drivers should not use this directly. There are some drivers however
diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..1de74240892a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -18,6 +18,7 @@
#include <linux/blkdev.h>
#include <linux/blk-pm.h>
#include <linux/blk-integrity.h>
+#include <linux/blk-filter.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
@@ -592,12 +593,34 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
static void __submit_bio(struct bio *bio)
{
+ struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ bool skip_bio = false;
+
+ if (unlikely(bio_queue_enter(bio)))
+ return;
+
+ if (bio->bi_bdev->bd_filter &&
+ bio->bi_bdev->bd_filter != current->blk_filter) {
+ struct blkfilter *prev = current->blk_filter;
+
+ current->blk_filter = bio->bi_bdev->bd_filter;
+ skip_bio = bio->bi_bdev->bd_filter->ops->submit_bio(bio);
+ current->blk_filter = prev;
+ }
+
+ blk_queue_exit(q);
+ if (skip_bio)
+ return;
+
if (unlikely(!blk_crypto_bio_prep(&bio)))
return;
if (!bio->bi_bdev->bd_has_submit_bio) {
blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
+ return;
+ }
+
+ if (likely(bio_queue_enter(bio) == 0)) {
struct gendisk *disk = bio->bi_bdev->bd_disk;
disk->fops->submit_bio(bio);
@@ -681,6 +704,15 @@ static void __submit_bio_noacct_mq(struct bio *bio)
current->bio_list = NULL;
}
+/**
+ * submit_bio_noacct_nocheck - re-submit a bio to the block device layer for I/O
+ * from block device filter.
+ * @bio: The bio describing the location in memory and on the device.
+ *
+ * This is a version of submit_bio() that shall only be used for I/O that is
+ * resubmitted to lower level by block device filters. All file systems and
+ * other upper level users of the block layer should use submit_bio() instead.
+ */
void submit_bio_noacct_nocheck(struct bio *bio)
{
blk_cgroup_bio_start(bio);
@@ -708,6 +740,7 @@ void submit_bio_noacct_nocheck(struct bio *bio)
else
__submit_bio_noacct(bio);
}
+EXPORT_SYMBOL_GPL(submit_bio_noacct_nocheck);
/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
diff --git a/block/blk-filter.c b/block/blk-filter.c
new file mode 100644
index 000000000000..8e2550bed0c5
--- /dev/null
+++ b/block/blk-filter.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#include <linux/blk-filter.h>
+#include <linux/blk-mq.h>
+#include <linux/module.h>
+
+#include "blk.h"
+
+static LIST_HEAD(blkfilters);
+static DEFINE_SPINLOCK(blkfilters_lock);
+
+static inline struct blkfilter_operations *__blkfilter_find(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ list_for_each_entry(ops, &blkfilters, link)
+ if (strncmp(ops->name, name, BLKFILTER_NAME_LENGTH) == 0)
+ return ops;
+
+ return NULL;
+}
+
+static inline struct blkfilter_operations *blkfilter_find_get(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ spin_lock(&blkfilters_lock);
+ ops = __blkfilter_find(name);
+ if (ops && !try_module_get(ops->owner))
+ ops = NULL;
+ spin_unlock(&blkfilters_lock);
+
+ return ops;
+}
+
+static inline void blkfilter_put(const struct blkfilter_operations *ops)
+{
+ module_put(ops->owner);
+}
+
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ struct blkfilter_operations *ops;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ ops = blkfilter_find_get(name.name);
+ if (!ops)
+ return -ENOENT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ ret = freeze_bdev(bdev);
+ if (ret)
+ goto out_mutex_unlock;
+ blk_mq_freeze_queue(bdev->bd_queue);
+
+ if (bdev->bd_filter) {
+ if (bdev->bd_filter->ops == ops)
+ ret = -EALREADY;
+ else
+ ret = -EBUSY;
+ goto out_unfreeze;
+ }
+
+ flt = ops->attach(bdev);
+ if (IS_ERR(flt)) {
+ ret = PTR_ERR(flt);
+ goto out_unfreeze;
+ }
+
+ flt->ops = ops;
+ bdev->bd_filter = flt;
+
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ thaw_bdev(bdev);
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ if (ret)
+ blkfilter_put(ops);
+ return ret;
+}
+
+static void __blkfilter_detach(struct block_device *bdev)
+{
+ struct blkfilter *flt = bdev->bd_filter;
+ const struct blkfilter_operations *ops = flt->ops;
+
+ bdev->bd_filter = NULL;
+ ops->detach(flt);
+ blkfilter_put(ops);
+}
+
+void blkfilter_detach(struct block_device *bdev)
+{
+ if (bdev->bd_filter) {
+ blk_mq_freeze_queue(bdev->bd_queue);
+ __blkfilter_detach(bdev);
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ }
+}
+
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ int ret = 0;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (!bdev->bd_filter) {
+ ret = -ENOENT;
+ goto out_unfreeze;
+ }
+ if (strncmp(bdev->bd_filter->ops->name, name.name,
+ BLKFILTER_NAME_LENGTH)) {
+ ret = -EINVAL;
+ goto out_unfreeze;
+ }
+
+ __blkfilter_detach(bdev);
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return ret;
+}
+
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp)
+{
+ struct blkfilter_ctl ctl;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&ctl, argp, sizeof(ctl)))
+ return -EFAULT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ ret = blk_queue_enter(bdev_get_queue(bdev), 0);
+ if (ret)
+ goto out_mutex_unlock;
+
+ flt = bdev->bd_filter;
+ if (!flt || strncmp(flt->ops->name, ctl.name, BLKFILTER_NAME_LENGTH)) {
+ ret = -ENOENT;
+ goto out_queue_exit;
+ }
+
+ if (!flt->ops->ctl) {
+ ret = -ENOTTY;
+ goto out_queue_exit;
+ }
+
+ ret = flt->ops->ctl(flt, ctl.cmd, u64_to_user_ptr(ctl.opt),
+ &ctl.optlen);
+out_queue_exit:
+ blk_queue_exit(bdev_get_queue(bdev));
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return ret;
+}
+
+ssize_t blkfilter_show(struct block_device *bdev, char *buf)
+{
+ ssize_t ret = 0;
+
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (bdev->bd_filter)
+ ret = sprintf(buf, "%s\n", bdev->bd_filter->ops->name);
+ else
+ ret = sprintf(buf, "\n");
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+
+ return ret;
+}
+
+/**
+ * blkfilter_register() - Register block device filter operations
+ * @ops: The operations to register.
+ *
+ * Return:
+ * 0 if succeeded,
+ * -EBUSY if a block device filter with the same name is already
+ * registered.
+ */
+int blkfilter_register(struct blkfilter_operations *ops)
+{
+ struct blkfilter_operations *found;
+ int ret = 0;
+
+ spin_lock(&blkfilters_lock);
+ found = __blkfilter_find(ops->name);
+ if (found)
+ ret = -EBUSY;
+ else
+ list_add_tail(&ops->link, &blkfilters);
+ spin_unlock(&blkfilters_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blkfilter_register);
+
+/**
+ * blkfilter_unregister() - Unregister block device filter operations
+ * @ops: The operations to unregister.
+ *
+ * Important: before unloading, it is necessary to detach the filter from all
+ * block devices.
+ *
+ */
+void blkfilter_unregister(struct blkfilter_operations *ops)
+{
+ spin_lock(&blkfilters_lock);
+ list_del(&ops->link);
+ spin_unlock(&blkfilters_lock);
+}
+EXPORT_SYMBOL_GPL(blkfilter_unregister);
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..1f104f4865c3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,8 @@
#include <xen/xen.h>
#include "blk-crypto-internal.h"
+struct blkfilter_ctl;
+struct blkfilter_name;
struct elevator_type;
/* Max future timer expiry for timeouts */
@@ -474,6 +476,15 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
extern const struct address_space_operations def_blk_aops;
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp);
+void blkfilter_detach(struct block_device *bdev);
+ssize_t blkfilter_show(struct block_device *bdev, char *buf);
+
int disk_register_independent_access_ranges(struct gendisk *disk);
void disk_unregister_independent_access_ranges(struct gendisk *disk);
diff --git a/block/genhd.c b/block/genhd.c
index c9d06f72c587..ba744e3fd581 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -26,6 +26,7 @@
#include <linux/badblocks.h>
#include <linux/part_stat.h>
#include <linux/blktrace_api.h>
+#include <linux/blk-filter.h>
#include "blk-throttle.h"
#include "blk.h"
@@ -654,6 +655,7 @@ void del_gendisk(struct gendisk *disk)
mutex_lock(&disk->open_mutex);
xa_for_each(&disk->part_tbl, idx, part)
remove_inode_hash(part->bd_inode);
+ blkfilter_detach(disk->part0);
mutex_unlock(&disk->open_mutex);
/*
@@ -1044,6 +1046,12 @@ static ssize_t diskseq_show(struct device *dev,
return sprintf(buf, "%llu\n", disk->diskseq);
}
+static ssize_t disk_filter_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
static DEVICE_ATTR(range, 0444, disk_range_show, NULL);
static DEVICE_ATTR(ext_range, 0444, disk_ext_range_show, NULL);
static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL);
@@ -1057,6 +1065,7 @@ static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store);
static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL);
+static DEVICE_ATTR(filter, 0444, disk_filter_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
ssize_t part_fail_show(struct device *dev,
@@ -1103,6 +1112,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_events_async.attr,
&dev_attr_events_poll_msecs.attr,
&dev_attr_diskseq.attr,
+ &dev_attr_filter.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/block/ioctl.c b/block/ioctl.c
index 4160f4e6bd5b..1b11303e213b 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -2,6 +2,7 @@
#include <linux/capability.h>
#include <linux/compat.h>
#include <linux/blkdev.h>
+#include <linux/blk-filter.h>
#include <linux/export.h>
#include <linux/gfp.h>
#include <linux/blkpg.h>
@@ -572,6 +573,12 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
return blkdev_pr_preempt(bdev, mode, argp, true);
case IOC_PR_CLEAR:
return blkdev_pr_clear(bdev, mode, argp);
+ case BLKFILTER_ATTACH:
+ return blkfilter_ioctl_attach(bdev, argp);
+ case BLKFILTER_DETACH:
+ return blkfilter_ioctl_detach(bdev, argp);
+ case BLKFILTER_CTL:
+ return blkfilter_ioctl_ctl(bdev, argp);
default:
return -ENOIOCTLCMD;
}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index f47ffcfdfcec..19c69dc23d2c 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -10,6 +10,7 @@
#include <linux/ctype.h>
#include <linux/vmalloc.h>
#include <linux/raid/detect.h>
+#include <linux/blk-filter.h>
#include "check.h"
static int (*const check_part[])(struct parsed_partitions *) = {
@@ -200,6 +201,12 @@ static ssize_t part_discard_alignment_show(struct device *dev,
return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
}
+static ssize_t part_filter_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
static DEVICE_ATTR(start, 0444, part_start_show, NULL);
static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -208,6 +215,7 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(filter, 0444, part_filter_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -222,6 +230,7 @@ static struct attribute *part_attrs[] = {
&dev_attr_discard_alignment.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+ &dev_attr_filter.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
new file mode 100644
index 000000000000..0afdb40f3bab
--- /dev/null
+++ b/include/linux/blk-filter.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _LINUX_BLK_FILTER_H
+#define _LINUX_BLK_FILTER_H
+
+#include <uapi/linux/blk-filter.h>
+
+struct bio;
+struct block_device;
+struct blkfilter_operations;
+
+/**
+ * struct blkfilter - Block device filter.
+ *
+ * @ops: Block device filter operations.
+ *
+ * For each filtered block device, the filter creates a data structure
+ * associated with this device. The data in this structure is specific to the
+ * filter, but it must contain a pointer to the block device filter account.
+ */
+struct blkfilter {
+ const struct blkfilter_operations *ops;
+};
+
+/**
+ * struct blkfilter_operations - Block device filter operations.
+ *
+ * @link: Entry in the global list of filter drivers
+ * (must not be accessed by the driver).
+ * @owner: Module implementing the filter driver.
+ * @name: Name of the filter driver.
+ * @attach: Attach the filter driver to the block device.
+ * @detach: Detach the filter driver from the block device.
+ * @ctl: Send a control command to the filter driver.
+ * @submit_bio: Handle bio submissions to the filter driver.
+ */
+struct blkfilter_operations {
+ struct list_head link;
+ struct module *owner;
+ const char *name;
+ struct blkfilter *(*attach)(struct block_device *bdev);
+ void (*detach)(struct blkfilter *flt);
+ int (*ctl)(struct blkfilter *flt, const unsigned int cmd,
+ __u8 __user *buf, __u32 *plen);
+ bool (*submit_bio)(struct bio *bio);
+};
+
+int blkfilter_register(struct blkfilter_operations *ops);
+void blkfilter_unregister(struct blkfilter_operations *ops);
+
+#endif /* _LINUX_BLK_FILTER_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d5c5e59ddbd2..490865292fde 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -74,6 +74,7 @@ struct block_device {
* path
*/
struct device bd_device;
+ struct blkfilter *bd_filter;
} __randomize_layout;
#define bdev_whole(_bdev) \
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..6a0754007d1d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -834,6 +834,7 @@ void blk_request_module(dev_t devt);
extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
+void submit_bio_noacct_nocheck(struct bio *bio);
void submit_bio_noacct(struct bio *bio);
struct bio *bio_split_to_limits(struct bio *bio);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..e7c3cd490a80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1190,6 +1190,7 @@ struct task_struct {
/* Stack plugging: */
struct blk_plug *plug;
+ struct blkfilter *blk_filter;
/* VM state: */
struct reclaim_state *reclaim_state;
diff --git a/include/uapi/linux/blk-filter.h b/include/uapi/linux/blk-filter.h
new file mode 100644
index 000000000000..18885dc1b717
--- /dev/null
+++ b/include/uapi/linux/blk-filter.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _UAPI_LINUX_BLK_FILTER_H
+#define _UAPI_LINUX_BLK_FILTER_H
+
+#include <linux/types.h>
+
+#define BLKFILTER_NAME_LENGTH 32
+
+/**
+ * struct blkfilter_name - parameter for BLKFILTER_ATTACH and BLKFILTER_DETACH
+ * ioctl.
+ *
+ * @name: Name of block device filter.
+ */
+struct blkfilter_name {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+};
+
+/**
+ * struct blkfilter_ctl - parameter for BLKFILTER_CTL ioctl
+ *
+ * @name: Name of block device filter.
+ * @cmd: The filter-specific operation code of the command.
+ * @optlen: Size of data at @opt.
+ * @opt: Userspace buffer with options.
+ */
+struct blkfilter_ctl {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+ __u32 cmd;
+ __u32 optlen;
+ __u64 opt;
+};
+
+#endif /* _UAPI_LINUX_BLK_FILTER_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..f96809cd2f50 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -189,6 +189,9 @@ struct fsxattr {
* A jump here: 130-136 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
*/
+#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
+#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)
+#define BLKFILTER_CTL _IOWR(0x12, 142, struct blkfilter_ctl)
#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
#define FIBMAP _IO(0x00,1) /* bmap access */
--
2.20.1
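
For readers who want to try the new ioctls, the following is a minimal userspace sketch and not part of the patch. The structure and ioctl definitions are copied locally from the uapi hunks above, since <linux/blk-filter.h> only exists on a kernel with this series applied. The helper name blkfilter_attach_by_name() is hypothetical, and the sketch assumes the kernel expects a NUL-terminated filter name inside the fixed 32-byte field.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Local copies of the uapi definitions proposed in this patch. */
#define BLKFILTER_NAME_LENGTH 32

struct blkfilter_name {
	__u8 name[BLKFILTER_NAME_LENGTH];
};

struct blkfilter_ctl {
	__u8 name[BLKFILTER_NAME_LENGTH];
	__u32 cmd;
	__u32 optlen;
	__u64 opt;
};

#define BLKFILTER_ATTACH	_IOWR(0x12, 140, struct blkfilter_name)
#define BLKFILTER_DETACH	_IOWR(0x12, 141, struct blkfilter_name)
#define BLKFILTER_CTL		_IOWR(0x12, 142, struct blkfilter_ctl)

/* The ioctl number encodes the argument size, so the layout must not drift. */
_Static_assert(sizeof(struct blkfilter_ctl) == 48, "unexpected blkfilter_ctl size");

/*
 * Hypothetical helper: open the block device at dev_path and ask the kernel
 * to attach the filter called filter. Returns 0 on success or a negative
 * errno value on failure.
 */
static int blkfilter_attach_by_name(const char *dev_path, const char *filter)
{
	struct blkfilter_name arg;
	size_t len = strlen(filter);
	int fd, ret;

	/* Assumption: leave room for a terminating NUL in the name field. */
	if (len >= BLKFILTER_NAME_LENGTH)
		return -EINVAL;

	memset(&arg, 0, sizeof(arg));
	memcpy(arg.name, filter, len);

	fd = open(dev_path, O_RDWR);
	if (fd < 0)
		return -errno;

	ret = ioctl(fd, BLKFILTER_ATTACH, &arg);
	if (ret < 0)
		ret = -errno;	/* save errno before close() can change it */

	close(fd);
	return ret;
}
```

A caller would pass a block device node and the name of a registered filter driver, for example blkfilter_attach_by_name("/dev/sda", "blksnap"), where both the device path and the filter name are placeholders. Detaching would follow the same pattern with BLKFILTER_DETACH.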
* Re: [PATCH v6 02/11] block: Block Device Filtering Mechanism
2023-11-24 16:59 ` [PATCH v6 02/11] block: Block Device Filtering Mechanism Sergei Shtepa
@ 2023-12-07 7:44 ` Christoph Hellwig
2023-12-07 11:22 ` Sergei Shtepa
0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2023-12-07 7:44 UTC (permalink / raw)
To: Sergei Shtepa
Cc: axboe, hch, corbet, snitzer, mingo, peterz, juri.lelli, viro,
brauner, linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa, Donald Buczek, Fabio Fantoni
> + struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> + bool skip_bio = false;
> +
> + if (unlikely(bio_queue_enter(bio)))
> + return;
> +
> + if (bio->bi_bdev->bd_filter &&
> + bio->bi_bdev->bd_filter != current->blk_filter) {
> + struct blkfilter *prev = current->blk_filter;
> +
> + current->blk_filter = bio->bi_bdev->bd_filter;
> + skip_bio = bio->bi_bdev->bd_filter->ops->submit_bio(bio);
> + current->blk_filter = prev;
> + }
> +
> + blk_queue_exit(q);
This currently adds a queue enter/exit pair even if no filter driver
is used, which is probably not acceptable. We probably need some
way to avoid the check in the fast path. In general an unlocked check
for bio->bi_bdev->bd_filter outside the protection seems fine here,
we just need to find a good way to make sure it is visible by the
time a filter is actually set and the filter driver initialization.
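
One way to read this suggestion is sketched below as a small userspace model, not kernel code and not the eventual fix. The structures are cut-down stand-ins for the kernel types, and model_queue_enter(), model_queue_exit(), model_drop_all() and the queue_enters counter are hypothetical names used only to make the control flow observable. The sketch only shows the ordering (peek at bd_filter first, enter the queue only when a filter is present, then re-check); it does not address the visibility question raised above, which in the kernel would need READ_ONCE() or a similar primitive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct bio;

struct blkfilter_operations {
	bool (*submit_bio)(struct bio *bio);
};

struct blkfilter {
	const struct blkfilter_operations *ops;
};

struct block_device {
	struct blkfilter *bd_filter;
	unsigned int queue_enters;	/* hypothetical counter, illustration only */
};

struct bio {
	struct block_device *bi_bdev;
};

/* Stands in for current->blk_filter. */
static struct blkfilter *current_blk_filter;

/* Models bio_queue_enter(): counts calls and always succeeds. */
static int model_queue_enter(struct block_device *bdev)
{
	bdev->queue_enters++;
	return 0;
}

/* Models blk_queue_exit(): nothing to release in this model. */
static void model_queue_exit(struct block_device *bdev)
{
	(void)bdev;
}

/* Example filter callback that claims every bio. */
static bool model_drop_all(struct bio *bio)
{
	(void)bio;
	return true;
}

static const struct blkfilter_operations model_drop_ops = {
	.submit_bio = model_drop_all,
};

/* Returns true if the bio was consumed and must not be submitted further. */
static bool model_filter_bio(struct bio *bio)
{
	struct block_device *bdev;
	struct blkfilter *flt;
	bool skip_bio = false;

	assert(bio && bio->bi_bdev);
	bdev = bio->bi_bdev;

	/* Fast path: unlocked peek, no queue enter/exit when unfiltered. */
	flt = bdev->bd_filter;
	if (!flt || flt == current_blk_filter)
		return false;

	if (model_queue_enter(bdev))
		return true;

	/* Re-check under protection; the filter may have been detached. */
	flt = bdev->bd_filter;
	if (flt && flt != current_blk_filter) {
		struct blkfilter *prev = current_blk_filter;

		current_blk_filter = flt;
		skip_bio = flt->ops->submit_bio(bio);
		current_blk_filter = prev;
	}

	model_queue_exit(bdev);
	return skip_bio;
}
```

With no filter attached, model_filter_bio() returns before touching the queue, which is the property the review asks for; with a filter attached, the enter/exit pair is paid once per filtered bio.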
> if (!bio->bi_bdev->bd_has_submit_bio) {
> blk_mq_submit_bio(bio);
> - } else if (likely(bio_queue_enter(bio) == 0)) {
> + return;
> + }
> +
> + if (likely(bio_queue_enter(bio) == 0)) {
This is just stray reformatting and we can drop it.
* Re: [PATCH v6 02/11] block: Block Device Filtering Mechanism
2023-12-07 7:44 ` Christoph Hellwig
@ 2023-12-07 11:22 ` Sergei Shtepa
0 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-12-07 11:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, corbet, snitzer, mingo, peterz, juri.lelli, viro, brauner,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa, Donald Buczek, Fabio Fantoni
Hi Christoph.
Thanks for reviewing this patch set.
You've given me a good deal to think about.