* [PATCH v6 01/11] documentation: Block Device Filtering Mechanism
2023-11-24 16:04 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
@ 2023-11-24 16:04 ` Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 02/11] block: " Sergei Shtepa
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:04 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, viro, brauner,
gregkh, arnd, christian.koenig, yi.l.liu, jirislaby, stfrench,
jpanis, jgg, contact, dchinner, jack, linux, min15.li, dlemoal,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The document contains:
* Describes the purpose of the mechanism
* A little historical background on the capabilities of handling I/O
units of the Linux kernel
* Brief description of the design
* Reference to interface description
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
Documentation/block/blkfilter.rst | 66 +++++++++++++++++++++++++++++++
Documentation/block/index.rst | 1 +
MAINTAINERS | 6 +++
3 files changed, 73 insertions(+)
create mode 100644 Documentation/block/blkfilter.rst
diff --git a/Documentation/block/blkfilter.rst b/Documentation/block/blkfilter.rst
new file mode 100644
index 000000000000..4e148e78f3d4
--- /dev/null
+++ b/Documentation/block/blkfilter.rst
@@ -0,0 +1,66 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+Block Device Filtering Mechanism
+================================
+
+The block device filtering mechanism provides the ability to attach block
+device filters. Block device filters allow performing additional processing
+for I/O units.
+
+Introduction
+============
+
+The idea of handling I/O units on block devices is not new. Back in the
+2.6 kernel, there was an undocumented possibility of handling I/O units
+by substituting the make_request_fn() function, which belonged to the
+request_queue structure. But none of the in-tree kernel modules used this
+feature, and it was eliminated in the 5.10 kernel.
+
+The block device filtering mechanism returns the ability to handle I/O units.
+It is possible to safely attach a filter to a block device "on the fly" without
+changing the structure of the block device's stack.
+
+It supports attaching one filter to one block device, because there is only
+one filter implementation in the kernel yet.
+See Documentation/block/blksnap.rst.
+
+Design
+======
+
+The block device filtering mechanism provides registration and unregistration
+for filter operations. The struct blkfilter_operations contains a pointer to
+the callback functions for the filter. After registering the filter operations,
+the filter can be managed using block device ioctls BLKFILTER_ATTACH,
+BLKFILTER_DETACH and BLKFILTER_CTL.
+
+When the filter is attached, the callback function is called for each I/O unit
+for a block device, providing I/O unit filtering. Depending on the result of
+filtering the I/O unit, it can either be passed for subsequent processing by
+the block layer, or skipped.
+
+The filter can be implemented as a loadable module. In this case, the filter
+module cannot be unloaded while the filter is attached to at least one of the
+block devices.
+
+Interface description
+=====================
+
+The ioctl BLKFILTER_ATTACH allows user-space programs to attach a block device
+filter to a block device. The ioctl BLKFILTER_DETACH allows user-space programs
+to detach it. Both ioctls use &struct blkfilter_name. The ioctl BLKFILTER_CTL
+allows user-space programs to send a filter-specific command. It use &struct
+blkfilter_ctl.
+
+.. kernel-doc:: include/uapi/linux/blk-filter.h
+
+To register in the system, the filter uses the &struct blkfilter_operations,
+which contains callback functions, unique filter name and module owner. When
+attaching a filter to a block device, the filter creates a &struct blkfilter.
+The pointer to the &struct blkfilter allows the filter to determine for which
+block device the callback functions are being called.
+
+.. kernel-doc:: include/linux/blk-filter.h
+
+.. kernel-doc:: block/blk-filter.c
+ :export:
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index 9fea696f9daa..e9712f72cd6d 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -10,6 +10,7 @@ Block
bfq-iosched
biovecs
blk-mq
+ blkfilter
cmdline-partition
data-integrity
deadline-iosched
diff --git a/MAINTAINERS b/MAINTAINERS
index 97f51d5ec1cf..c20cbec81b58 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3584,6 +3584,12 @@ M: Jan-Simon Moeller <jansimon.moeller@gmx.de>
S: Maintained
F: drivers/leds/leds-blinkm.c
+BLOCK DEVICE FILTERING MECHANISM
+M: Sergei Shtepa <sergei.shtepa@veeam.com>
+L: linux-block@vger.kernel.org
+S: Supported
+F: Documentation/block/blkfilter.rst
+
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
L: linux-block@vger.kernel.org
--
2.20.1
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v6 02/11] block: Block Device Filtering Mechanism
2023-11-24 16:04 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 01/11] documentation: Block Device Filtering Mechanism Sergei Shtepa
@ 2023-11-24 16:04 ` Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 03/11] documentation: Block Devices Snapshots Module Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 04/11] blksnap: header file of the module interface Sergei Shtepa
3 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:04 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, viro, brauner,
gregkh, arnd, christian.koenig, yi.l.liu, jirislaby, stfrench,
jpanis, jgg, contact, dchinner, jack, linux, min15.li, dlemoal,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa, Donald Buczek, Fabio Fantoni
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The block device filtering mechanism is an API that allows to attach
block device filters. Block device filters allow perform additional
processing for I/O units.
The idea of handling I/O units on block devices is not new. Back in the
2.6 kernel, there was an undocumented possibility of handling I/O units
by substituting the make_request_fn() function, which belonged to the
request_queue structure. But none of the in-tree kernel modules used
this feature, and it was eliminated in the 5.10 kernel.
The block device filtering mechanism returns the ability to handle I/O
units. It is possible to safely attach filter to a block device "on the
fly" without changing the structure of block devices stack.
Co-developed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Donald Buczek <buczek@molgen.mpg.de>
Tested-by: Fabio Fantoni <fantonifabio@tiscali.it>
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
MAINTAINERS | 3 +
block/Makefile | 3 +-
block/bdev.c | 2 +
block/blk-core.c | 35 ++++-
block/blk-filter.c | 238 ++++++++++++++++++++++++++++++++
block/blk.h | 11 ++
block/genhd.c | 10 ++
block/ioctl.c | 7 +
block/partitions/core.c | 9 ++
include/linux/blk-filter.h | 51 +++++++
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 1 +
include/linux/sched.h | 1 +
include/uapi/linux/blk-filter.h | 35 +++++
include/uapi/linux/fs.h | 3 +
15 files changed, 408 insertions(+), 2 deletions(-)
create mode 100644 block/blk-filter.c
create mode 100644 include/linux/blk-filter.h
create mode 100644 include/uapi/linux/blk-filter.h
diff --git a/MAINTAINERS b/MAINTAINERS
index c20cbec81b58..ef90cd0fec9c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3589,6 +3589,9 @@ M: Sergei Shtepa <sergei.shtepa@veeam.com>
L: linux-block@vger.kernel.org
S: Supported
F: Documentation/block/blkfilter.rst
+F: block/blk-filter.c
+F: include/linux/blk-filter.h
+F: include/uapi/linux/blk-filter.h
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..041c54eb0240 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,8 @@ obj-y := bdev.o fops.o bio.o elevator.o blk-core.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o \
- disk-events.o blk-ia-ranges.o early-lookup.o
+ disk-events.o blk-ia-ranges.o early-lookup.o \
+ blk-filter.o
obj-$(CONFIG_BOUNCE) += bounce.o
obj-$(CONFIG_BLK_DEV_BSG_COMMON) += bsg.o
diff --git a/block/bdev.c b/block/bdev.c
index e4cfb7adb645..6039d99b3a75 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -412,6 +412,7 @@ struct block_device *bdev_alloc(struct gendisk *disk, u8 partno)
return NULL;
}
bdev->bd_disk = disk;
+ bdev->bd_filter = NULL;
return bdev;
}
@@ -1018,6 +1019,7 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise)
}
invalidate_bdev(bdev);
+ blkfilter_detach(bdev);
}
/*
* New drivers should not use this directly. There are some drivers however
diff --git a/block/blk-core.c b/block/blk-core.c
index fdf25b8d6e78..1de74240892a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -18,6 +18,7 @@
#include <linux/blkdev.h>
#include <linux/blk-pm.h>
#include <linux/blk-integrity.h>
+#include <linux/blk-filter.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
@@ -592,12 +593,34 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
static void __submit_bio(struct bio *bio)
{
+ struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+ bool skip_bio = false;
+
+ if (unlikely(bio_queue_enter(bio)))
+ return;
+
+ if (bio->bi_bdev->bd_filter &&
+ bio->bi_bdev->bd_filter != current->blk_filter) {
+ struct blkfilter *prev = current->blk_filter;
+
+ current->blk_filter = bio->bi_bdev->bd_filter;
+ skip_bio = bio->bi_bdev->bd_filter->ops->submit_bio(bio);
+ current->blk_filter = prev;
+ }
+
+ blk_queue_exit(q);
+ if (skip_bio)
+ return;
+
if (unlikely(!blk_crypto_bio_prep(&bio)))
return;
if (!bio->bi_bdev->bd_has_submit_bio) {
blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
+ return;
+ }
+
+ if (likely(bio_queue_enter(bio) == 0)) {
struct gendisk *disk = bio->bi_bdev->bd_disk;
disk->fops->submit_bio(bio);
@@ -681,6 +704,15 @@ static void __submit_bio_noacct_mq(struct bio *bio)
current->bio_list = NULL;
}
+/**
+ * submit_bio_noacct_nocheck - re-submit a bio to the block device layer for I/O
+ * from block device filter.
+ * @bio: The bio describing the location in memory and on the device.
+ *
+ * This is a version of submit_bio() that shall only be used for I/O that is
+ * resubmitted to lower level by block device filters. All file systems and
+ * other upper level users of the block layer should use submit_bio() instead.
+ */
void submit_bio_noacct_nocheck(struct bio *bio)
{
blk_cgroup_bio_start(bio);
@@ -708,6 +740,7 @@ void submit_bio_noacct_nocheck(struct bio *bio)
else
__submit_bio_noacct(bio);
}
+EXPORT_SYMBOL_GPL(submit_bio_noacct_nocheck);
/**
* submit_bio_noacct - re-submit a bio to the block device layer for I/O
diff --git a/block/blk-filter.c b/block/blk-filter.c
new file mode 100644
index 000000000000..8e2550bed0c5
--- /dev/null
+++ b/block/blk-filter.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#include <linux/blk-filter.h>
+#include <linux/blk-mq.h>
+#include <linux/module.h>
+
+#include "blk.h"
+
+static LIST_HEAD(blkfilters);
+static DEFINE_SPINLOCK(blkfilters_lock);
+
+static inline struct blkfilter_operations *__blkfilter_find(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ list_for_each_entry(ops, &blkfilters, link)
+ if (strncmp(ops->name, name, BLKFILTER_NAME_LENGTH) == 0)
+ return ops;
+
+ return NULL;
+}
+
+static inline struct blkfilter_operations *blkfilter_find_get(const char *name)
+{
+ struct blkfilter_operations *ops;
+
+ spin_lock(&blkfilters_lock);
+ ops = __blkfilter_find(name);
+ if (ops && !try_module_get(ops->owner))
+ ops = NULL;
+ spin_unlock(&blkfilters_lock);
+
+ return ops;
+}
+
+static inline void blkfilter_put(const struct blkfilter_operations *ops)
+{
+ module_put(ops->owner);
+}
+
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ struct blkfilter_operations *ops;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ ops = blkfilter_find_get(name.name);
+ if (!ops)
+ return -ENOENT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ ret = freeze_bdev(bdev);
+ if (ret)
+ goto out_mutex_unlock;
+ blk_mq_freeze_queue(bdev->bd_queue);
+
+ if (bdev->bd_filter) {
+ if (bdev->bd_filter->ops == ops)
+ ret = -EALREADY;
+ else
+ ret = -EBUSY;
+ goto out_unfreeze;
+ }
+
+ flt = ops->attach(bdev);
+ if (IS_ERR(flt)) {
+ ret = PTR_ERR(flt);
+ goto out_unfreeze;
+ }
+
+ flt->ops = ops;
+ bdev->bd_filter = flt;
+
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ thaw_bdev(bdev);
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ if (ret)
+ blkfilter_put(ops);
+ return ret;
+}
+
+static void __blkfilter_detach(struct block_device *bdev)
+{
+ struct blkfilter *flt = bdev->bd_filter;
+ const struct blkfilter_operations *ops = flt->ops;
+
+ bdev->bd_filter = NULL;
+ ops->detach(flt);
+ blkfilter_put(ops);
+}
+
+void blkfilter_detach(struct block_device *bdev)
+{
+ if (bdev->bd_filter) {
+ blk_mq_freeze_queue(bdev->bd_queue);
+ __blkfilter_detach(bdev);
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+ }
+}
+
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp)
+{
+ struct blkfilter_name name;
+ int ret = 0;
+
+ if (copy_from_user(&name, argp, sizeof(name)))
+ return -EFAULT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (!bdev->bd_filter) {
+ ret = -ENOENT;
+ goto out_unfreeze;
+ }
+ if (strncmp(bdev->bd_filter->ops->name, name.name,
+ BLKFILTER_NAME_LENGTH)) {
+ ret = -EINVAL;
+ goto out_unfreeze;
+ }
+
+ __blkfilter_detach(bdev);
+out_unfreeze:
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return ret;
+}
+
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp)
+{
+ struct blkfilter_ctl ctl;
+ struct blkfilter *flt;
+ int ret;
+
+ if (copy_from_user(&ctl, argp, sizeof(ctl)))
+ return -EFAULT;
+
+ mutex_lock(&bdev->bd_disk->open_mutex);
+ if (!disk_live(bdev->bd_disk)) {
+ ret = -ENODEV;
+ goto out_mutex_unlock;
+ }
+ ret = blk_queue_enter(bdev_get_queue(bdev), 0);
+ if (ret)
+ goto out_mutex_unlock;
+
+ flt = bdev->bd_filter;
+ if (!flt || strncmp(flt->ops->name, ctl.name, BLKFILTER_NAME_LENGTH)) {
+ ret = -ENOENT;
+ goto out_queue_exit;
+ }
+
+ if (!flt->ops->ctl) {
+ ret = -ENOTTY;
+ goto out_queue_exit;
+ }
+
+ ret = flt->ops->ctl(flt, ctl.cmd, u64_to_user_ptr(ctl.opt),
+ &ctl.optlen);
+out_queue_exit:
+ blk_queue_exit(bdev_get_queue(bdev));
+out_mutex_unlock:
+ mutex_unlock(&bdev->bd_disk->open_mutex);
+ return ret;
+}
+
+ssize_t blkfilter_show(struct block_device *bdev, char *buf)
+{
+ ssize_t ret = 0;
+
+ blk_mq_freeze_queue(bdev->bd_queue);
+ if (bdev->bd_filter)
+ ret = sprintf(buf, "%s\n", bdev->bd_filter->ops->name);
+ else
+ ret = sprintf(buf, "\n");
+ blk_mq_unfreeze_queue(bdev->bd_queue);
+
+ return ret;
+}
+
+/**
+ * blkfilter_register() - Register block device filter operations
+ * @ops: The operations to register.
+ *
+ * Return:
+ * 0 if succeeded,
+ * -EBUSY if a block device filter with the same name is already
+ * registered.
+ */
+int blkfilter_register(struct blkfilter_operations *ops)
+{
+ struct blkfilter_operations *found;
+ int ret = 0;
+
+ spin_lock(&blkfilters_lock);
+ found = __blkfilter_find(ops->name);
+ if (found)
+ ret = -EBUSY;
+ else
+ list_add_tail(&ops->link, &blkfilters);
+ spin_unlock(&blkfilters_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(blkfilter_register);
+
+/**
+ * blkfilter_unregister() - Unregister block device filter operations
+ * @ops: The operations to unregister.
+ *
+ * Important: before unloading, it is necessary to detach the filter from all
+ * block devices.
+ *
+ */
+void blkfilter_unregister(struct blkfilter_operations *ops)
+{
+ spin_lock(&blkfilters_lock);
+ list_del(&ops->link);
+ spin_unlock(&blkfilters_lock);
+}
+EXPORT_SYMBOL_GPL(blkfilter_unregister);
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..1f104f4865c3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,8 @@
#include <xen/xen.h>
#include "blk-crypto-internal.h"
+struct blkfilter_ctl;
+struct blkfilter_name;
struct elevator_type;
/* Max future timer expiry for timeouts */
@@ -474,6 +476,15 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
extern const struct address_space_operations def_blk_aops;
+int blkfilter_ioctl_attach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_detach(struct block_device *bdev,
+ struct blkfilter_name __user *argp);
+int blkfilter_ioctl_ctl(struct block_device *bdev,
+ struct blkfilter_ctl __user *argp);
+void blkfilter_detach(struct block_device *bdev);
+ssize_t blkfilter_show(struct block_device *bdev, char *buf);
+
int disk_register_independent_access_ranges(struct gendisk *disk);
void disk_unregister_independent_access_ranges(struct gendisk *disk);
diff --git a/block/genhd.c b/block/genhd.c
index c9d06f72c587..ba744e3fd581 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -26,6 +26,7 @@
#include <linux/badblocks.h>
#include <linux/part_stat.h>
#include <linux/blktrace_api.h>
+#include <linux/blk-filter.h>
#include "blk-throttle.h"
#include "blk.h"
@@ -654,6 +655,7 @@ void del_gendisk(struct gendisk *disk)
mutex_lock(&disk->open_mutex);
xa_for_each(&disk->part_tbl, idx, part)
remove_inode_hash(part->bd_inode);
+ blkfilter_detach(disk->part0);
mutex_unlock(&disk->open_mutex);
/*
@@ -1044,6 +1046,12 @@ static ssize_t diskseq_show(struct device *dev,
return sprintf(buf, "%llu\n", disk->diskseq);
}
+static ssize_t disk_filter_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
static DEVICE_ATTR(range, 0444, disk_range_show, NULL);
static DEVICE_ATTR(ext_range, 0444, disk_ext_range_show, NULL);
static DEVICE_ATTR(removable, 0444, disk_removable_show, NULL);
@@ -1057,6 +1065,7 @@ static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
static DEVICE_ATTR(badblocks, 0644, disk_badblocks_show, disk_badblocks_store);
static DEVICE_ATTR(diskseq, 0444, diskseq_show, NULL);
+static DEVICE_ATTR(filter, 0444, disk_filter_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
ssize_t part_fail_show(struct device *dev,
@@ -1103,6 +1112,7 @@ static struct attribute *disk_attrs[] = {
&dev_attr_events_async.attr,
&dev_attr_events_poll_msecs.attr,
&dev_attr_diskseq.attr,
+ &dev_attr_filter.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/block/ioctl.c b/block/ioctl.c
index 4160f4e6bd5b..1b11303e213b 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -2,6 +2,7 @@
#include <linux/capability.h>
#include <linux/compat.h>
#include <linux/blkdev.h>
+#include <linux/blk-filter.h>
#include <linux/export.h>
#include <linux/gfp.h>
#include <linux/blkpg.h>
@@ -572,6 +573,12 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
return blkdev_pr_preempt(bdev, mode, argp, true);
case IOC_PR_CLEAR:
return blkdev_pr_clear(bdev, mode, argp);
+ case BLKFILTER_ATTACH:
+ return blkfilter_ioctl_attach(bdev, argp);
+ case BLKFILTER_DETACH:
+ return blkfilter_ioctl_detach(bdev, argp);
+ case BLKFILTER_CTL:
+ return blkfilter_ioctl_ctl(bdev, argp);
default:
return -ENOIOCTLCMD;
}
diff --git a/block/partitions/core.c b/block/partitions/core.c
index f47ffcfdfcec..19c69dc23d2c 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -10,6 +10,7 @@
#include <linux/ctype.h>
#include <linux/vmalloc.h>
#include <linux/raid/detect.h>
+#include <linux/blk-filter.h>
#include "check.h"
static int (*const check_part[])(struct parsed_partitions *) = {
@@ -200,6 +201,12 @@ static ssize_t part_discard_alignment_show(struct device *dev,
return sprintf(buf, "%u\n", bdev_discard_alignment(dev_to_bdev(dev)));
}
+static ssize_t part_filter_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return blkfilter_show(dev_to_bdev(dev), buf);
+}
+
static DEVICE_ATTR(partition, 0444, part_partition_show, NULL);
static DEVICE_ATTR(start, 0444, part_start_show, NULL);
static DEVICE_ATTR(size, 0444, part_size_show, NULL);
@@ -208,6 +215,7 @@ static DEVICE_ATTR(alignment_offset, 0444, part_alignment_offset_show, NULL);
static DEVICE_ATTR(discard_alignment, 0444, part_discard_alignment_show, NULL);
static DEVICE_ATTR(stat, 0444, part_stat_show, NULL);
static DEVICE_ATTR(inflight, 0444, part_inflight_show, NULL);
+static DEVICE_ATTR(filter, 0444, part_filter_show, NULL);
#ifdef CONFIG_FAIL_MAKE_REQUEST
static struct device_attribute dev_attr_fail =
__ATTR(make-it-fail, 0644, part_fail_show, part_fail_store);
@@ -222,6 +230,7 @@ static struct attribute *part_attrs[] = {
&dev_attr_discard_alignment.attr,
&dev_attr_stat.attr,
&dev_attr_inflight.attr,
+ &dev_attr_filter.attr,
#ifdef CONFIG_FAIL_MAKE_REQUEST
&dev_attr_fail.attr,
#endif
diff --git a/include/linux/blk-filter.h b/include/linux/blk-filter.h
new file mode 100644
index 000000000000..0afdb40f3bab
--- /dev/null
+++ b/include/linux/blk-filter.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _LINUX_BLK_FILTER_H
+#define _LINUX_BLK_FILTER_H
+
+#include <uapi/linux/blk-filter.h>
+
+struct bio;
+struct block_device;
+struct blkfilter_operations;
+
+/**
+ * struct blkfilter - Block device filter.
+ *
+ * @ops: Block device filter operations.
+ *
+ * For each filtered block device, the filter creates a data structure
+ * associated with this device. The data in this structure is specific to the
+ * filter, but it must contain a pointer to the block device filter account.
+ */
+struct blkfilter {
+ const struct blkfilter_operations *ops;
+};
+
+/**
+ * struct blkfilter_operations - Block device filter operations.
+ *
+ * @link: Entry in the global list of filter drivers
+ * (must not be accessed by the driver).
+ * @owner: Module implementing the filter driver.
+ * @name: Name of the filter driver.
+ * @attach: Attach the filter driver to the block device.
+ * @detach: Detach the filter driver from the block device.
+ * @ctl: Send a control command to the filter driver.
+ * @submit_bio: Handle bio submissions to the filter driver.
+ */
+struct blkfilter_operations {
+ struct list_head link;
+ struct module *owner;
+ const char *name;
+ struct blkfilter *(*attach)(struct block_device *bdev);
+ void (*detach)(struct blkfilter *flt);
+ int (*ctl)(struct blkfilter *flt, const unsigned int cmd,
+ __u8 __user *buf, __u32 *plen);
+ bool (*submit_bio)(struct bio *bio);
+};
+
+int blkfilter_register(struct blkfilter_operations *ops);
+void blkfilter_unregister(struct blkfilter_operations *ops);
+
+#endif /* _UAPI_LINUX_BLK_FILTER_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d5c5e59ddbd2..490865292fde 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -74,6 +74,7 @@ struct block_device {
* path
*/
struct device bd_device;
+ struct blkfilter *bd_filter;
} __randomize_layout;
#define bdev_whole(_bdev) \
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..6a0754007d1d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -834,6 +834,7 @@ void blk_request_module(dev_t devt);
extern int blk_register_queue(struct gendisk *disk);
extern void blk_unregister_queue(struct gendisk *disk);
+void submit_bio_noacct_nocheck(struct bio *bio);
void submit_bio_noacct(struct bio *bio);
struct bio *bio_split_to_limits(struct bio *bio);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..e7c3cd490a80 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1190,6 +1190,7 @@ struct task_struct {
/* Stack plugging: */
struct blk_plug *plug;
+ struct blkfilter *blk_filter;
/* VM state: */
struct reclaim_state *reclaim_state;
diff --git a/include/uapi/linux/blk-filter.h b/include/uapi/linux/blk-filter.h
new file mode 100644
index 000000000000..18885dc1b717
--- /dev/null
+++ b/include/uapi/linux/blk-filter.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _UAPI_LINUX_BLK_FILTER_H
+#define _UAPI_LINUX_BLK_FILTER_H
+
+#include <linux/types.h>
+
+#define BLKFILTER_NAME_LENGTH 32
+
+/**
+ * struct blkfilter_name - parameter for BLKFILTER_ATTACH and BLKFILTER_DETACH
+ * ioctl.
+ *
+ * @name: Name of block device filter.
+ */
+struct blkfilter_name {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+};
+
+/**
+ * struct blkfilter_ctl - parameter for BLKFILTER_CTL ioctl
+ *
+ * @name: Name of block device filter.
+ * @cmd: The filter-specific operation code of the command.
+ * @optlen: Size of data at @opt.
+ * @opt: Userspace buffer with options.
+ */
+struct blkfilter_ctl {
+ __u8 name[BLKFILTER_NAME_LENGTH];
+ __u32 cmd;
+ __u32 optlen;
+ __u64 opt;
+};
+
+#endif /* _UAPI_LINUX_BLK_FILTER_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..f96809cd2f50 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -189,6 +189,9 @@ struct fsxattr {
* A jump here: 130-136 are reserved for zoned block devices
* (see uapi/linux/blkzoned.h)
*/
+#define BLKFILTER_ATTACH _IOWR(0x12, 140, struct blkfilter_name)
+#define BLKFILTER_DETACH _IOWR(0x12, 141, struct blkfilter_name)
+#define BLKFILTER_CTL _IOWR(0x12, 142, struct blkfilter_ctl)
#define BMAP_IOCTL 1 /* obsolete - kept for compatibility */
#define FIBMAP _IO(0x00,1) /* bmap access */
--
2.20.1
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v6 03/11] documentation: Block Devices Snapshots Module
2023-11-24 16:04 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 01/11] documentation: Block Device Filtering Mechanism Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 02/11] block: " Sergei Shtepa
@ 2023-11-24 16:04 ` Sergei Shtepa
2023-11-24 16:04 ` [PATCH v6 04/11] blksnap: header file of the module interface Sergei Shtepa
3 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:04 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, viro, brauner,
gregkh, arnd, christian.koenig, yi.l.liu, jirislaby, stfrench,
jpanis, jgg, contact, dchinner, jack, linux, min15.li, dlemoal,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa, Bagas Sanjaya, Fabio Fantoni
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The document contains:
* Describes the purpose of the mechanism
* Description of features
* Description of algorithms
* Recommendations about using the module from the user-space side
* Reference to module interface description
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Reviewed-by: Fabio Fantoni <fantonifabio@tiscali.it>
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
Documentation/block/blksnap.rst | 352 ++++++++++++++++++++++++++++++++
Documentation/block/index.rst | 1 +
MAINTAINERS | 6 +
3 files changed, 359 insertions(+)
create mode 100644 Documentation/block/blksnap.rst
diff --git a/Documentation/block/blksnap.rst b/Documentation/block/blksnap.rst
new file mode 100644
index 000000000000..ef6010e46858
--- /dev/null
+++ b/Documentation/block/blksnap.rst
@@ -0,0 +1,352 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================================
+Block Devices Snapshots Module (blksnap)
+========================================
+
+Introduction
+============
+
+At first glance, there is no novelty in the idea of creating snapshots for
+block devices. The Linux kernel already has mechanisms for creating snapshots.
+Device Mapper includes dm-snap, which allows to create snapshots of block
+devices. BTRFS supports snapshots at the filesystem level. However, both of
+these options have flaws that do not allow to use them as a universal tool for
+creating backups.
+
+The main properties that a backup tool should have are:
+
+- Simplicity and universality of use
+- Reliability
+- Minimal consumption of system resources during backup
+- Minimal time required for recovery or replication of the entire system
+
+Taking above properties into account, blksnap module features:
+
+- Change tracker
+- Snapshots at the block device level
+- Dynamic allocation of space for storing differences
+- Snapshot overflow resistance
+- Coherent snapshot of multiple block devices
+
+Features
+========
+
+Change tracker
+--------------
+
+The change tracker allows to determine which blocks were changed during the
+time between the last snapshot created and any of the previous snapshots.
+With a map of changes, it is enough to copy only the changed blocks, and no
+need to reread the entire block device completely. The change tracker allows
+to implement the logic of both incremental and differential backups.
+Incremental backup is critical for large file repositories whose size can be
+hundreds of terabytes and whose full backup time can take more than a day.
+On such servers, the use of backup tools without a change tracker becomes
+practically impossible.
+
+Snapshot at the block device level
+----------------------------------
+
+A snapshot at the block device level allows to simplify the backup algorithm
+and reduce consumption of system resources. It also allows to perform linear
+reading of disk space directly, which allows to achieve maximum reading speed
+with minimal use of processor time. At the same time, the universality of
+creating snapshots for any block device is achieved, regardless of the file
+system located on it. The exceptions are BTRFS, ZFS and cluster file systems.
+
+Dynamic allocation of storage space for differences
+---------------------------------------------------
+
+To store differences, the module does not require a pre-reserved space on
+filesystem. The space for storing differences can be allocated in file in any
+filesystem. In addition, the size of the difference storage can be increased
+after the snapshot is created, but only for a filesystem that supports
+fallocate. A shared difference storage for all images of snapshot block devices
+allows to optimize the use of storage space. However, there is one limitation.
+A snapshot cannot be taken from a block device on which the difference storage
+is located.
+
+Snapshot overflow resistance
+----------------------------
+
+To create images of snapshots of block devices, the module stores blocks
+of the original block device that have been changed since the snapshot
+was taken. To do this, the module handles write requests and reads blocks
+that need to be overwritten. This algorithm guarantees safety of the data
+of the original block device in the event of an overflow of the snapshot,
+and even in the case of unpredictable critical errors. If a problem occurs
+during backup, the difference storage is released, the snapshot is closed,
+no backup is created, but the server continues to work.
+
+Coherent snapshot of multiple block devices
+-------------------------------------------
+
+A snapshot is created simultaneously for all block devices for which a backup
+is being created, ensuring their coherent state.
+
+
+Algorithms
+==========
+
+Overview
+--------
+
+The blksnap module is a block-level filter. It handles all write I/O units.
+The filter is attached to the block device when the snapshot is created
+for the first time. The change tracker marks all overwritten blocks.
+Information about the history of changes on the block device is available
+while holding the snapshot. The module reads the blocks that need to be
+overwritten and stores them in the difference storage. When reading from
+a snapshot image, reading is performed either from the original device or
+from the difference storage.
+
+Change tracking
+---------------
+
+A change tracker map is created for each block device. One byte of this map
+corresponds to one block. The block size is set by the
+``tracking_block_minimum_shift`` and ``tracking_block_maximum_count``
+module parameters. The ``tracking_block_minimum_shift`` parameter limits
+the minimum block size for tracking, while ``tracking_block_maximum_count``
+defines the maximum allowed number of blocks. The size of the change tracker
+block is determined depending on the size of the block device when adding
+a tracking device, that is, when the snapshot is taken for the first time.
+The block size must be a power of two. The ``tracking_block_maximum_shift``
+module parameter allows to limit the maximum block size for tracking. If the
+block size reaches the allowable limit, the number of blocks will exceed the
+``tracking_block_maximum_count`` parameter.
+
+The byte of the change map stores a number from 0 to 255. This is the
+snapshot number, since the creation of which there have been changes in
+the block. Each time a snapshot is created, the number of the current
+snapshot is increased by one. This number is written to the cell of the
+change map when writing to the block. Thus, knowing the number of one of
+the previous snapshots and the number of the last snapshot, one can determine
+from the change map which blocks have been changed. When the number of the
+current change reaches the maximum allowed value for the map of 255, at the
+time when the next snapshot is created, the map of changes is reset to zero,
+and the number of the current snapshot is assigned the value 1. The change
+tracker is reset, and a new UUID is generated - a unique identifier of the
+snapshot generation. The snapshot generation identifier allows to identify
+that a change tracking reset has been performed.
+
+The change map has two copies. One copy is active, it tracks the current
+changes on the block device. The second copy is available for reading
+while the snapshot is being held, and contains the history up to the moment
+the snapshot is taken. Copies are synchronized at the moment of snapshot
+creation. After the snapshot is released, a second copy of the map is not
+needed, but it is not released, so as not to allocate memory for it again
+the next time the snapshot is created.
+
+Copy on write
+-------------
+
+Data is copied in blocks, or rather in chunks. The term "chunk" is used to
+avoid confusion with change tracker blocks and I/O blocks. In addition,
+the "chunk" in the blksnap module means about the same as the "chunk" in
+the dm-snap module.
+
+The size of the chunk is determined by the ``chunk_minimum_shift`` and
+``chunk_maximum_count`` module parameters. The ``chunk_minimum_shift``
+parameter limits the minimum size of the chunk, while ``chunk_maximum_count``
+defines the maximum allowed number of chunks. The size of the chunk is
+determined depending on the size of the block device at the time of taking the
+snapshot. The size of the chunk must be a power of two. The module parameter
+``chunk_maximum_shift`` allows to limit the maximum chunk size. If the chunk
+size reaches the allowable limit, the number of chunks will exceed the
+``chunk_maximum_count`` parameter.
+
+One chunk is described by the ``struct chunk`` structure. A map of structures
+is created for each block device. The structure contains all the necessary
+information to copy the chunks data from the original block device to the
+difference storage. This information allows to describe the snapshot image.
+A semaphore is located in the structure, which allows synchronization of threads
+accessing the chunk.
+
+The block level in Linux has a feature. If a read I/O unit was sent, and a
+write I/O unit was sent after it, then a write can be performed first, and only
+then a read. Therefore, the copy-on-write algorithm is executed synchronously.
+If the write request is handled, the execution of this I/O unit will be delayed
+until the overwritten chunks are read from the original device for later
+storing to the difference store. But if, when handling a write I/O unit, it
+turns out that the written range of sectors has already been prepared for
+storing to the difference storage, then the I/O unit is simply passed.
+
+This algorithm makes it possible to efficiently perform backup even systems
+with a Round-Robin databases. Such databases can be overwritten several times
+during the system backup. Of course, the value of a backup of the RRD monitoring
+system data can be questioned. However, it is often a task to make a backup
+of the entire enterprise infrastructure in order to restore or replicate it
+entirely in case of problems.
+
+There is also a flaw in the algorithm. When overwriting at least one sector,
+an entire chunk is copied. Thus, a situation of rapid filling of the difference
+storage when writing data to a block device in small portions in random order
+is possible. This situation is possible in case of strong fragmentation of
+data on the filesystem. But it must be borne in mind that with such data
+fragmentation, performance of systems usually degrades greatly. So, this
+problem does not occur on real servers, although it can easily be created
+by artificial tests.
+
+Difference storage
+------------------
+
+The difference storage can be a block device or it can be a file on a
+filesystem. Using a block device allows to achieve slightly higher performance,
+but in this case, the block device is used by the kernel module exclusively.
+Usually the disk space is marked up so that there is no available free space
+for backup purposes. Using a file allows to place the difference storage on a
+filesystem.
+
+The difference storage can be expanded already while the snapshot is being held,
+but only if the filesystem supports fallocate(). If the free space in the
+difference storage remains less than half of the value of the module parameter
+``diff_storage_minimum``, then the kernel module can expand the difference
+storage file within the specified limits. This limit is set when creating a
+snapshot.
+
+If free space in the difference storage runs out, an event to user land is
+generated about the overflow of the snapshot. Such a snapshot is considered
+corrupted, and read I/O units to snapshot images will be terminated with an
+error code. The difference storage stores outdated data required for snapshot
+images, so when the snapshot is overflowed, the backup process is interrupted,
+but the system maintains its operability without data loss.
+
+The difference storage has a limitation. The device cannot be added to the
+snapshot where the difference storage is located. In this case, the difference
+storage can be located in virtual memory, which consists of RAM and a swap
+partition (or file). To do this, it is enough to use a file in /dev/shm, or a
+new tmpfs filesystem can be created for this purpose. Obviously, this variant
+can be useful if the system has a lot of RAM or a large swap. The good news is
+that the modern Linux kernel allows to increase the size of the swap file "on
+the fly" without changing the system configuration.
+
+A regular file or a block device file for the difference storage must be opened
+with the O_EXCL flag. If an unnamed file with the O_TMPFILE flag is created,
+then such a file will be automatically released when the snapshot is destroyed.
+In addition, the use of an unnamed temporary file ensures that no one can open
+this file and read its contents.
+
+Performing I/O for a snapshot image
+-----------------------------------
+
+To read snapshot data, when taking a snapshot, block devices of snapshot images
+are created. The snapshot image block devices support the write operation.
+This allows to perform additional data preparation on the filesystem before
+creating a backup.
+
+To process the I/O unit, clones of the I/O unit are created, which redirect
+the I/O unit either to the original block device or to the difference storage.
+When processing of cloned I/O units is completed, the original I/O unit is
+marked as completed too.
+
+An I/O unit can be partially processed without accessing to block devices if
+the I/O unit refers to a chunk that is in the queue for storing to the
+difference storage. In this case, the data is read or written in a buffer in
+memory.
+
+If, when processing the write I/O unit, it turns out that the data of the
+referred chunk has not yet been stored to the difference storage or has not
+even been read from the original device, then an I/O unit to read data from the
+original device is initiated beforehand. After the reading from original device
+is performed, their data from the I/O unit is partially overwritten directly in
+the buffer of the chunk in memory, and the chunk is scheduled to be saved to the
+difference storage.
+
+How to use
+==========
+
+Depending on the needs and the selected license, you can choose different
+options for managing the module:
+
+- Using ioctl directly
+- Using a static C++ library
+- Using the blksnap console tool
+
+Using a BLKFILTER_CTL for block device
+--------------------------------------
+
+BLKFILTER_CTL allows to send a filter-specific command to the filter on block
+device and get the result of its execution. The module provides the
+``include/uapi/blksnap.h`` header file with a description of the commands and
+their data structures.
+
+1. ``blkfilter_ctl_blksnap_cbtinfo`` allows to get information from the
+ change tracker.
+2. ``blkfilter_ctl_blksnap_cbtmap`` reads the change tracker table. If a write
+ operation was performed for the snapshot, then the change tracker takes this
+ into account. Therefore, it is necessary to receive tracker data after write
+ operations have been completed.
+3. ``blkfilter_ctl_blksnap_cbtdirty`` mark blocks as changed in the change
+ tracker table. This is necessary if post-processing is performed after the
+ backup is created, which changes the backup blocks.
+4. ``blkfilter_ctl_blksnap_snapshotadd`` adds a block device to the snapshot.
+5. ``blkfilter_ctl_blksnap_snapshotinfo`` allows to get the name of the snapshot
+ image block device and the presence of an error.
+
+Using ioctl
+-----------
+
+Using a BLKFILTER_CTL ioctl does not allow to fully implement the management of
+the blksnap module. A control file ``blksnap-control`` is created to manage
+snapshots. The control commands are also described in the file
+``include/uapi/blksnap.h``.
+
+1. ``blksnap_ioctl_version`` get the version number.
+2. ``blk_snap_ioctl_snapshot_create`` initiates the snapshot creation process.
+3. ``blk_snap_ioctl_snapshot_append_storage`` add the range of blocks to
+ difference storage.
+4. ``blk_snap_ioctl_snapshot_take`` creates block devices of block device
+ snapshot images.
+5. ``blk_snap_ioctl_snapshot_collect`` collect all created snapshots.
+6. ``blk_snap_ioctl_snapshot_wait_event`` allows to track the status of
+ snapshots and receive events about the requirement to expand the difference
+ storage or about snapshot overflow.
+7. ``blk_snap_ioctl_snapshot_destroy`` releases the snapshot.
+
+Static C++ library
+------------------
+
+The [#userspace_libs]_ library was created primarily to simplify creation of
+tests in C++, and it is also a good example of using the module interface.
+When creating applications, direct use of control calls is preferable.
+However, the library can be used in an application with a GPL-2+ license,
+or a library with an LGPL-2+ license can be created, with which even a
+proprietary application can be dynamically linked.
+
+blksnap console tool
+--------------------
+
+The blksnap [#userspace_tools]_ console tool allows to control the module from
+the command line. The tool contains detailed built-in help. To get list of
+commands with usage description, see ``blksnap --help`` command. The ``blksnap
+<command name> --help`` command allows to get detailed information about the
+parameters of each command call. This option may be convenient when creating
+proprietary software, as it allows not to compile with the open source code.
+At the same time, the blksnap tool can be used for creating backup scripts.
+For example, rsync can be called to synchronize files on the filesystem of
+the mounted snapshot image and files in the archive on a filesystem that
+supports compression.
+
+Tests
+-----
+
+A set of tests was created for regression testing [#userspace_tests]_.
+Tests with simple algorithms that use the ``blksnap`` console tool to
+control the module are written in Bash. More complex testing algorithms
+are implemented in C++.
+
+References
+==========
+
+.. [#userspace_libs] https://github.com/veeam/blksnap/tree/stable-v2.0/lib
+
+.. [#userspace_tools] https://github.com/veeam/blksnap/tree/stable-v2.0/tools
+
+.. [#userspace_tests] https://github.com/veeam/blksnap/tree/stable-v2.0/tests
+
+Module interface description
+============================
+
+.. kernel-doc:: include/uapi/linux/blksnap.h
diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst
index e9712f72cd6d..696ff150c6b7 100644
--- a/Documentation/block/index.rst
+++ b/Documentation/block/index.rst
@@ -11,6 +11,7 @@ Block
biovecs
blk-mq
blkfilter
+ blksnap
cmdline-partition
data-integrity
deadline-iosched
diff --git a/MAINTAINERS b/MAINTAINERS
index ef90cd0fec9c..9c81e4c83139 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3593,6 +3593,12 @@ F: block/blk-filter.c
F: include/linux/blk-filter.h
F: include/uapi/linux/blk-filter.h
+BLOCK DEVICE SNAPSHOTS MODULE
+M: Sergei Shtepa <sergei.shtepa@veeam.com>
+L: linux-block@vger.kernel.org
+S: Supported
+F: Documentation/block/blksnap.rst
+
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
L: linux-block@vger.kernel.org
--
2.20.1
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH v6 04/11] blksnap: header file of the module interface
2023-11-24 16:04 [PATCH v6 00/11] blksnap - block devices snapshots module Sergei Shtepa
` (2 preceding siblings ...)
2023-11-24 16:04 ` [PATCH v6 03/11] documentation: Block Devices Snapshots Module Sergei Shtepa
@ 2023-11-24 16:04 ` Sergei Shtepa
3 siblings, 0 replies; 9+ messages in thread
From: Sergei Shtepa @ 2023-11-24 16:04 UTC (permalink / raw)
To: axboe, hch, corbet, snitzer
Cc: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
rostedt, bsegall, mgorman, bristot, vschneid, viro, brauner,
gregkh, arnd, christian.koenig, yi.l.liu, jirislaby, stfrench,
jpanis, jgg, contact, dchinner, jack, linux, min15.li, dlemoal,
linux-block, linux-doc, linux-kernel, linux-fsdevel,
Sergei Shtepa, Donald Buczek
From: Sergei Shtepa <sergei.shtepa@veeam.com>
The header file contains a set of declarations, structures and control
requests (ioctl) that allows to manage the module from the user space.
Co-developed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Sergei Shtepa <sergei.shtepa@veeam.com>
---
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 1 +
include/uapi/linux/blksnap.h | 388 ++++++++++++++++++
3 files changed, 390 insertions(+)
create mode 100644 include/uapi/linux/blksnap.h
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 4ea5b837399a..81acae1b1859 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -203,6 +203,7 @@ Code Seq# Include File Comments
'V' C0 linux/ivtvfb.h conflict!
'V' C0 linux/ivtv.h conflict!
'V' C0 media/si4713.h conflict!
+'V' 00-1F uapi/linux/blksnap.h conflict!
'W' 00-1F linux/watchdog.h conflict!
'W' 00-1F linux/wanrouter.h conflict! (pre 3.9)
'W' 00-3F sound/asound.h conflict!
diff --git a/MAINTAINERS b/MAINTAINERS
index 9c81e4c83139..9770c4d4b15d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3598,6 +3598,7 @@ M: Sergei Shtepa <sergei.shtepa@veeam.com>
L: linux-block@vger.kernel.org
S: Supported
F: Documentation/block/blksnap.rst
+F: include/uapi/linux/blksnap.h
BLOCK LAYER
M: Jens Axboe <axboe@kernel.dk>
diff --git a/include/uapi/linux/blksnap.h b/include/uapi/linux/blksnap.h
new file mode 100644
index 000000000000..be1474f2025c
--- /dev/null
+++ b/include/uapi/linux/blksnap.h
@@ -0,0 +1,388 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (C) 2023 Veeam Software Group GmbH */
+#ifndef _UAPI_LINUX_BLKSNAP_H
+#define _UAPI_LINUX_BLKSNAP_H
+
+#include <linux/types.h>
+
+#define BLKSNAP_CTL "blksnap-control"
+#define BLKSNAP_IMAGE_NAME "blksnap-image"
+#define BLKSNAP 'V'
+
+/**
+ * DOC: Block device filter interface.
+ *
+ * Control commands that are transmitted through the block device filter
+ * interface.
+ */
+
+/**
+ * enum blkfilter_ctl_blksnap - List of commands for BLKFILTER_CTL ioctl
+ *
+ * @blkfilter_ctl_blksnap_cbtinfo:
+ * Get CBT information.
+ * The result of executing the command is a &struct blksnap_cbtinfo.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_cbtmap:
+ * Read the CBT map.
+ * The option passes the &struct blksnap_cbtmap.
+ * The size of the table can be quite large. Thus, the table is read in
+ * a loop, in each cycle of which the next offset is set to
+ * &blksnap_tracker_read_cbt_bitmap.offset.
+ * Return a count of bytes read if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_cbtdirty:
+ * Set dirty blocks in the CBT map.
+ * The option passes the &struct blksnap_cbtdirty.
+ * There are cases when some blocks need to be marked as changed.
+ * This ioctl allows to do this.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_snapshotadd:
+ * Add device to snapshot.
+ * The option passes the &struct blksnap_snapshotadd.
+ * Return 0 if succeeded, negative errno otherwise.
+ * @blkfilter_ctl_blksnap_snapshotinfo:
+ * Get information about snapshot.
+ * The result of executing the command is a &struct blksnap_snapshotinfo.
+ * Return 0 if succeeded, negative errno otherwise.
+ */
+enum blkfilter_ctl_blksnap {
+ blkfilter_ctl_blksnap_cbtinfo,
+ blkfilter_ctl_blksnap_cbtmap,
+ blkfilter_ctl_blksnap_cbtdirty,
+ blkfilter_ctl_blksnap_snapshotadd,
+ blkfilter_ctl_blksnap_snapshotinfo,
+};
+
+#ifndef UUID_SIZE
+#define UUID_SIZE 16
+#endif
+
+/**
+ * struct blksnap_uuid - Unique 16-byte identifier.
+ *
+ * @b:
+ * An array of 16 bytes.
+ */
+struct blksnap_uuid {
+ __u8 b[UUID_SIZE];
+};
+
+/**
+ * struct blksnap_cbtinfo - Result for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtinfo.
+ *
+ * @device_capacity:
+ * Device capacity in bytes.
+ * @block_size:
+ * Block size in bytes.
+ * @block_count:
+ * Number of blocks.
+ * @generation_id:
+ * Unique identifier of change tracking generation.
+ * @changes_number:
+ * Current changes number.
+ */
+struct blksnap_cbtinfo {
+ __u64 device_capacity;
+ __u32 block_size;
+ __u32 block_count;
+ struct blksnap_uuid generation_id;
+ __u8 changes_number;
+};
+
+/**
+ * struct blksnap_cbtmap - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtmap.
+ *
+ * @offset:
+ * Offset from the beginning of the CBT bitmap in bytes.
+ * @length:
+ * Size of @buff in bytes.
+ * @buffer:
+ * Pointer to the buffer for output.
+ */
+struct blksnap_cbtmap {
+ __u32 offset;
+ __u32 length;
+ __u64 buffer;
+};
+
+/**
+ * struct blksnap_sectors - Description of the block device region.
+ *
+ * @offset:
+ * Offset from the beginning of the disk in sectors.
+ * @count:
+ * Count of sectors.
+ */
+struct blksnap_sectors {
+ __u64 offset;
+ __u64 count;
+};
+
+/**
+ * struct blksnap_cbtdirty - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_cbtdirty.
+ *
+ * @count:
+ * Count of elements in the @dirty_sectors.
+ * @dirty_sectors:
+ * Pointer to the array of &struct blksnap_sectors.
+ */
+struct blksnap_cbtdirty {
+ __u32 count;
+ __u64 dirty_sectors;
+};
+
+/**
+ * struct blksnap_snapshotadd - Option for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotadd.
+ *
+ * @id:
+ * ID of the snapshot to which the block device should be added.
+ */
+struct blksnap_snapshotadd {
+ struct blksnap_uuid id;
+};
+
+#define IMAGE_DISK_NAME_LEN 32
+
+/**
+ * struct blksnap_snapshotinfo - Result for the command
+ * &blkfilter_ctl_blksnap.blkfilter_ctl_blksnap_snapshotinfo.
+ *
+ * @error_code:
+ * Zero if there were no errors while holding the snapshot.
+ * The error code -ENOSPC means that while holding the snapshot, a snapshot
+ * overflow situation has occurred. Other error codes mean other reasons
+ * for failure.
+ * The error code is reset when the device is added to a new snapshot.
+ * @image:
+ * If the snapshot was taken, it stores the block device name of the
+ * image, or empty string otherwise.
+ */
+struct blksnap_snapshotinfo {
+ __s32 error_code;
+ __u8 image[IMAGE_DISK_NAME_LEN];
+};
+
+/**
+ * DOC: Interface for managing snapshots
+ *
+ * Control commands that are transmitted through the blksnap module interface.
+ */
+enum blksnap_ioctl {
+ blksnap_ioctl_version,
+ blksnap_ioctl_snapshot_create,
+ blksnap_ioctl_snapshot_destroy,
+ blksnap_ioctl_snapshot_take,
+ blksnap_ioctl_snapshot_collect,
+ blksnap_ioctl_snapshot_wait_event,
+};
+
+/**
+ * struct blksnap_version - Module version.
+ *
+ * @major:
+ * Version major part.
+ * @minor:
+ * Version minor part.
+ * @revision:
+ * Revision number.
+ * @build:
+ * Build number. Should be zero.
+ */
+struct blksnap_version {
+ __u16 major;
+ __u16 minor;
+ __u16 revision;
+ __u16 build;
+};
+
+/**
+ * define IOCTL_BLKSNAP_VERSION - Get module version.
+ *
+ * The version may increase when the API changes. But linking the user space
+ * behavior to the version code does not seem to be a good idea.
+ * To ensure backward compatibility, API changes should be made by adding new
+ * ioctl without changing the behavior of existing ones. The version should be
+ * used for logs.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_VERSION \
+ _IOR(BLKSNAP, blksnap_ioctl_version, struct blksnap_version)
+
+/**
+ * struct blksnap_snapshot_create - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_CREATE control.
+ *
+ * @diff_storage_limit_sect:
+ * The maximum allowed difference storage size in sectors.
+ * @diff_storage_fd:
+ * The difference storage file descriptor.
+ * @id:
+ * Generated new snapshot ID.
+ */
+struct blksnap_snapshot_create {
+ __u64 diff_storage_limit_sect;
+ __u32 diff_storage_fd;
+ struct blksnap_uuid id;
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_CREATE - Create snapshot.
+ *
+ * Creates a snapshot structure and initializes the difference storage.
+ * A snapshot is created for several block devices at once. Several snapshots
+ * can be created at the same time, but with the condition that one block
+ * device can only be included in one snapshot.
+ *
+ * The difference storage can be dynamically increase as it fills up.
+ * The file is increased in portions, the size of which is determined by the
+ * module parameter &diff_storage_minimum. Each time the amount of free space
+ * in the difference storage is reduced to the half of &diff_storage_minimum,
+ * the file is expanded by a portion, until it reaches the allowable limit
+ * &diff_storage_limit_sect.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_CREATE \
+ _IOWR(BLKSNAP, blksnap_ioctl_snapshot_create, \
+ struct blksnap_snapshot_create)
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_DESTROY - Release and destroy the snapshot.
+ *
+ * Destroys snapshot with &blksnap_snapshot_destroy.id. This leads to the
+ * deletion of all block device images of the snapshot. The difference storage
+ * is being released. But the change tracker keeps tracking.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_DESTROY \
+ _IOW(BLKSNAP, blksnap_ioctl_snapshot_destroy, \
+ struct blksnap_uuid)
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_TAKE - Take snapshot.
+ *
+ * Creates snapshot images of block devices and switches change trackers tables.
+ * The snapshot must be created before this call, and the areas of block
+ * devices should be added to the difference storage.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_TAKE \
+ _IOW(BLKSNAP, blksnap_ioctl_snapshot_take, \
+ struct blksnap_uuid)
+
+/**
+ * struct blksnap_snapshot_collect - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_COLLECT control.
+ *
+ * @count:
+ * Size of &blksnap_snapshot_collect.ids in the number of 16-byte UUID.
+ * @ids:
+ * Pointer to the array of struct blksnap_uuid for output.
+ */
+struct blksnap_snapshot_collect {
+ __u32 count;
+ __u64 ids;
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_COLLECT - Get collection of created snapshots.
+ *
+ * Multiple snapshots can be created at the same time. This allows for one
+ * system to create backups for different data with a independent schedules.
+ *
+ * If in &blksnap_snapshot_collect.count is less than required to store the
+ * &blksnap_snapshot_collect.ids, the array is not filled, and the ioctl
+ * returns the required count for &blksnap_snapshot_collect.ids.
+ *
+ * So, it is recommended to call the ioctl twice. The first call with an null
+ * pointer &blksnap_snapshot_collect.ids and a zero value in
+ * &blksnap_snapshot_collect.count. It will set the required array size in
+ * &blksnap_snapshot_collect.count. The second call with a pointer
+ * &blksnap_snapshot_collect.ids to an array of the required size will allow to
+ * get collection of active snapshots.
+ *
+ * Return: 0 if succeeded, -ENODATA if there is not enough space in the array
+ * to store collection of active snapshots, or negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_COLLECT \
+ _IOR(BLKSNAP, blksnap_ioctl_snapshot_collect, \
+ struct blksnap_snapshot_collect)
+
+/**
+ * enum blksnap_event_codes - Variants of event codes.
+ *
+ * @blksnap_event_code_corrupted:
+ * Snapshot image is corrupted event.
+ * If a chunk could not be allocated when trying to save data to the
+ * difference storage, this event is generated. However, this does not mean
+ * that the backup process was interrupted with an error. If the snapshot
+ * image has been read to the end by this time, the backup process is
+ * considered successful.
+ */
+enum blksnap_event_codes {
+ blksnap_event_code_corrupted,
+};
+
+/**
+ * struct blksnap_snapshot_event - Argument for the
+ * &IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT control.
+ *
+ * @id:
+ * Snapshot ID.
+ * @timeout_ms:
+ * Timeout for waiting in milliseconds.
+ * @time_label:
+ * Timestamp of the received event.
+ * @code:
+ * Code of the received event &enum blksnap_event_codes.
+ * @data:
+ * The received event body.
+ */
+struct blksnap_snapshot_event {
+ struct blksnap_uuid id;
+ __u32 timeout_ms;
+ __u32 code;
+ __s64 time_label;
+ __u8 data[4096 - 32];
+};
+
+/**
+ * define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT - Wait and get the event from the
+ * snapshot.
+ *
+ * While holding the snapshot, the kernel module can transmit information about
+ * changes in its state in the form of events to the user level.
+ * It is very important to receive these events as quickly as possible, so the
+ * user's thread is in the state of interruptible sleep.
+ *
+ * Return: 0 if succeeded, negative errno otherwise.
+ */
+#define IOCTL_BLKSNAP_SNAPSHOT_WAIT_EVENT \
+ _IOR(BLKSNAP, blksnap_ioctl_snapshot_wait_event, \
+ struct blksnap_snapshot_event)
+
+/**
+ * struct blksnap_event_corrupted - Data for the
+ * &blksnap_event_code_corrupted event.
+ *
+ * @dev_id_mj:
+ * Major part of original device ID.
+ * @dev_id_mn:
+ * Minor part of original device ID.
+ * @err_code:
+ * Error code.
+ */
+struct blksnap_event_corrupted {
+ __u32 dev_id_mj;
+ __u32 dev_id_mn;
+ __s32 err_code;
+};
+
+#endif /* _UAPI_LINUX_BLKSNAP_H */
--
2.20.1
^ permalink raw reply related [flat|nested] 9+ messages in thread