From mboxrd@z Thu Jan 1 00:00:00 1970
From: Beata Michalska
Subject: [RFC v3 0/4] fs: Add generic file system event notifications
Date: Tue, 16 Jun 2015 15:09:29 +0200
Message-ID: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
Return-path:
Sender: owner-linux-mm@kvack.org
To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org
Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org
List-Id: linux-api@vger.kernel.org

Hi All,

First of all, apologies for the delay: illness ruled out my plans for
having this out for a review as intended.

Anyway this is an updated version of the patchset for generic filesystem
events interface [1][2], hopefully handling issues raised during
the previous run.

Changes from v2:
 - Switched to kref for reference counting
 - Support for the events has been made optional (config option)
 - Use dynamically assigned id for multicast group instead of using static one
 - Verify if there are any net listeners prior to sending the msg
 - Make the interface more namespace-aware (handling mount dropped and
   showing the content of config file). As for the network namespaces - as
   before only the init net namespace is being supported.

Changes from v1:
 - Improved synchronization: switched to RCU accompanied with ref counting mechanism
 - Limiting scope of supported event types along with default event codes
 - Slightly modified configuration (event types followed by arguments where required)
 - Updated documentation
 - Unified naming for netlink attributes
 - Updated netlink message format to include dev minor:major numbers despite the filesystem type
 - Switched to single cmd id for messages
 - Removed the per-config-entry ids

---
[1] https://lkml.org/lkml/2015/4/15/46
[2] https://lkml.org/lkml/2015/4/27/244
---

Beata Michalska (4):
  fs: Add generic file system event notifications
  ext4: Add helper function to mark group as corrupted
  ext4: Add support for generic FS events
  shmem: Add support for generic FS events

 Documentation/filesystems/events.txt |  232 ++++++++++
 fs/Kconfig                           |    2 +
 fs/Makefile                          |    1 +
 fs/events/Kconfig                    |    7 +
 fs/events/Makefile                   |    5 +
 fs/events/fs_event.c                 |  809 ++++++++++++++++++++++++++++++++++
 fs/events/fs_event.h                 |   22 +
 fs/events/fs_event_netlink.c         |  104 +++++
 fs/ext4/balloc.c                     |   25 +-
 fs/ext4/ext4.h                       |   10 +
 fs/ext4/ialloc.c                     |    5 +-
 fs/ext4/inode.c                      |    2 +-
 fs/ext4/mballoc.c                    |   17 +-
 fs/ext4/resize.c                     |    1 +
 fs/ext4/super.c                      |   39 ++
 fs/namespace.c                       |    1 +
 include/linux/fs.h                   |    6 +-
 include/linux/fs_event.h             |   72 +++
 include/uapi/linux/Kbuild            |    1 +
 include/uapi/linux/fs_event.h        |   58 +++
 mm/shmem.c                           |   33 +-
 21 files changed, 1419 insertions(+), 33 deletions(-)
 create mode 100644 Documentation/filesystems/events.txt
 create mode 100644 fs/events/Kconfig
 create mode 100644 fs/events/Makefile
 create mode 100644 fs/events/fs_event.c
 create mode 100644 fs/events/fs_event.h
 create mode 100644 fs/events/fs_event_netlink.c
 create mode 100644 include/linux/fs_event.h
 create mode 100644 include/uapi/linux/fs_event.h

--
1.7.9.5

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Beata Michalska
Subject: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Tue, 16 Jun 2015 15:09:30 +0200
Message-ID: <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
Return-path:
In-reply-to: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
Sender: owner-linux-mm@kvack.org
To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org
Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org
List-Id: linux-api@vger.kernel.org

Introduce configurable generic interface for file
system-wide event notifications, to provide file
systems with a common way of reporting any potential
issues as they emerge.

The notifications are to be issued through generic
netlink interface by newly introduced multicast group.

Threshold notifications have been included, allowing
triggering an event whenever the amount of free space drops
below a certain level - or levels to be more precise as two
of them are being supported: the lower and the upper range.
The notifications work both ways: once the threshold level
has been reached, an event shall be generated whenever
the number of available blocks goes up again re-activating
the threshold.

The interface has been exposed through a vfs. Once mounted,
it serves as an entry point for the set-up where one can
register for particular file system events.

Signed-off-by: Beata Michalska
---
 Documentation/filesystems/events.txt |  232 ++++++++++
 fs/Kconfig                           |    2 +
 fs/Makefile                          |    1 +
 fs/events/Kconfig                    |    7 +
 fs/events/Makefile                   |    5 +
 fs/events/fs_event.c                 |  809 ++++++++++++++++++++++++++++++++++
 fs/events/fs_event.h                 |   22 +
 fs/events/fs_event_netlink.c         |  104 +++++
 fs/namespace.c                       |    1 +
 include/linux/fs.h                   |    6 +-
 include/linux/fs_event.h             |   72 +++
 include/uapi/linux/Kbuild            |    1 +
 include/uapi/linux/fs_event.h        |   58 +++
 13 files changed, 1319 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/filesystems/events.txt
 create mode 100644 fs/events/Kconfig
 create mode 100644 fs/events/Makefile
 create mode 100644 fs/events/fs_event.c
 create mode 100644 fs/events/fs_event.h
 create mode 100644 fs/events/fs_event_netlink.c
 create mode 100644 include/linux/fs_event.h
 create mode 100644 include/uapi/linux/fs_event.h

diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt
new file mode 100644
index 0000000..c2e6227
--- /dev/null
+++ b/Documentation/filesystems/events.txt
@@ -0,0 +1,232 @@
+
+	Generic file system event notification interface
+
+Document created 23 April 2015 by Beata Michalska
+
+1. The reason behind:
+=====================
+
+There are many corner cases when things might get messy with the filesystems.
+And it is not always obvious what went wrong and when. Sometimes you might
+get some subtle hints that there is something going on - but by the time
+you realise it, it might be too late as you are already out-of-space
+or the filesystem has been remounted as read-only (for example). The generic
+interface for the filesystem events fills the gap by providing a rather
+easy way of real-time notifications triggered whenever something interesting
+happens, allowing filesystems to report events in a common way, as they occur.
+
+2. How does it work:
+====================
+
+The interface itself has been exposed as fstrace-type Virtual File System,
+primarily to ease the process of setting up the configuration for the
+notifications. So for starters, it needs to get mounted (obviously):
+
+	mount -t fstrace none /sys/fs/events
+
+This will unveil the single fstrace filesystem entry - the 'config' file,
+through which the notifications are being set up.
+
+Activating notifications for particular filesystem is as straightforward
+as writing into the 'config' file. Note that by default all events, regardless
+of the actual filesystem type, are being disregarded.
+
+Synopsis of config:
+-------------------
+
+	MOUNT EVENT_TYPE [L1] [L2]
+
+ MOUNT      : the filesystem's mount point
+ EVENT_TYPE : event types - currently two of them are being supported:
+
+	      * generic events ("G") covering most common warnings
+	      and errors that might be reported by any filesystem;
+	      this option does not take any arguments;
+
+	      * threshold notifications ("T") - events sent whenever
+	      the amount of available space drops below certain level;
+	      it is possible to specify two threshold levels though
+	      only one is required to properly set up the notifications;
+	      as those refer to the number of available blocks, the lower
+	      level [L1] needs to be higher than the upper one [L2]
+
+Sample request could look like the following:
+
+ echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
+
+Multiple requests may be specified, provided they are separated with semicolons.
+
+The configuration itself might be modified at any time. One can add/remove
+particular event types for given filesystem, modify the threshold levels,
+and remove single or all entries from the 'config' file.
+
+ - Adding new event type:
+
+ $ echo MOUNT EVENT_TYPE > /sys/fs/events/config
+
+(Note that it is enough to provide the event type to be enabled without
+the already set ones.)
+
+ - Removing event type:
+
+ $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config
+
+ - Updating threshold limits:
+
+ $ echo MOUNT T L1 L2 > /sys/fs/events/config
+
+ - Removing single entry:
+
+ $ echo '!MOUNT' > /sys/fs/events/config
+
+ - Removing all entries:
+
+ $ echo > /sys/fs/events/config
+
+Reading the file will list all registered entries with their current set-up
+along with some additional info like the filesystem type and the backing device
+name if available.
+
+A final, though very important, note on the configuration: when and if the
+actual events are being triggered falls way beyond the scope of the generic
+filesystem events interface. It is up to a particular filesystem
+implementation which events are to be supported - if any at all. So if
+given filesystem does not support the event notifications, an attempt to
+enable those through 'config' file will fail.
+
+
+3. The generic netlink interface support:
+=========================================
+
+Whenever an event notification is triggered (by given filesystem) the current
+configuration is being validated to decide whether a userspace notification
+should be launched. If there has been no request (by means of a 'config' file
+entry) for given event, it will be silently disregarded. If, on the other
+hand, someone is 'watching' given filesystem for specific events, a generic
+netlink message will be sent. A dedicated multicast group has been provided
+solely for this purpose so in order to receive such notifications, one should
+subscribe to this new multicast group. For now only the init network
+namespace is being supported.
+
+3.1 Message format
+
+The FS_NL_C_EVENT shall be stored within the generic netlink message header
+as the command field. The message payload will provide more detailed info:
+the backing device major and minor numbers, the event code and the id of
+the process whose action led to the event occurrence. In case of threshold
+notifications, the current number of available blocks will be included
+in the payload as well.
+
+
+  0                   1                   2                   3
+  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |                    NETLINK MESSAGE HEADER                     |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |                GENERIC NETLINK MESSAGE HEADER                 |
+ |         (with FS_NL_C_EVENT as genlmsghdr cmd field)          |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |             Optional user specific message header             |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |                   GENERIC MESSAGE PAYLOAD:                    |
+ +---------------------------------------------------------------+
+ |                  FS_NL_A_EVENT_ID (NLA_U32)                   |
+ +---------------------------------------------------------------+
+ |                  FS_NL_A_DEV_MAJOR (NLA_U32)                  |
+ +---------------------------------------------------------------+
+ |                  FS_NL_A_DEV_MINOR (NLA_U32)                  |
+ +---------------------------------------------------------------+
+ |                  FS_NL_A_CAUSED_ID (NLA_U32)                  |
+ +---------------------------------------------------------------+
+ |                    FS_NL_A_DATA (NLA_U64)                     |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+The above figure is based on:
+ http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format
+
+
+4. API Reference:
+=================
+
+ 4.1 Generic file system event interface data & operations
+
+ #include
+
+ struct fs_trace_info {
+	void __rcu *e_priv;			/* READ ONLY */
+	unsigned int events_cap_mask;		/* Supported notifications */
+	const struct fs_trace_operations *ops;
+ };
+
+ struct fs_trace_operations {
+	void (*query)(struct super_block *, u64 *);
+ };
+
+ In order to get the fireworks and stuff, each filesystem needs to set up
+ the events_cap_mask field of the fs_trace_info structure, which has been
+ embedded within the super_block structure. This should reflect the type of
+ events the filesystem wants to support. In case of threshold notifications,
+ apart from setting the FS_EVENT_THRESH flag, the 'query' callback should
+ be provided as this enables the events interface to get the up-to-date
+ state of the number of available blocks whenever those notifications are
+ being requested.
+
+ The 'e_priv' field of the fs_trace_info structure should be completely ignored
+ as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you
+ do not want to get yourself into some real trouble. If still, you are tempted
+ to do so - feel free, it's gonna be pure fun. Consider yourself warned.
+
+
+ 4.2 Event notification:
+
+ #include
+ void fs_event_notify(struct super_block *sb, unsigned int event_id);
+
+ Notify the generic FS event interface of an occurring event.
+ This shall be used by any file system that wishes to inform any potential
+ listeners/watchers of a particular event.
+ - sb: the filesystem's super block
+ - event_id: an event identifier
+
+ 4.3 Threshold notifications:
+
+ #include
+ void fs_event_alloc_space(struct super_block *sb, u64 ncount);
+ void fs_event_free_space(struct super_block *sb, u64 ncount);
+
+ Each filesystem supporting the threshold notifications should call
+ fs_event_alloc_space/fs_event_free_space respectively whenever the
+ amount of available blocks changes.
+ - sb: the filesystem's super block
+ - ncount: number of blocks being acquired/released
+
+ Note that to properly handle the threshold notifications the fs events
+ interface needs to be kept up to date by the filesystems. Each should
+ register fs_trace_operations to enable querying the current number of
+ available blocks.
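
As an illustration of the calls described in 4.3, the bookkeeping they imply
can be modelled in plain userspace C. This is only a sketch under stated
assumptions: thr_model, thr_init, thr_alloc, thr_free and the THR_* codes are
hypothetical names that are not part of the patch, and locking, RCU and
netlink are left out entirely. It merely shows how each of the two levels is
meant to fire once when crossed and to re-arm once the number of available
blocks goes up again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event codes for this sketch; the patch defines its own. */
enum thr_event {
	THR_NONE,
	THR_LOWER_BELOW,	/* fell to or below L1 */
	THR_UPPER_BELOW,	/* fell to or below L2 */
	THR_LOWER_ABOVE,	/* back above L1 */
	THR_UPPER_ABOVE,	/* back above L2 */
};

/* Hypothetical userspace model of one watched filesystem. */
struct thr_model {
	uint64_t avail;		/* available blocks as seen by the interface */
	uint64_t lower;		/* L1: the lower range (the higher number) */
	uint64_t upper;		/* L2: the upper range (the lower number) */
	bool lower_hit;		/* L1 event sent and not yet re-armed */
	bool upper_hit;		/* L2 event sent and not yet re-armed */
};

void thr_init(struct thr_model *m, uint64_t avail,
	      uint64_t lower, uint64_t upper)
{
	/* The levels count available blocks, so L1 must not be below L2. */
	assert(lower >= upper);
	m->avail = avail;
	m->lower = lower;
	m->upper = upper;
	m->lower_hit = false;
	m->upper_hit = false;
}

/* Blocks have been allocated: the available space shrinks. */
enum thr_event thr_alloc(struct thr_model *m, uint64_t n)
{
	/* Saturate at zero instead of wrapping around. */
	m->avail = (n > m->avail) ? 0 : m->avail - n;

	if (m->avail > m->lower)
		return THR_NONE;		/* not even close */

	if (m->avail > m->upper) {
		if (m->lower_hit)
			return THR_NONE;	/* already reported */
		m->lower_hit = true;
		return THR_LOWER_BELOW;
	}

	if (m->upper_hit)
		return THR_NONE;
	m->upper_hit = true;
	return THR_UPPER_BELOW;
}

/* Blocks have been released: the available space grows. */
enum thr_event thr_free(struct thr_model *m, uint64_t n)
{
	m->avail += n;

	if (m->avail > m->lower && m->lower_hit) {
		/* Back above L1: both levels are armed again. */
		m->lower_hit = false;
		m->upper_hit = false;
		return THR_LOWER_ABOVE;
	}

	if (m->avail > m->upper && m->upper_hit) {
		m->upper_hit = false;
		return THR_UPPER_ABOVE;
	}

	return THR_NONE;
}
```

In this model, a filesystem starting with 1000 available blocks and levels
700/500 gets THR_LOWER_BELOW when allocations bring it down to 600, nothing
for further allocations that stay above 500, and THR_UPPER_BELOW once it
reaches 450. Releasing blocks reports the matching *_ABOVE codes in reverse
order, after which the levels can fire again.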
+
+ 4.4 Sending message through generic netlink interface
+
+ #include
+
+ int fs_netlink_send_event(size_t size, unsigned int event_id,
+	int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata);
+
+ Although the fs event interface is fully responsible for sending the messages
+ over the netlink, filesystems might use the FS_EVENT multicast group to send
+ their own custom messages.
+ - size: the size of the message payload
+ - event_id: the event identifier
+ - compose_msg: a callback responsible for filling in the message payload
+ - cbdata: message custom data
+
+ Calling fs_netlink_send_event will result in a message being sent through
+ the FS_EVENT multicast group. Note that the body of the message should be
+ prepared (set up) by the caller - through compose_msg callback. The message's
+ sk_buff will be allocated on behalf of the caller (thus the size parameter).
+ The compose_msg should only fill the payload with proper data. Unless
+ the event id is specified as FS_EVENT_NONE, its value shall be added
+ to the payload prior to calling the compose_msg.
+
+
diff --git a/fs/Kconfig b/fs/Kconfig
index ec35851..a89e678 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -69,6 +69,8 @@ config FILE_LOCKING
 	  for filesystems like NFS and for the flock() system call.
 	  Disabling this option saves about 11k.
+source "fs/events/Kconfig"
+
 source "fs/notify/Kconfig"

 source "fs/quota/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index a88ac48..bcb3048 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -126,3 +126,4 @@ obj-y				+= exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS)		+= ceph/
 obj-$(CONFIG_PSTORE)		+= pstore/
 obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
+obj-$(CONFIG_FS_EVENTS)		+= events/
diff --git a/fs/events/Kconfig b/fs/events/Kconfig
new file mode 100644
index 0000000..1c60195
--- /dev/null
+++ b/fs/events/Kconfig
@@ -0,0 +1,7 @@
+# Generic File System events interface
+config FS_EVENTS
+	bool "Generic filesystem events"
+	select NET
+	default y
+	help
+	  Enable generic filesystem events interface
diff --git a/fs/events/Makefile b/fs/events/Makefile
new file mode 100644
index 0000000..9c98337
--- /dev/null
+++ b/fs/events/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the Linux Generic File System Event Interface
+#
+
+obj-y := fs_event.o fs_event_netlink.o
diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c
new file mode 100644
index 0000000..1037311
--- /dev/null
+++ b/fs/events/fs_event.c
@@ -0,0 +1,809 @@
+/*
+ * Generic File System Events Interface
+ *
+ * Copyright(c) 2015 Samsung Electronics. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "../pnode.h" +#include "fs_event.h" + +static LIST_HEAD(fs_trace_list); +static DEFINE_MUTEX(fs_trace_lock); + +static struct kmem_cache *fs_trace_cachep __read_mostly; + +static atomic_t stray_traces = ATOMIC_INIT(0); +static DECLARE_WAIT_QUEUE_HEAD(trace_wq); +/* + * Threshold notification state bits. + * Note the reverse as this refers to the number + * of available blocks. + */ +#define THRESH_LR_BELOW 0x0001 /* Falling below the lower range */ +#define THRESH_LR_BEYOND 0x0002 +#define THRESH_UR_BELOW 0x0004 +#define THRESH_UR_BEYOND 0x0008 /* Going beyond the upper range */ + +#define THRESH_LR_ON (THRESH_LR_BELOW | THRESH_LR_BEYOND) +#define THRESH_UR_ON (THRESH_UR_BELOW | THRESH_UR_BEYOND) + +#define FS_TRACE_ADD 0x100000 + +struct fs_trace_entry { + struct kref count; + atomic_t active; + struct super_block *sb; + unsigned int notify; + struct path mnt_path; + struct list_head node; + + struct fs_event_thresh { + u64 avail_space; + u64 lrange; + u64 urange; + unsigned int state; + } th; + struct rcu_head rcu_head; + spinlock_t lock; +}; + +static const match_table_t fs_etypes = { + { FS_EVENT_GENERIC, "G" }, + { FS_EVENT_THRESH, "T" }, + { 0, NULL }, +}; + +static inline int fs_trace_query_data(struct super_block *sb, + struct fs_trace_entry *en) +{ + if (sb->s_etrace.ops && sb->s_etrace.ops->query) { + sb->s_etrace.ops->query(sb, &en->th.avail_space); + return 0; + } + + return -EINVAL; +} + +static inline void fs_trace_entry_free(struct fs_trace_entry *en) +{ + kmem_cache_free(fs_trace_cachep, en); +} + +static void fs_destroy_trace_entry(struct kref *en_ref) +{ + struct fs_trace_entry *en = container_of(en_ref, + struct fs_trace_entry, count); + + /* Last reference has been dropped */ + fs_trace_entry_free(en); + atomic_dec(&stray_traces); +} + +static void fs_trace_entry_put(struct fs_trace_entry *en) +{ + kref_put(&en->count, 
fs_destroy_trace_entry); +} + +static void fs_release_trace_entry(struct rcu_head *rcu_head) +{ + struct fs_trace_entry *en = container_of(rcu_head, + struct fs_trace_entry, + rcu_head); + /* + * As opposed to typical reference drop, this one is being + * called from the rcu callback. This is to make sure all + * readers have managed to safely grab the reference before + * the change to rcu pointer is visible to all and before + * the reference is dropped here. + */ + fs_trace_entry_put(en); +} + +static void fs_drop_trace_entry(struct fs_trace_entry *en) +{ + struct super_block *sb; + + lockdep_assert_held(&fs_trace_lock); + /* + * The trace entry might have already been removed + * from the list of active traces with the proper + * ref drop, though it was still in use handling + * one of the fs events. This means that the object + * has been already scheduled for being released. + * So leave... + */ + + if (!atomic_add_unless(&en->active, -1, 0)) + return; + /* + * At this point the trace entry is being marked as inactive + * so no new references will be allowed. + * Still it might be floating around somewhere + * so drop the reference when the rcu readers are done. 
+ */ + spin_lock(&en->lock); + list_del(&en->node); + sb = en->sb; + en->sb = NULL; + spin_unlock(&en->lock); + + rcu_assign_pointer(sb->s_etrace.e_priv, NULL); + call_rcu(&en->rcu_head, fs_release_trace_entry); + /* It's safe now to drop the reference to the super */ + deactivate_super(sb); + atomic_inc(&stray_traces); +} + +static inline +struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en) +{ + if (en) { + if (!kref_get_unless_zero(&en->count)) + return NULL; + /* Don't allow referencing inactive object */ + if (!atomic_read(&en->active)) { + fs_trace_entry_put(en); + return NULL; + } + } + return en; +} + +static struct fs_trace_entry *fs_trace_entry_get_rcu(struct super_block *sb) +{ + struct fs_trace_entry *en; + + if (!sb) + return NULL; + + rcu_read_lock(); + en = rcu_dereference(sb->s_etrace.e_priv); + en = fs_trace_entry_get(en); + rcu_read_unlock(); + + return en; +} + +static int fs_remove_trace_entry(struct super_block *sb) +{ + struct fs_trace_entry *en; + + en = fs_trace_entry_get_rcu(sb); + if (!en) + return -EINVAL; + + mutex_lock(&fs_trace_lock); + fs_drop_trace_entry(en); + mutex_unlock(&fs_trace_lock); + fs_trace_entry_put(en); + return 0; +} + +static void fs_remove_all_traces(void) +{ + struct fs_trace_entry *en, *guard; + + mutex_lock(&fs_trace_lock); + list_for_each_entry_safe(en, guard, &fs_trace_list, node) + fs_drop_trace_entry(en); + mutex_unlock(&fs_trace_lock); +} + +static int create_common_msg(struct sk_buff *skb, void *data) +{ + struct fs_trace_entry *en = (struct fs_trace_entry *)data; + struct super_block *sb = en->sb; + + if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev)) + || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev))) + return -EINVAL; + + if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current)))) + return -EINVAL; + + return 0; +} + +static int create_thresh_msg(struct sk_buff *skb, void *data) +{ + struct fs_trace_entry *en = (struct fs_trace_entry *)data; + int ret; + + ret = 
create_common_msg(skb, data); + if (!ret) + ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space); + return ret; +} + +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id) +{ + size_t size = nla_total_size(sizeof(u32)) * 2 + + nla_total_size(sizeof(u64)); + + fs_netlink_send_event(size, event_id, create_common_msg, en); +} + +static void fs_event_send_thresh(struct fs_trace_entry *en, + unsigned int event_id) +{ + size_t size = nla_total_size(sizeof(u32)) * 2 + + nla_total_size(sizeof(u64)) * 2; + + fs_netlink_send_event(size, event_id, create_thresh_msg, en); +} + +void fs_event_notify(struct super_block *sb, unsigned int event_id) +{ + struct fs_trace_entry *en; + + en = fs_trace_entry_get_rcu(sb); + if (!en) + return; + + spin_lock(&en->lock); + if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC)) + fs_event_send(en, event_id); + spin_unlock(&en->lock); + fs_trace_entry_put(en); +} +EXPORT_SYMBOL(fs_event_notify); + +void fs_event_alloc_space(struct super_block *sb, u64 ncount) +{ + struct fs_trace_entry *en; + s64 count; + + en = fs_trace_entry_get_rcu(sb); + if (!en) + return; + + spin_lock(&en->lock); + + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) + goto leave; + /* + * we shouldn't drop below 0 here, + * unless there is a sync issue somewhere (?) + */ + count = en->th.avail_space - ncount; + en->th.avail_space = count < 0 ? 
0 : count; + + if (en->th.avail_space > en->th.lrange) + /* Not 'even' close - leave */ + goto leave; + + if (en->th.avail_space > en->th.urange) { + /* Close enough - the lower range has been reached */ + if (!(en->th.state & THRESH_LR_BEYOND)) { + /* Send notification */ + fs_event_send_thresh(en, FS_THR_LRBELOW); + en->th.state &= ~THRESH_LR_BELOW; + en->th.state |= THRESH_LR_BEYOND; + } + goto leave; + } + if (!(en->th.state & THRESH_UR_BEYOND)) { + fs_event_send_thresh(en, FS_THR_URBELOW); + en->th.state &= ~THRESH_UR_BELOW; + en->th.state |= THRESH_UR_BEYOND; + } + +leave: + spin_unlock(&en->lock); + fs_trace_entry_put(en); +} +EXPORT_SYMBOL(fs_event_alloc_space); + +void fs_event_free_space(struct super_block *sb, u64 ncount) +{ + struct fs_trace_entry *en; + + en = fs_trace_entry_get_rcu(sb); + if (!en) + return; + + spin_lock(&en->lock); + + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) + goto leave; + + en->th.avail_space += ncount; + + if (en->th.avail_space > en->th.lrange) { + if (!(en->th.state & THRESH_LR_BELOW) + && en->th.state & THRESH_LR_BEYOND) { + /* Send notification */ + fs_event_send_thresh(en, FS_THR_LRABOVE); + en->th.state &= ~(THRESH_LR_BEYOND|THRESH_UR_BEYOND); + en->th.state |= THRESH_LR_BELOW; + goto leave; + } + } + if (en->th.avail_space > en->th.urange) { + if (!(en->th.state & THRESH_UR_BELOW) + && en->th.state & THRESH_UR_BEYOND) { + /* Notify */ + fs_event_send_thresh(en, FS_THR_URABOVE); + en->th.state &= ~THRESH_UR_BEYOND; + en->th.state |= THRESH_UR_BELOW; + } + } +leave: + spin_unlock(&en->lock); + fs_trace_entry_put(en); +} +EXPORT_SYMBOL(fs_event_free_space); + +void fs_event_mount_dropped(struct vfsmount *mnt) +{ + /* + * The mount is dropped but the super might not get released + * at once so there is very small chance some notifications + * will come through. + * Note that the mount being dropped here might belong to a different + * namespace - if this is the case, just ignore it. 
+ */ + struct fs_trace_entry *en = fs_trace_entry_get_rcu(mnt->mnt_sb); + struct vfsmount *en_mnt; + + if (!en || !atomic_read(&en->active)) + return; + /* + * The entry once set, does not change the mountpoint it's being + * pinned to, so no need to take the lock here. + */ + en_mnt = en->mnt_path.mnt; + if (!(real_mount(mnt)->mnt_ns != (real_mount(en_mnt))->mnt_ns)) + fs_remove_trace_entry(mnt->mnt_sb); + fs_trace_entry_put(en); +} + +static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh, + unsigned int nmask) +{ + struct fs_trace_entry *en; + struct super_block *sb; + struct mount *r_mnt; + + en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL); + if (unlikely(!en)) + return -ENOMEM; + /* + * Note that no reference is being taken here for the path as it would + * make the unmount unnecessarily puzzling (due to an extra 'valid' + * reference for the mnt). + * This is *rather* safe as the notification on mount being dropped + * will get called prior to releasing the super block - so right + * in time to perform appropriate clean-up + */ + r_mnt = real_mount(path->mnt); + + en->mnt_path.dentry = r_mnt->mnt.mnt_root; + en->mnt_path.mnt = &r_mnt->mnt; + + sb = path->mnt->mnt_sb; + en->sb = sb; + /* + * Increase the refcount for sb to mark it's being relied on. + * Note that the reference to path is taken by the caller, so it + * is safe to assume there is at least single active reference + * to super as well. 
+ */ + atomic_inc(&sb->s_active); + + nmask &= sb->s_etrace.events_cap_mask; + if (!nmask) + goto leave; + + spin_lock_init(&en->lock); + INIT_LIST_HEAD(&en->node); + + en->notify = nmask; + memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state)); + if (nmask & FS_EVENT_THRESH) + fs_trace_query_data(sb, en); + + kref_init(&en->count); + + if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) { + struct fs_trace_entry *prev_en; + + prev_en = fs_trace_entry_get_rcu(sb); + if (prev_en) { + WARN_ON(prev_en); + fs_trace_entry_put(prev_en); + goto leave; + } + } + atomic_set(&en->active, 1); + + mutex_lock(&fs_trace_lock); + list_add(&en->node, &fs_trace_list); + mutex_unlock(&fs_trace_lock); + + rcu_assign_pointer(sb->s_etrace.e_priv, en); + synchronize_rcu(); + + return 0; +leave: + deactivate_super(sb); + kmem_cache_free(fs_trace_cachep, en); + return -EINVAL; +} + +static int fs_update_trace_entry(struct path *path, + struct fs_event_thresh *thresh, + unsigned int nmask) +{ + struct fs_trace_entry *en; + struct super_block *sb; + int extend = nmask & FS_TRACE_ADD; + int ret = -EINVAL; + + en = fs_trace_entry_get_rcu(path->mnt->mnt_sb); + if (!en) + return (extend) ? 
fs_new_trace_entry(path, thresh, nmask) + : -EINVAL; + + if (!atomic_read(&en->active)) + return -EINVAL; + + nmask &= ~FS_TRACE_ADD; + + spin_lock(&en->lock); + sb = en->sb; + if (!sb || !(nmask & sb->s_etrace.events_cap_mask)) + goto leave; + + if (nmask & FS_EVENT_THRESH) { + if (extend) { + /* Get the current state */ + if (!(en->notify & FS_EVENT_THRESH)) + if (fs_trace_query_data(sb, en)) + goto leave; + + if (thresh->state & THRESH_LR_ON) { + en->th.lrange = thresh->lrange; + en->th.state &= ~THRESH_LR_ON; + } + + if (thresh->state & THRESH_UR_ON) { + en->th.urange = thresh->urange; + en->th.state &= ~THRESH_UR_ON; + } + } else { + memset(&en->th, 0, sizeof(en->th)); + } + } + + if (extend) + en->notify |= nmask; + else + en->notify &= ~nmask; + ret = 0; +leave: + spin_unlock(&en->lock); + fs_trace_entry_put(en); + return ret; +} + +static int fs_parse_trace_request(int argc, char **argv) +{ + struct fs_event_thresh thresh = {0}; + struct path path; + substring_t args[MAX_OPT_ARGS]; + unsigned int nmask = FS_TRACE_ADD; + int token; + char *s; + int ret = -EINVAL; + + if (!argc) { + fs_remove_all_traces(); + return 0; + } + + s = *(argv); + if (*s == '!') { + /* Clear the trace entry */ + nmask &= ~FS_TRACE_ADD; + ++s; + } + + if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW)) + return -EINVAL; + + if (!(--argc)) { + if (!(nmask & FS_TRACE_ADD)) + ret = fs_remove_trace_entry(path.mnt->mnt_sb); + goto leave; + } + +repeat: + args[0].to = args[0].from = NULL; + token = match_token(*(++argv), fs_etypes, args); + if (!token && !nmask) + goto leave; + + nmask |= token & FS_EVENTS_ALL; + --argc; + if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) { + /* + * Get the threshold config data: + * lower range + * upper range + */ + if (!argc) + goto leave; + + ret = kstrtoull(*(++argv), 10, &thresh.lrange); + if (ret) + goto leave; + thresh.state |= THRESH_LR_ON; + if ((--argc)) { + ret = kstrtoull(*(++argv), 10, &thresh.urange); + if (ret) + goto leave; 
+ thresh.state |= THRESH_UR_ON; + --argc; + } + /* The thresholds are based on number of available blocks */ + if (thresh.lrange < thresh.urange) { + ret = -EINVAL; + goto leave; + } + } + if (argc) + goto repeat; + + ret = fs_update_trace_entry(&path, &thresh, nmask); +leave: + path_put(&path); + return ret; +} + +#define DEFAULT_BUF_SIZE PAGE_SIZE + +static ssize_t fs_trace_write(struct file *file, const char __user *buffer, + size_t count, loff_t *ppos) +{ + char **argv; + char *kern_buf, *next, *cfg; + size_t size, dcount = 0; + int argc; + + if (!count) + return 0; + + kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL); + if (!kern_buf) + return -ENOMEM; + + while (dcount < count) { + + size = count - dcount; + if (size >= DEFAULT_BUF_SIZE) + size = DEFAULT_BUF_SIZE - 1; + if (copy_from_user(kern_buf, buffer + dcount, size)) { + dcount = -EINVAL; + goto leave; + } + + kern_buf[size] = '\0'; + + next = cfg = kern_buf; + + do { + next = strchr(cfg, ';'); + if (next) + *next = '\0'; + + argv = argv_split(GFP_KERNEL, cfg, &argc); + if (!argv) { + dcount = -ENOMEM; + goto leave; + } + + if (fs_parse_trace_request(argc, argv)) { + dcount = -EINVAL; + argv_free(argv); + goto leave; + } + + argv_free(argv); + if (next) + cfg = ++next; + + } while (next); + dcount += size; + } +leave: + kfree(kern_buf); + return dcount; +} + +static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos) +{ + mutex_lock(&fs_trace_lock); + return seq_list_start(&fs_trace_list, *pos); +} + +static void *fs_trace_seq_next(struct seq_file *m, void *v, loff_t *pos) +{ + return seq_list_next(v, &fs_trace_list, pos); +} + +static void fs_trace_seq_stop(struct seq_file *m, void *v) +{ + mutex_unlock(&fs_trace_lock); +} + +static int fs_trace_seq_show(struct seq_file *m, void *v) +{ + struct fs_trace_entry *en; + struct super_block *sb; + struct mount *r_mnt; + const struct match_token *match; + unsigned int nmask; + + en = list_entry(v, struct fs_trace_entry, node); + /* Do not show the 
entries outside current mount namespace */ + r_mnt = real_mount(en->mnt_path.mnt); + if (r_mnt->mnt_ns != current->nsproxy->mnt_ns) { + if (!__is_local_mountpoint(r_mnt->mnt_mountpoint)) + return 0; + } + + sb = en->sb; + + seq_path(m, &en->mnt_path, "\t\n\\"); + seq_putc(m, ' '); + + seq_escape(m, sb->s_type->name, " \t\n\\"); + if (sb->s_subtype && sb->s_subtype[0]) { + seq_putc(m, '.'); + seq_escape(m, sb->s_subtype, " \t\n\\"); + } + + seq_putc(m, ' '); + if (sb->s_op->show_devname) { + sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root); + } else { + seq_escape(m, r_mnt->mnt_devname ? r_mnt->mnt_devname : "none", + " \t\n\\"); + } + seq_puts(m, " ("); + + nmask = en->notify; + for (match = fs_etypes; match->pattern; ++match) { + if (match->token & nmask) { + seq_puts(m, match->pattern); + nmask &= ~match->token; + if (nmask) + seq_putc(m, ','); + } + } + seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); + seq_puts(m, ")\n"); + return 0; +} + +static const struct seq_operations fs_trace_seq_ops = { + .start = fs_trace_seq_start, + .next = fs_trace_seq_next, + .stop = fs_trace_seq_stop, + .show = fs_trace_seq_show, +}; + +static int fs_trace_open(struct inode *inode, struct file *file) +{ + return seq_open(file, &fs_trace_seq_ops); +} + +static const struct file_operations fs_trace_fops = { + .owner = THIS_MODULE, + .open = fs_trace_open, + .write = fs_trace_write, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +static int fs_trace_init(void) +{ + fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0); + if (!fs_trace_cachep) + return -EINVAL; + init_waitqueue_head(&trace_wq); + return 0; +} + +/* VFS support */ +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen) +{ + int ret; + static struct tree_descr desc[] = { + [2] = { + .name = "config", + .ops = &fs_trace_fops, + .mode = S_IWUSR | S_IRUGO, + }, + {""}, + }; + + ret = simple_fill_super(sb, 0x7246332, desc); + return !ret ? 
fs_trace_init() : ret; +} + +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, + int ntype, const char *dev_name, void *data) +{ + return mount_single(fs_type, ntype, data, fs_trace_fill_super); +} + +static void fs_trace_kill_super(struct super_block *sb) +{ + /* + * The rcu_barrier here will/should make sure all call_rcu + * callbacks are completed - still there might be some active + * trace objects in use which can make calling the + * kmem_cache_destroy unsafe. So we wait until all traces + * are finally released. + */ + fs_remove_all_traces(); + rcu_barrier(); + wait_event(trace_wq, !atomic_read(&stray_traces)); + + kmem_cache_destroy(fs_trace_cachep); + kill_litter_super(sb); +} + +static struct kset *fs_trace_kset; + +static struct file_system_type fs_trace_fstype = { + .name = "fstrace", + .mount = fs_trace_do_mount, + .kill_sb = fs_trace_kill_super, +}; + +static void __init fs_trace_vfs_init(void) +{ + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); + + if (!fs_trace_kset) + return; + + if (!register_filesystem(&fs_trace_fstype)) { + if (!fs_event_netlink_register()) + return; + unregister_filesystem(&fs_trace_fstype); + } + kset_unregister(fs_trace_kset); +} + +static int __init fs_trace_evens_init(void) +{ + fs_trace_vfs_init(); + return 0; +}; +module_init(fs_trace_evens_init); + diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h new file mode 100644 index 0000000..23f24c8 --- /dev/null +++ b/fs/events/fs_event.h @@ -0,0 +1,22 @@ +/* + * Copyright(c) 2015 Samsung Electronics. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. 
+ * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ + +#ifndef __GENERIC_FS_EVENTS_H +#define __GENERIC_FS_EVENTS_H + +int fs_event_netlink_register(void); +void fs_event_netlink_unregister(void); + +#endif /* __GENERIC_FS_EVENTS_H */ diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c new file mode 100644 index 0000000..0c97eb7 --- /dev/null +++ b/fs/events/fs_event_netlink.c @@ -0,0 +1,104 @@ +/* + * Copyright(c) 2015 Samsung Electronics. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include "fs_event.h" + +static const struct genl_multicast_group fs_event_mcgroups[] = { + { .name = FS_EVENTS_MCAST_GRP_NAME, }, +}; + +static struct genl_family fs_event_family = { + .id = GENL_ID_GENERATE, + .name = FS_EVENTS_FAMILY_NAME, + .version = 1, + .maxattr = FS_NL_A_MAX, + .mcgrps = fs_event_mcgroups, + .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups), +}; + +int fs_netlink_send_event(size_t size, unsigned int event_id, + int (*compose_msg)(struct sk_buff *skb, void *data), + void *cbdata) +{ + static atomic_t seq; + struct sk_buff *skb; + void *msg_head; + int ret = 0; + + if (!size || !compose_msg) + return -EINVAL; + + /* Skip if there are no listeners */ + if (!genl_has_listeners(&fs_event_family, &init_net, 0)) + return 0; + + if (event_id != FS_EVENT_NONE) + size += nla_total_size(sizeof(u32)); + size += nla_total_size(sizeof(u64)); + skb = genlmsg_new(size, GFP_NOWAIT); + + if (!skb) { + pr_debug("Failed to allocate new FS generic netlink message\n"); + return -ENOMEM; + } + + msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq), + &fs_event_family, 0, FS_NL_C_EVENT); + if (!msg_head) + goto cleanup; + + if (event_id != FS_EVENT_NONE) + if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id)) + goto cancel; + + ret = compose_msg(skb, cbdata); + if (ret) + goto cancel; + + genlmsg_end(skb, msg_head); + ret = genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT); + if (ret && ret != -ENOBUFS && ret != -ESRCH) + goto cleanup; + + return ret; + +cancel: + genlmsg_cancel(skb, msg_head); +cleanup: + nlmsg_free(skb); + return ret; +} +EXPORT_SYMBOL(fs_netlink_send_event); + +int fs_event_netlink_register(void) +{ + int ret; + + ret = genl_register_family(&fs_event_family); + if (ret) + pr_err("Failed to register FS netlink interface\n"); + return ret; +} + +void fs_event_netlink_unregister(void) +{ + genl_unregister_family(&fs_event_family); +} diff --git a/fs/namespace.c 
b/fs/namespace.c index 82ef140..ec6e2ef 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt) if (unlikely(mnt->mnt_pins.first)) mnt_pin_kill(mnt); fsnotify_vfsmount_delete(&mnt->mnt); + fs_event_mount_dropped(&mnt->mnt); dput(mnt->mnt.mnt_root); deactivate_super(mnt->mnt.mnt_sb); mnt_free_id(mnt); diff --git a/include/linux/fs.h b/include/linux/fs.h index b4d71b5..b7dadd9 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -263,6 +263,10 @@ struct iattr { * Includes for diskquotas. */ #include +/* + * Include for Generic File System Events Interface + */ +#include /* * Maximum number of layers of fs stack. Needs to be limited to @@ -1253,7 +1257,7 @@ struct super_block { struct hlist_node s_instances; unsigned int s_quota_types; /* Bitmask of supported quota types */ struct quota_info s_dquot; /* Diskquota specific options */ - + struct fs_trace_info s_etrace; struct sb_writers s_writers; char s_id[32]; /* Informational name */ diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h new file mode 100644 index 0000000..83e22dd --- /dev/null +++ b/include/linux/fs_event.h @@ -0,0 +1,72 @@ +/* + * Generic File System Events Interface + * + * Copyright(c) 2015 Samsung Electronics. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ */ +#ifndef _LINUX_GENERIC_FS_EVENTS_ +#define _LINUX_GENERIC_FS_EVENTS_ +#include +#include + +/* + * Currently supported event types + */ +#define FS_EVENT_GENERIC 0x001 +#define FS_EVENT_THRESH 0x002 + +#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH) + +struct fs_trace_operations { + void (*query)(struct super_block *, u64 *); +}; + +struct fs_trace_info { + void __rcu *e_priv; /* READ ONLY */ + unsigned int events_cap_mask; /* Supported notifications */ + const struct fs_trace_operations *ops; +}; + +#ifdef CONFIG_FS_EVENTS + +void fs_event_notify(struct super_block *sb, unsigned int event_id); +void fs_event_alloc_space(struct super_block *sb, u64 ncount); +void fs_event_free_space(struct super_block *sb, u64 ncount); +void fs_event_mount_dropped(struct vfsmount *mnt); + +int fs_netlink_send_event(size_t size, unsigned int event_id, + int (*compose_msg)(struct sk_buff *skb, void *data), + void *cbdata); + +#else /* CONFIG_FS_EVENTS */ + +static inline +void fs_event_notify(struct super_block *sb, unsigned int event_id) {} +static inline +void fs_event_alloc_space(struct super_block *sb, u64 ncount) {} +static inline +void fs_event_free_space(struct super_block *sb, u64 ncount) {} +static inline +void fs_event_mount_dropped(struct vfsmount *mnt) {} + +static inline +int fs_netlink_send_event(size_t size, unsigned int event_id, + int (*compose_msg)(struct sk_buff *skb, void *data), + void *cbdata) +{ + return -ENOSYS; +} +#endif /* CONFIG_FS_EVENTS */ + +#endif /* _LINUX_GENERIC_FS_EVENTS_ */ + diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild index 68ceb97..dae0fab 100644 --- a/include/uapi/linux/Kbuild +++ b/include/uapi/linux/Kbuild @@ -129,6 +129,7 @@ header-y += firewire-constants.h header-y += flat.h header-y += fou.h header-y += fs.h +header-y += fs_event.h header-y += fsl_hypervisor.h header-y += fuse.h header-y += futex.h diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h new file mode 100644
index 0000000..d8b07da --- /dev/null +++ b/include/uapi/linux/fs_event.h @@ -0,0 +1,58 @@ +/* + * Generic netlink support for Generic File System Events Interface + * + * Copyright(c) 2015 Samsung Electronics. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ +#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_ +#define _UAPI_LINUX_GENERIC_FS_EVENTS_ + +#define FS_EVENTS_FAMILY_NAME "fs_event" +#define FS_EVENTS_MCAST_GRP_NAME "fs_event_mc_grp" + +/* + * Generic netlink attribute types + */ +enum { + FS_NL_A_NONE, + FS_NL_A_EVENT_ID, + FS_NL_A_DEV_MAJOR, + FS_NL_A_DEV_MINOR, + FS_NL_A_CAUSED_ID, + FS_NL_A_DATA, + __FS_NL_A_MAX, +}; +#define FS_NL_A_MAX (__FS_NL_A_MAX - 1) +/* + * Generic netlink commands + */ +#define FS_NL_C_EVENT 1 + +/* + * Supported set of FS events + */ +enum { + FS_EVENT_NONE, + FS_WARN_ENOSPC, /* No space left to reserve data blks */ + FS_WARN_ENOSPC_META, /* No space left for metadata */ + FS_THR_LRBELOW, /* The threshold lower range has been reached */ + FS_THR_LRABOVE, /* The threshold lower range re-activated */ + FS_THR_URBELOW, + FS_THR_URABOVE, + FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */ + FS_ERR_CORRUPTED /* Critical error - fs corrupted */ + +}; + +#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */ + -- 1.7.9.5
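For anyone reading this from the user-space side: the value carried in FS_NL_A_EVENT_ID is simply a position in the "Supported set of FS events" enum above. A minimal sketch of turning such an ID into a printable label could look like the following. Only the enum is copied from the header as posted; the helper fs_event_name() and the label strings are assumptions made up for this illustration and are not part of the series.

```c
/* Copied from include/uapi/linux/fs_event.h as posted in this series. */
enum {
	FS_EVENT_NONE,
	FS_WARN_ENOSPC,
	FS_WARN_ENOSPC_META,
	FS_THR_LRBELOW,
	FS_THR_LRABOVE,
	FS_THR_URBELOW,
	FS_THR_URABOVE,
	FS_ERR_REMOUNT_RO,
	FS_ERR_CORRUPTED
};

/*
 * Hypothetical helper (not in the patch): map an event ID to a label.
 * The label strings are invented for this example.
 */
static const char *fs_event_name(unsigned int id)
{
	static const char *const names[] = {
		[FS_EVENT_NONE]       = "none",
		[FS_WARN_ENOSPC]      = "enospc",
		[FS_WARN_ENOSPC_META] = "enospc-meta",
		[FS_THR_LRBELOW]      = "lower-range-below",
		[FS_THR_LRABOVE]      = "lower-range-above",
		[FS_THR_URBELOW]      = "upper-range-below",
		[FS_THR_URABOVE]      = "upper-range-above",
		[FS_ERR_REMOUNT_RO]   = "remount-ro",
		[FS_ERR_CORRUPTED]    = "corrupted",
	};

	/* IDs past the end of the table come from a newer kernel. */
	if (id >= sizeof(names) / sizeof(names[0]))
		return "unknown";
	return names[id];
}
```

Falling back to "unknown" rather than failing keeps a listener usable if later revisions of the series append event types.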
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: [RFC v3 4/4] shmem: Add support for generic FS events Date: Tue, 16 Jun 2015 15:09:33 +0200 Message-ID: <1434460173-18427-5-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> Return-path: In-reply-to: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org Add support for the generic FS events interface covering threshold notifications and the ENOSPC warning. Signed-off-by: Beata Michalska --- mm/shmem.c | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index cf2d0ca..a044d12 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -201,6 +201,7 @@ static int shmem_reserve_inode(struct super_block *sb) spin_lock(&sbinfo->stat_lock); if (!sbinfo->free_inodes) { spin_unlock(&sbinfo->stat_lock); + fs_event_notify(sb, FS_WARN_ENOSPC); return -ENOSPC; } sbinfo->free_inodes--; @@ -239,8 +240,10 @@ static void shmem_recalc_inode(struct inode *inode) freed = info->alloced - info->swapped - inode->i_mapping->nrpages; if (freed > 0) { struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); - if (sbinfo->max_blocks) + if (sbinfo->max_blocks) { percpu_counter_add(&sbinfo->used_blocks, -freed); + fs_event_free_space(inode->i_sb, freed); + } info->alloced -= freed; inode->i_blocks -= freed * BLOCKS_PER_PAGE; shmem_unacct_blocks(info->flags, freed); @@ -1164,6 +1167,7 @@ repeat: goto unacct; } percpu_counter_inc(&sbinfo->used_blocks); + fs_event_alloc_space(inode->i_sb, 1); } page
= shmem_alloc_page(gfp, info, index); @@ -1245,8 +1249,10 @@ trunc: spin_unlock(&info->lock); decused: sbinfo = SHMEM_SB(inode->i_sb); - if (sbinfo->max_blocks) + if (sbinfo->max_blocks) { percpu_counter_add(&sbinfo->used_blocks, -1); + fs_event_free_space(inode->i_sb, 1); + } unacct: shmem_unacct_blocks(info->flags, 1); failed: @@ -1258,12 +1264,16 @@ unlock: unlock_page(page); page_cache_release(page); } - if (error == -ENOSPC && !once++) { + if (error == -ENOSPC) { + if (!once++) { info = SHMEM_I(inode); spin_lock(&info->lock); shmem_recalc_inode(inode); spin_unlock(&info->lock); goto repeat; + } else { + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC); + } } if (error == -EEXIST) /* from above or from radix_tree_insert */ goto repeat; @@ -2729,12 +2739,26 @@ static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len, return 1; } +static void shmem_trace_query(struct super_block *sb, u64 *ncount) +{ + struct shmem_sb_info *sbinfo = SHMEM_SB(sb); + + if (sbinfo->max_blocks) + *ncount = sbinfo->max_blocks - + percpu_counter_sum(&sbinfo->used_blocks); + +} + static const struct export_operations shmem_export_ops = { .get_parent = shmem_get_parent, .encode_fh = shmem_encode_fh, .fh_to_dentry = shmem_fh_to_dentry, }; +static const struct fs_trace_operations shmem_trace_ops = { + .query = shmem_trace_query, +}; + static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, bool remount) { @@ -3020,6 +3044,9 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent) sb->s_flags |= MS_NOUSER; } sb->s_export_op = &shmem_export_ops; + sb->s_etrace.ops = &shmem_trace_ops; + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; + sb->s_flags |= MS_NOSEC; #else sb->s_flags |= MS_NOUSER; -- 1.7.9.5
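The query callback in the hunk above reports available space as max_blocks minus the used-block count, and writes nothing at all when max_blocks is zero (a tmpfs mounted without a size limit). A stand-alone model of that arithmetic may make the second case easier to see. The struct and function names below are simplified stand-ins chosen for this sketch, and a plain integer replaces the percpu counter so that it builds outside the kernel.

```c
#include <stdint.h>

/* Simplified stand-in for struct shmem_sb_info; not the kernel type. */
struct shmem_model {
	uint64_t max_blocks;   /* 0 means no size limit was set */
	uint64_t used_blocks;  /* the kernel keeps this in a percpu counter */
};

/* Mirrors the logic of shmem_trace_query() from the patch. */
static void shmem_model_query(const struct shmem_model *sbinfo,
			      uint64_t *ncount)
{
	if (sbinfo->max_blocks)
		*ncount = sbinfo->max_blocks - sbinfo->used_blocks;
	/* With max_blocks == 0 the caller's value is left as it was. */
}
```

Whatever calls ->query() presumably has to initialise the count first, since for an unlimited mount the callback leaves it untouched.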
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: [RFC v3 2/4] ext4: Add helper function to mark group as corrupted Date: Tue, 16 Jun 2015 15:09:31 +0200 Message-ID: <1434460173-18427-3-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> Return-path: In-reply-to: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org Add ext4_mark_group_corrupted helper function to simplify the code and to keep the logic in one place. Signed-off-by: Beata Michalska --- fs/ext4/balloc.c | 15 +++------------ fs/ext4/ext4.h | 9 +++++++++ fs/ext4/ialloc.c | 5 +---- fs/ext4/mballoc.c | 11 ++--------- 4 files changed, 15 insertions(+), 25 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 83a6f49..e95b27a 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -193,10 +193,7 @@ static int ext4_init_block_bitmap(struct super_block *sb, * essentially implementing a per-group read-only flag.
*/ if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) { grp = ext4_get_group_info(sb, block_group); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) { int count; count = ext4_free_inodes_count(sb, gdp); @@ -379,20 +376,14 @@ static void ext4_validate_block_bitmap(struct super_block *sb, ext4_unlock_group(sb, block_group); ext4_error(sb, "bg %u: block %llu: invalid block bitmap", block_group, blk); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); return; } if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group, desc, bh))) { ext4_unlock_group(sb, block_group); ext4_error(sb, "bg %u: bad block bitmap checksum", block_group); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); return; } set_buffer_verified(bh); diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index f63c3d5..163afe2 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2535,6 +2535,15 @@ static inline spinlock_t *ext4_group_lock_ptr(struct super_block *sb, return bgl_lock_ptr(EXT4_SB(sb)->s_blockgroup_lock, group); } +static inline +void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, + struct ext4_group_info *grp) +{ + if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) + percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); + set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); +} + /* * Returns true if the filesystem is busy enough that attempts to * access the block group locks has run into contention. 
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index ac644c3..ebe0499 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -79,10 +79,7 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb, if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) { ext4_error(sb, "Checksum bad for group %u", block_group); grp = ext4_get_group_info(sb, block_group); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) { int count; count = ext4_free_inodes_count(sb, gdp); diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 8d1e602..24a4b6d 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -760,10 +760,7 @@ void ext4_mb_generate_buddy(struct super_block *sb, * corrupt and update bb_free using bitmap value */ grp->bb_free = free; - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); } mb_set_largest_free_order(sb, grp); @@ -1448,12 +1445,8 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b, "freeing already freed block " "(bit %u); block bitmap corrupt.", block); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - e4b->bd_info->bb_free); /* Mark the block group as corrupt. */ - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, - &e4b->bd_info->bb_state); + ext4_mark_group_corrupted(sbi, e4b->bd_info); mb_regenerate_buddy(e4b); goto done; } -- 1.7.9.5
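The property that ext4_mark_group_corrupted() centralises is that the free-cluster counter is only reduced the first time a group is flagged; later calls just leave the flag set. A small user-space model of that behaviour is sketched below. The struct names and fields are simplified stand-ins invented for the example (a bool for the corrupt bit, plain integers for the counters), not the ext4 types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct ext4_group_info. */
struct group_model {
	bool bbitmap_corrupt;  /* models EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT */
	int64_t bb_free;       /* free clusters recorded for the group */
};

/* Simplified stand-in for struct ext4_sb_info. */
struct sb_model {
	int64_t freeclusters;  /* models s_freeclusters_counter */
};

/* Same shape as ext4_mark_group_corrupted(): subtract once, then flag. */
static void mark_group_corrupted(struct sb_model *sbi,
				 struct group_model *grp)
{
	if (!grp->bbitmap_corrupt)
		sbi->freeclusters -= grp->bb_free;
	grp->bbitmap_corrupt = true;
}
```

Calling the model twice on the same group changes the counter only once, which is the invariant each of the open-coded copies removed by this patch had to preserve on its own.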
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: [RFC v3 3/4] ext4: Add support for generic FS events Date: Tue, 16 Jun 2015 15:09:32 +0200 Message-ID: <1434460173-18427-4-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> Return-path: In-reply-to: <1434460173-18427-1-git-send-email-b.michalska-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Cc: greg-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org, jack-AlSwsSmVLrQ@public.gmane.org, tytso-3s7WtUTddSA@public.gmane.org, adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org, hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, kyungmin.park-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org, kmpark-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org List-Id: linux-api@vger.kernel.org Add support for generic FS events including threshold notifications, ENOSPC and remount as read-only warnings, along with generic internal warnings/errors.
Signed-off-by: Beata Michalska --- fs/ext4/balloc.c | 10 ++++++++-- fs/ext4/ext4.h | 1 + fs/ext4/inode.c | 2 +- fs/ext4/mballoc.c | 6 +++++- fs/ext4/resize.c | 1 + fs/ext4/super.c | 39 +++++++++++++++++++++++++++++++++++++++ 6 files changed, 55 insertions(+), 4 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index e95b27a..a48450f 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -569,6 +569,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi, { if (ext4_has_free_clusters(sbi, nclusters, flags)) { percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters); + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, nclusters)); return 0; } else return -ENOSPC; @@ -590,9 +591,10 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries) { if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) || (*retries)++ > 3 || - !EXT4_SB(sb)->s_journal) + !EXT4_SB(sb)->s_journal) { + fs_event_notify(sb, FS_WARN_ENOSPC); return 0; - + } jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id); return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal); @@ -637,6 +639,10 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode, dquot_alloc_block_nofail(inode, EXT4_C2B(EXT4_SB(inode->i_sb), ar.len)); } + + if (*errp == -ENOSPC) + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC_META); + return ret; } diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 163afe2..7d75ff9 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2542,6 +2542,7 @@ void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, grp->bb_free)); } /* diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5cb9a21..2a7af0f 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1238,7 +1238,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free) 
percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free); spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); - + fs_event_free_space(sbi->s_sb, to_free); dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free)); } diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 24a4b6d..c2df6f0 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4511,6 +4511,9 @@ out: kmem_cache_free(ext4_ac_cachep, ac); if (inquota && ar->len < inquota) dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len)); + if (reserv_clstrs && ar->len < reserv_clstrs) + fs_event_free_space(sbi->s_sb, + EXT4_C2B(sbi, reserv_clstrs - ar->len)); if (!ar->len) { if ((ar->flags & EXT4_MB_DELALLOC_RESERVED) == 0) /* release all the reserved blocks if non delalloc */ @@ -4848,7 +4851,7 @@ do_more: if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters); - + fs_event_free_space(sb, EXT4_C2B(sbi, count_clusters)); ext4_mb_unload_buddy(&e4b); /* We dirtied the bitmap block */ @@ -4982,6 +4985,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, ext4_unlock_group(sb, block_group); percpu_counter_add(&sbi->s_freeclusters_counter, EXT4_NUM_B2C(sbi, blocks_freed)); + fs_event_free_space(sb, blocks_freed); if (sbi->s_log_groups_per_flex) { ext4_group_t flex_group = ext4_flex_group(sbi, block_group); diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 8a8ec62..dbf08d6 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -1378,6 +1378,7 @@ static void ext4_update_super(struct super_block *sb, EXT4_NUM_B2C(sbi, free_blocks)); percpu_counter_add(&sbi->s_freeinodes_counter, EXT4_INODES_PER_GROUP(sb) * flex_gd->count); + fs_event_free_space(sb, free_blocks - reserved_blocks); ext4_debug("free blocks count %llu", percpu_counter_read(&sbi->s_freeclusters_counter)); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index e061e66..108b667 100644 --- a/fs/ext4/super.c 
+++ b/fs/ext4/super.c @@ -585,6 +585,8 @@ void __ext4_abort(struct super_block *sb, const char *function, if (EXT4_SB(sb)->s_journal) jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); save_error_info(sb, function, line); + fs_event_notify(sb, FS_ERR_REMOUNT_RO); + } if (test_opt(sb, ERRORS_PANIC)) panic("EXT4-fs panic from previous error\n"); @@ -1083,6 +1085,12 @@ static const struct quotactl_ops ext4_qctl_operations = { }; #endif +static void ext4_trace_query(struct super_block *sb, u64 *ncount); + +static const struct fs_trace_operations ext4_trace_ops = { + .query = ext4_trace_query, +}; + static const struct super_operations ext4_sops = { .alloc_inode = ext4_alloc_inode, .destroy_inode = ext4_destroy_inode, @@ -3398,11 +3406,20 @@ static int ext4_reserve_clusters(struct ext4_sb_info *sbi, ext4_fsblk_t count) { ext4_fsblk_t clusters = ext4_blocks_count(sbi->s_es) >> sbi->s_cluster_bits; + ext4_fsblk_t current_resv; if (count >= clusters) return -EINVAL; + current_resv = atomic64_read(&sbi->s_resv_clusters); atomic64_set(&sbi->s_resv_clusters, count); + + if (count > current_resv) + fs_event_alloc_space(sbi->s_sb, + EXT4_C2B(sbi, count - current_resv)); + else + fs_event_free_space(sbi->s_sb, + EXT4_C2B(sbi, current_resv - count)); return 0; } @@ -3966,6 +3983,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) sb->s_qcop = &ext4_qctl_operations; sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP; #endif + sb->s_etrace.ops = &ext4_trace_ops; + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; + memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid)); INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */ @@ -5438,6 +5458,25 @@ out: #endif +static void ext4_trace_query(struct super_block *sb, u64 *ncount) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_super_block *es = sbi->s_es; + ext4_fsblk_t rsv_blocks; + ext4_fsblk_t nblocks; + + nblocks = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) - + 
percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter); + nblocks = EXT4_C2B(sbi, nblocks); + rsv_blocks = ext4_r_blocks_count(es) + + EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters)); + if (nblocks < rsv_blocks) + nblocks = 0; + else + nblocks -= rsv_blocks; + *ncount = nblocks; +} + static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { -- 1.7.9.5 From mboxrd@z Thu Jan 1 00:00:00 1970 From: Al Viro Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Tue, 16 Jun 2015 17:21:47 +0100 Message-ID: <20150616162147.GA17109@ZenIV.linux.org.uk> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Beata Michalska Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, greg-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org, jack-AlSwsSmVLrQ@public.gmane.org, tytso-3s7WtUTddSA@public.gmane.org, adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org, hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, kyungmin.park-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org, kmpark-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org List-Id: linux-api@vger.kernel.org On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: > Introduce configurable generic interface for file > system-wide event notifications, to provide file > systems with a common way of reporting any potential > 
issues as they emerge.
>
> The notifications are to be issued through generic
> netlink interface by newly introduced multicast group.
>
> Threshold notifications have been included, allowing
> triggering an event whenever the amount of free space drops
> below a certain level - or levels to be more precise as two
> of them are being supported: the lower and the upper range.
> The notifications work both ways: once the threshold level
> has been reached, an event shall be generated whenever
> the number of available blocks goes up again re-activating
> the threshold.
>
> The interface has been exposed through a vfs. Once mounted,
> it serves as an entry point for the set-up where one can
> register for particular file system events.

Hmm...

1) what happens if two processes write to that file at the same time,
trying to create an entry for the same fs? WARN_ON() and fail for one
of them if they race?

2) what happens if fs is mounted more than once (e.g. in different
namespaces, or bound at different mountpoints, or just plain mounted
several times in different places) and we add an event for each? More
specifically, what should happen when one of those gets unmounted?

3) what's the meaning of ->active? Is that "fs_drop_trace_entry() hadn't
been called yet" flag? Unless I'm misreading it, we can very well get
explicit removal race with umount, resulting in cleanup_mnt() returning
from fs_event_mount_dropped() before the first process (i.e. write asking
to remove that entry) gets around to its deactivate_super(), ending up
with umount(2) on a filesystem that isn't mounted anywhere else reporting
success to userland before the actual fs shutdown, which is not a nice
thing to do...

4) test in fs_event_mount_dropped() looks very odd - by that point we
are absolutely guaranteed to have ->mnt_ns == NULL. What's that supposed
to do?

Al, trying to figure out the lifetime rules in all of that...
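Question 1 above concerns two writers racing to create an entry for the same filesystem. One common way to give that case a defined outcome is to do the lookup and the insert inside a single critical section, so that exactly one caller creates the entry and the other gets -EEXIST. The sketch below shows that pattern in user space only. It is not what the patch does and not something proposed in this thread; the fixed-size table, the names trace_add() and traces_lock, and the choice of error codes are all assumptions made for the illustration.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_TRACES 8

/* Hypothetical stand-in for the trace list, keyed by an opaque sb pointer. */
static const void *traces[MAX_TRACES];
static pthread_mutex_t traces_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Lookup and insert under one lock: of two racing callers passing the
 * same key, one gets 0 and the other -EEXIST, never two entries.
 */
static int trace_add(const void *sb)
{
	int i, free_slot = -1, ret;

	if (!sb)
		return -EINVAL;

	pthread_mutex_lock(&traces_lock);
	for (i = 0; i < MAX_TRACES; i++) {
		if (traces[i] == sb) {
			ret = -EEXIST;
			goto out;
		}
		/* Remember the first empty slot while scanning. */
		if (!traces[i] && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0) {
		ret = -ENOSPC;
		goto out;
	}
	traces[free_slot] = sb;
	ret = 0;
out:
	pthread_mutex_unlock(&traces_lock);
	return ret;
}
```

Splitting the same logic into a separate "find" and "add" step, each taking the lock on its own, would reopen the window the question is pointing at.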
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Leon Romanovsky Subject: Re: [RFC v3 4/4] shmem: Add support for generic FS events Date: Wed, 17 Jun 2015 09:08:41 +0300 Message-ID: References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-5-git-send-email-b.michalska@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Return-path: In-Reply-To: <1434460173-18427-5-git-send-email-b.michalska@samsung.com> Sender: linux-kernel-owner@vger.kernel.org To: Beata Michalska Cc: "linux-kernel@vger.kernel.org" , Linux-FSDevel , linux-api@vger.kernel.org, Greg Kroah , jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, Hugh Dickins , lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, Linux-MM , kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org > } > - if (error == -ENOSPC && !once++) { > + if (error == -ENOSPC) { > + if (!once++) { > info = SHMEM_I(inode); > spin_lock(&info->lock); > shmem_recalc_inode(inode); > spin_unlock(&info->lock); > goto repeat; > + } else { > + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC); > + } > } Very minor remark, please fix indentation. 
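For reference, the nesting of the quoted hunk can be modelled outside the kernel. This is only a minimal user-space sketch, not the shmem code: try_alloc(), recalc() and notify_enospc() are hypothetical stand-ins for the allocation attempt, shmem_recalc_inode() and fs_event_notify(sb, FS_WARN_ENOSPC). It shows the intended control flow with consistent indentation: the first ENOSPC triggers a recalculation and one retry, and only a second ENOSPC raises the event.

```c
#include <errno.h>

/* Hypothetical stand-in for the filesystem state; not a kernel structure. */
struct fake_fs {
	int free_blocks;    /* blocks currently visible to the allocator */
	int stale_blocks;   /* blocks a recalculation would give back */
	int enospc_events;  /* number of ENOSPC notifications raised */
};

/* Stand-in for the allocation attempt. */
static int try_alloc(struct fake_fs *fs)
{
	if (fs->free_blocks <= 0)
		return -ENOSPC;
	fs->free_blocks--;
	return 0;
}

/* Stand-in for shmem_recalc_inode(): reclaim blocks that were miscounted. */
static void recalc(struct fake_fs *fs)
{
	fs->free_blocks += fs->stale_blocks;
	fs->stale_blocks = 0;
}

/* Stand-in for fs_event_notify(sb, FS_WARN_ENOSPC). */
static void notify_enospc(struct fake_fs *fs)
{
	fs->enospc_events++;
}

/* Retry once after ENOSPC; notify only if the retry fails as well. */
int alloc_with_retry(struct fake_fs *fs)
{
	int once = 0;
	int error;

repeat:
	error = try_alloc(fs);
	if (error == -ENOSPC) {
		if (!once++) {
			recalc(fs);
			goto repeat;
		} else {
			notify_enospc(fs);
		}
	}
	return error;
}
```

With this shape, a transient shortage that the recalculation resolves never produces an event.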
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Leon Romanovsky Subject: Re: [RFC v3 3/4] ext4: Add support for generic FS events Date: Wed, 17 Jun 2015 09:15:24 +0300 Message-ID: References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-4-git-send-email-b.michalska@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Return-path: In-Reply-To: <1434460173-18427-4-git-send-email-b.michalska@samsung.com> Sender: owner-linux-mm@kvack.org To: Beata Michalska Cc: "linux-kernel@vger.kernel.org" , Linux-FSDevel , linux-api , Greg Kroah , jack , tytso , "adilger.kernel" , Hugh Dickins , lczerner , hch , linux-ext4 , Linux-MM , "kyungmin.park" , kmpark List-Id: linux-api@vger.kernel.org On Tue, Jun 16, 2015 at 4:09 PM, Beata Michalska wrote: > Add support for generic FS events including threshold > notifications, ENOSPC and remount as read-only warnings, > along with generic internal warnings/errors. > > Signed-off-by: Beata Michalska > --- > fs/ext4/balloc.c | 10 ++++++++-- > fs/ext4/ext4.h | 1 + > fs/ext4/inode.c | 2 +- > fs/ext4/mballoc.c | 6 +++++- > fs/ext4/resize.c | 1 + > fs/ext4/super.c | 39 +++++++++++++++++++++++++++++++++++++++ > 6 files changed, 55 insertions(+), 4 deletions(-) > > diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c > index e95b27a..a48450f 100644 > --- a/fs/ext4/balloc.c > +++ b/fs/ext4/balloc.c > @@ -569,6 +569,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi, > { > if (ext4_has_free_clusters(sbi, nclusters, flags)) { > percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters); > + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, nclusters)); > return 0; > } else > return -ENOSPC; Do you need to add "fs_event_notify(sb, FS_WARN_ENOSPC);" here too? 
> @@ -590,9 +591,10 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries) > { > if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) || > (*retries)++ > 3 || > - !EXT4_SB(sb)->s_journal) > + !EXT4_SB(sb)->s_journal) { > + fs_event_notify(sb, FS_WARN_ENOSPC); > return 0; > - > + } > jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id); > > return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal); > @@ -637,6 +639,10 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode, > dquot_alloc_block_nofail(inode, > EXT4_C2B(EXT4_SB(inode->i_sb), ar.len)); > } > + > + if (*errp == -ENOSPC) > + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC_META); > + > return ret; > } > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 163afe2..7d75ff9 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -2542,6 +2542,7 @@ void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, > if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) > percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); > set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); > + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, grp->bb_free)); > } > > /* > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 5cb9a21..2a7af0f 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1238,7 +1238,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free) > percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free); > > spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); > - > + fs_event_free_space(sbi->s_sb, to_free); > dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free)); > } > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index 24a4b6d..c2df6f0 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4511,6 +4511,9 @@ out: > kmem_cache_free(ext4_ac_cachep, ac); > if (inquota && ar->len < inquota) > dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len)); > + if (reserv_clstrs && ar->len < reserv_clstrs) > + 
fs_event_free_space(sbi->s_sb, > + EXT4_C2B(sbi, reserv_clstrs - ar->len)); > if (!ar->len) { > if ((ar->flags & EXT4_MB_DELALLOC_RESERVED) == 0) > /* release all the reserved blocks if non delalloc */ > @@ -4848,7 +4851,7 @@ do_more: > if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) > dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); > percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters); > - > + fs_event_free_space(sb, EXT4_C2B(sbi, count_clusters)); > ext4_mb_unload_buddy(&e4b); > > /* We dirtied the bitmap block */ > @@ -4982,6 +4985,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, > ext4_unlock_group(sb, block_group); > percpu_counter_add(&sbi->s_freeclusters_counter, > EXT4_NUM_B2C(sbi, blocks_freed)); > + fs_event_free_space(sb, blocks_freed); > > if (sbi->s_log_groups_per_flex) { > ext4_group_t flex_group = ext4_flex_group(sbi, block_group); > diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c > index 8a8ec62..dbf08d6 100644 > --- a/fs/ext4/resize.c > +++ b/fs/ext4/resize.c > @@ -1378,6 +1378,7 @@ static void ext4_update_super(struct super_block *sb, > EXT4_NUM_B2C(sbi, free_blocks)); > percpu_counter_add(&sbi->s_freeinodes_counter, > EXT4_INODES_PER_GROUP(sb) * flex_gd->count); > + fs_event_free_space(sb, free_blocks - reserved_blocks); > > ext4_debug("free blocks count %llu", > percpu_counter_read(&sbi->s_freeclusters_counter)); > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index e061e66..108b667 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -585,6 +585,8 @@ void __ext4_abort(struct super_block *sb, const char *function, > if (EXT4_SB(sb)->s_journal) > jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); > save_error_info(sb, function, line); > + fs_event_notify(sb, FS_ERR_REMOUNT_RO); > + > } > if (test_opt(sb, ERRORS_PANIC)) > panic("EXT4-fs panic from previous error\n"); > @@ -1083,6 +1085,12 @@ static const struct quotactl_ops ext4_qctl_operations = { > }; > #endif > > +static void 
ext4_trace_query(struct super_block *sb, u64 *ncount); > + > +static const struct fs_trace_operations ext4_trace_ops = { > + .query = ext4_trace_query, > +}; > + > static const struct super_operations ext4_sops = { > .alloc_inode = ext4_alloc_inode, > .destroy_inode = ext4_destroy_inode, > @@ -3398,11 +3406,20 @@ static int ext4_reserve_clusters(struct ext4_sb_info *sbi, ext4_fsblk_t count) > { > ext4_fsblk_t clusters = ext4_blocks_count(sbi->s_es) >> > sbi->s_cluster_bits; > + ext4_fsblk_t current_resv; > > if (count >= clusters) > return -EINVAL; > > + current_resv = atomic64_read(&sbi->s_resv_clusters); > atomic64_set(&sbi->s_resv_clusters, count); > + > + if (count > current_resv) > + fs_event_alloc_space(sbi->s_sb, > + EXT4_C2B(sbi, count - current_resv)); > + else > + fs_event_free_space(sbi->s_sb, > + EXT4_C2B(sbi, current_resv - count)); > return 0; > } > > @@ -3966,6 +3983,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) > sb->s_qcop = &ext4_qctl_operations; > sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP; > #endif > + sb->s_etrace.ops = &ext4_trace_ops; > + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; > + > memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid)); > > INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */ > @@ -5438,6 +5458,25 @@ out: > > #endif > > +static void ext4_trace_query(struct super_block *sb, u64 *ncount) > +{ > + struct ext4_sb_info *sbi = EXT4_SB(sb); > + struct ext4_super_block *es = sbi->s_es; > + ext4_fsblk_t rsv_blocks; > + ext4_fsblk_t nblocks; > + > + nblocks = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) - > + percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter); > + nblocks = EXT4_C2B(sbi, nblocks); > + rsv_blocks = ext4_r_blocks_count(es) + > + EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters)); > + if (nblocks < rsv_blocks) > + nblocks = 0; > + else > + nblocks -= rsv_blocks; > + *ncount = nblocks; > +} > + > static struct dentry *ext4_mount(struct 
file_system_type *fs_type, int flags, > const char *dev_name, void *data) > { > -- > 1.7.9.5 > -- Leon Romanovsky | Independent Linux Consultant www.leon.nu | leon@leon.nu From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Wed, 17 Jun 2015 11:22:49 +0200 Message-ID: <55813C69.2040401@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <20150616162147.GA17109@ZenIV.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Return-path: In-reply-to: <20150616162147.GA17109@ZenIV.linux.org.uk> Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org Hi, On 06/16/2015 06:21 PM, Al Viro wrote: > On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: >> Introduce configurable generic interface for file >> system-wide event notifications, to provide file >> systems with a common way of reporting any potential >> issues as they emerge. >> >> The notifications are to be issued through generic >> netlink interface by newly introduced multicast group. 
>> >> Threshold notifications have been included, allowing >> triggering an event whenever the amount of free space drops >> below a certain level - or levels to be more precise as two >> of them are being supported: the lower and the upper range. >> The notifications work both ways: once the threshold level >> has been reached, an event shall be generated whenever >> the number of available blocks goes up again re-activating >> the threshold. >> >> The interface has been exposed through a vfs. Once mounted, >> it serves as an entry point for the set-up where one can >> register for particular file system events. > > Hmm... > > 1) what happens if two processes write to that file at the same time, > trying to create an entry for the same fs? WARN_ON() and fail for one > of them if they race? > There are some limits here - I admit. The entries in the config file might be overwritten at any time - there is no support for multiple config entries for the same mounted fs. This is mainly due to the threshold notifications: handling potentially numerous threshold limits each time the number of available blocks changes didn't seem like a good idea. So this is more like a global config, resembling sysfs fs-related tune options. > 2) what happens if fs is mounted more than once (e.g. in different > namespaces, or bound at different mountpoints, or just plain mounted > several times in different places) and we add an event for each? > More specifically, what should happen when one of those gets unmounted? > Each write to that file is being handled within the current namespace. Setting up an entry for a mount point from a different mnt namespace needs switching to that ns. As for bound mounts: the entry exists until the mount point it has been registered with is detached. The events can only be registered for one of the mount points, as they are tied with the super block - so one cannot have a separate config entry for each bound mounts. > 3) what's the meaning of ->active? 
Is that "fs_drop_trace_entry() hadn't > been called yet" flag? Unless I'm misreading it, we can very well get > explicit removal race with umount, resulting in cleanup_mnt() returning > from fs_event_mount_dropped() before the first process (i.e. write > asking to remove that entry) gets around to its deactivate_super(), > ending up with umount(2) on a filesystem that isn't mounted anywhere > else reporting success to userland before the actual fs shutdown, which > is not a nice thing to do... > The 'active' means simply that the entry for a given mounted fs is still valid in a way that the events are still required: the entry in the config file has not been removed. When the trace is being removed - its 'active' field gets invalidated to mark that the events for related fs are no longer needed. deactivate_super() should get called only once, dropping the reference acquired while creating the entry (fs_new_trace_entry). While in fs_drop_trace_entry, lock is being held (in both cases: unmount and explicit entry removal). The fs_drop_trace_entry will silently skip all the clean-up if the entry is inactive. I might be missing something here - though. If so, I would really appreciate some more of your comments. > 4) test in fs_event_mount_dropped() looks very odd - by that point we > are absolutely guaranteed to have ->mnt_ns == NULL. What's that supposed > to do? > I have totally missed the fact that the mnt namespace pointer is invalidated during unmount_tree - cannot really explain why that did happen. So thank You for pointing that out. This should be simply checking if it's still valid. This verification is needed in case the mount that is being detached is not the one the events have been registered with as they refer to fs not a particular mount point. 
This is the case with the mnt namespaces: let's assume one registers for events for particular mounted fs in an init mnt namespace, then the new mnt namespace is being created with shared mount points being cloned: so the same mount point exists in both namespaces. Now if this mnt point gets detached: either through umount or during the mnt namespace being swept out - the entry in the init mnt namespace should remain untouched - same applies the other way round. > > Al, trying to figure out the lifetime rules in all of that... > Best Regards Beata From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: Re: [RFC v3 4/4] shmem: Add support for generic FS events Date: Wed, 17 Jun 2015 11:23:40 +0200 Message-ID: <55813C9C.1010608@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-5-git-send-email-b.michalska@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Return-path: In-reply-to: Sender: linux-kernel-owner@vger.kernel.org To: Leon Romanovsky Cc: "linux-kernel@vger.kernel.org" , Linux-FSDevel , linux-api@vger.kernel.org, Greg Kroah , jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, Hugh Dickins , lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, Linux-MM , kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org On 06/17/2015 08:08 AM, Leon Romanovsky wrote: >> } >> - if (error == -ENOSPC && !once++) { >> + if (error == -ENOSPC) { >> + if (!once++) { >> info = SHMEM_I(inode); >> spin_lock(&info->lock); >> shmem_recalc_inode(inode); >> spin_unlock(&info->lock); >> goto repeat; >> + } else { >> + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC); >> + } >> } > > Very minor remark, please fix indentation. > I will, thank You. 
BR Beata From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: Re: [RFC v3 3/4] ext4: Add support for generic FS events Date: Wed, 17 Jun 2015 11:25:04 +0200 Message-ID: <55813CF0.6010602@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-4-git-send-email-b.michalska@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Return-path: In-reply-to: Sender: owner-linux-mm@kvack.org To: Leon Romanovsky Cc: "linux-kernel@vger.kernel.org" , Linux-FSDevel , linux-api , Greg Kroah , jack , tytso , "adilger.kernel" , Hugh Dickins , lczerner , hch , linux-ext4 , Linux-MM , "kyungmin.park" , kmpark List-Id: linux-api@vger.kernel.org On 06/17/2015 08:15 AM, Leon Romanovsky wrote: > On Tue, Jun 16, 2015 at 4:09 PM, Beata Michalska > wrote: >> Add support for generic FS events including threshold >> notifications, ENOSPC and remount as read-only warnings, >> along with generic internal warnings/errors. >> >> Signed-off-by: Beata Michalska >> --- >> fs/ext4/balloc.c | 10 ++++++++-- >> fs/ext4/ext4.h | 1 + >> fs/ext4/inode.c | 2 +- >> fs/ext4/mballoc.c | 6 +++++- >> fs/ext4/resize.c | 1 + >> fs/ext4/super.c | 39 +++++++++++++++++++++++++++++++++++++++ >> 6 files changed, 55 insertions(+), 4 deletions(-) >> >> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c >> index e95b27a..a48450f 100644 >> --- a/fs/ext4/balloc.c >> +++ b/fs/ext4/balloc.c >> @@ -569,6 +569,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi, >> { >> if (ext4_has_free_clusters(sbi, nclusters, flags)) { >> percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters); >> + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, nclusters)); >> return 0; >> } else >> return -ENOSPC; > Do you need to add "fs_event_notify(sb, FS_WARN_ENOSPC);" here too? Yeap, I've missed that one. Thank You. 
BR Beata > >> @@ -590,9 +591,10 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries) >> { >> if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) || >> (*retries)++ > 3 || >> - !EXT4_SB(sb)->s_journal) >> + !EXT4_SB(sb)->s_journal) { >> + fs_event_notify(sb, FS_WARN_ENOSPC); >> return 0; >> - >> + } >> jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id); >> >> return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal); >> @@ -637,6 +639,10 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode, >> dquot_alloc_block_nofail(inode, >> EXT4_C2B(EXT4_SB(inode->i_sb), ar.len)); >> } >> + >> + if (*errp == -ENOSPC) >> + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC_META); >> + >> return ret; >> } >> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h >> index 163afe2..7d75ff9 100644 >> --- a/fs/ext4/ext4.h >> +++ b/fs/ext4/ext4.h >> @@ -2542,6 +2542,7 @@ void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, >> if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) >> percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); >> set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); >> + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, grp->bb_free)); >> } >> >> /* >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index 5cb9a21..2a7af0f 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -1238,7 +1238,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free) >> percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free); >> >> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); >> - >> + fs_event_free_space(sbi->s_sb, to_free); >> dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free)); >> } >> >> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c >> index 24a4b6d..c2df6f0 100644 >> --- a/fs/ext4/mballoc.c >> +++ b/fs/ext4/mballoc.c >> @@ -4511,6 +4511,9 @@ out: >> kmem_cache_free(ext4_ac_cachep, ac); >> if (inquota && ar->len < inquota) >> dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - 
ar->len)); >> + if (reserv_clstrs && ar->len < reserv_clstrs) >> + fs_event_free_space(sbi->s_sb, >> + EXT4_C2B(sbi, reserv_clstrs - ar->len)); >> if (!ar->len) { >> if ((ar->flags & EXT4_MB_DELALLOC_RESERVED) == 0) >> /* release all the reserved blocks if non delalloc */ >> @@ -4848,7 +4851,7 @@ do_more: >> if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) >> dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); >> percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters); >> - >> + fs_event_free_space(sb, EXT4_C2B(sbi, count_clusters)); >> ext4_mb_unload_buddy(&e4b); >> >> /* We dirtied the bitmap block */ >> @@ -4982,6 +4985,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, >> ext4_unlock_group(sb, block_group); >> percpu_counter_add(&sbi->s_freeclusters_counter, >> EXT4_NUM_B2C(sbi, blocks_freed)); >> + fs_event_free_space(sb, blocks_freed); >> >> if (sbi->s_log_groups_per_flex) { >> ext4_group_t flex_group = ext4_flex_group(sbi, block_group); >> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c >> index 8a8ec62..dbf08d6 100644 >> --- a/fs/ext4/resize.c >> +++ b/fs/ext4/resize.c >> @@ -1378,6 +1378,7 @@ static void ext4_update_super(struct super_block *sb, >> EXT4_NUM_B2C(sbi, free_blocks)); >> percpu_counter_add(&sbi->s_freeinodes_counter, >> EXT4_INODES_PER_GROUP(sb) * flex_gd->count); >> + fs_event_free_space(sb, free_blocks - reserved_blocks); >> >> ext4_debug("free blocks count %llu", >> percpu_counter_read(&sbi->s_freeclusters_counter)); >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c >> index e061e66..108b667 100644 >> --- a/fs/ext4/super.c >> +++ b/fs/ext4/super.c >> @@ -585,6 +585,8 @@ void __ext4_abort(struct super_block *sb, const char *function, >> if (EXT4_SB(sb)->s_journal) >> jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); >> save_error_info(sb, function, line); >> + fs_event_notify(sb, FS_ERR_REMOUNT_RO); >> + >> } >> if (test_opt(sb, ERRORS_PANIC)) >> panic("EXT4-fs panic from previous error\n"); >> @@ -1083,6 
+1085,12 @@ static const struct quotactl_ops ext4_qctl_operations = { >> }; >> #endif >> >> +static void ext4_trace_query(struct super_block *sb, u64 *ncount); >> + >> +static const struct fs_trace_operations ext4_trace_ops = { >> + .query = ext4_trace_query, >> +}; >> + >> static const struct super_operations ext4_sops = { >> .alloc_inode = ext4_alloc_inode, >> .destroy_inode = ext4_destroy_inode, >> @@ -3398,11 +3406,20 @@ static int ext4_reserve_clusters(struct ext4_sb_info *sbi, ext4_fsblk_t count) >> { >> ext4_fsblk_t clusters = ext4_blocks_count(sbi->s_es) >> >> sbi->s_cluster_bits; >> + ext4_fsblk_t current_resv; >> >> if (count >= clusters) >> return -EINVAL; >> >> + current_resv = atomic64_read(&sbi->s_resv_clusters); >> atomic64_set(&sbi->s_resv_clusters, count); >> + >> + if (count > current_resv) >> + fs_event_alloc_space(sbi->s_sb, >> + EXT4_C2B(sbi, count - current_resv)); >> + else >> + fs_event_free_space(sbi->s_sb, >> + EXT4_C2B(sbi, current_resv - count)); >> return 0; >> } >> >> @@ -3966,6 +3983,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) >> sb->s_qcop = &ext4_qctl_operations; >> sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP; >> #endif >> + sb->s_etrace.ops = &ext4_trace_ops; >> + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; >> + >> memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid)); >> >> INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */ >> @@ -5438,6 +5458,25 @@ out: >> >> #endif >> >> +static void ext4_trace_query(struct super_block *sb, u64 *ncount) >> +{ >> + struct ext4_sb_info *sbi = EXT4_SB(sb); >> + struct ext4_super_block *es = sbi->s_es; >> + ext4_fsblk_t rsv_blocks; >> + ext4_fsblk_t nblocks; >> + >> + nblocks = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) - >> + percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter); >> + nblocks = EXT4_C2B(sbi, nblocks); >> + rsv_blocks = ext4_r_blocks_count(es) + >> + EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters)); >> + if 
(nblocks < rsv_blocks) >> + nblocks = 0; >> + else >> + nblocks -= rsv_blocks; >> + *ncount = nblocks; >> +} >> + >> static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags, >> const char *dev_name, void *data) >> { >> -- >> 1.7.9.5 >> > > > From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Thu, 18 Jun 2015 09:06:05 +1000 Message-ID: <20150617230605.GK10224@dastard> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska@samsung.com> Sender: owner-linux-mm@kvack.org To: Beata Michalska Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: > Introduce configurable generic interface for file > system-wide event notifications, to provide file > systems with a common way of reporting any potential > issues as they emerge. > > The notifications are to be issued through generic > netlink interface by newly introduced multicast group. 
> > Threshold notifications have been included, allowing > triggering an event whenever the amount of free space drops > below a certain level - or levels to be more precise as two > of them are being supported: the lower and the upper range. > The notifications work both ways: once the threshold level > has been reached, an event shall be generated whenever > the number of available blocks goes up again re-activating > the threshold. > > The interface has been exposed through a vfs. Once mounted, > it serves as an entry point for the set-up where one can > register for particular file system events. > > Signed-off-by: Beata Michalska This has massive scalability problems: > + 4.3 Threshold notifications: > + > + #include > + void fs_event_alloc_space(struct super_block *sb, u64 ncount); > + void fs_event_free_space(struct super_block *sb, u64 ncount); > + > + Each filesystem supporting the threshold notifications should call > + fs_event_alloc_space/fs_event_free_space respectively whenever the > + amount of available blocks changes. > + - sb: the filesystem's super block > + - ncount: number of blocks being acquired/released ... here. > + Note that to properly handle the threshold notifications the fs events > + interface needs to be kept up to date by the filesystems. Each should > + register fs_trace_operations to enable querying the current number of > + available blocks. Have you noticed that the filesystems have percpu counters for tracking global space usage? There's good reason for that - taking a spinlock in such a hot accounting path causes severe contention. 
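To make the contention point concrete, here is a rough sketch of the batching idea behind per-CPU counters. It is user-space C, single-threaded for brevity, and the names (batched_counter, NSLOTS, BATCH, bc_add, bc_read, bc_sum) are illustrative assumptions, not the kernel's percpu_counter API. Each slot accumulates a local delta and touches the shared total, which is the only place a lock would be needed, only once the delta reaches the batch size.

```c
#include <stdint.h>

#define NSLOTS 4   /* stand-in for the number of CPUs */
#define BATCH  32  /* local delta tolerated before folding into the total */

struct batched_counter {
	int64_t total;          /* shared value; a real one would need a lock */
	int64_t local[NSLOTS];  /* per-slot deltas, updated without locking */
	unsigned long folds;    /* how many times the shared value was touched */
};

/* Add to one slot; fold into the shared total only when the batch fills. */
void bc_add(struct batched_counter *c, int slot, int64_t amount)
{
	int64_t v = c->local[slot] + amount;

	if (v >= BATCH || v <= -BATCH) {
		c->total += v;       /* the only contended update */
		c->local[slot] = 0;
		c->folds++;
	} else {
		c->local[slot] = v;  /* cheap, slot-local update */
	}
}

/* Fast, approximate read: ignores deltas still held in the slots. */
int64_t bc_read(const struct batched_counter *c)
{
	return c->total;
}

/* Slow, exact read: walks every slot. */
int64_t bc_sum(const struct batched_counter *c)
{
	int64_t sum = c->total;

	for (int i = 0; i < NSLOTS; i++)
		sum += c->local[i];
	return sum;
}
```

A per-call spinlock, as in the quoted fs_event_alloc_space(), puts every allocation through the contended path, which is exactly what the batching is meant to avoid.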
> +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id) > +{ > + size_t size = nla_total_size(sizeof(u32)) * 2 + > + nla_total_size(sizeof(u64)); > + > + fs_netlink_send_event(size, event_id, create_common_msg, en); > +} > + > +static void fs_event_send_thresh(struct fs_trace_entry *en, > + unsigned int event_id) > +{ > + size_t size = nla_total_size(sizeof(u32)) * 2 + > + nla_total_size(sizeof(u64)) * 2; > + > + fs_netlink_send_event(size, event_id, create_thresh_msg, en); > +} > + > +void fs_event_notify(struct super_block *sb, unsigned int event_id) > +{ > + struct fs_trace_entry *en; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; > + > + spin_lock(&en->lock); > + if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC)) > + fs_event_send(en, event_id); > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > +} > +EXPORT_SYMBOL(fs_event_notify); > + > +void fs_event_alloc_space(struct super_block *sb, u64 ncount) > +{ > + struct fs_trace_entry *en; > + s64 count; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; Adds an atomic write to get the trace entry, > + spin_lock(&en->lock); a spin lock to lock the entry, > + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) > + goto leave; > + /* > + * we shouldn't drop below 0 here, > + * unless there is a sync issue somewhere (?) > + */ > + count = en->th.avail_space - ncount; > + en->th.avail_space = count < 0 ? 
0 : count; > + > + if (en->th.avail_space > en->th.lrange) > + /* Not 'even' close - leave */ > + goto leave; > + > + if (en->th.avail_space > en->th.urange) { > + /* Close enough - the lower range has been reached */ > + if (!(en->th.state & THRESH_LR_BEYOND)) { > + /* Send notification */ > + fs_event_send_thresh(en, FS_THR_LRBELOW); > + en->th.state &= ~THRESH_LR_BELOW; > + en->th.state |= THRESH_LR_BEYOND; > + } > + goto leave; Then puts the entire netlink send path inside this spinlock, which includes memory allocation and all sorts of non-filesystem code paths. And it may be inside critical filesystem locks as well.... Apart from the serialisation problem of the locking, adding memory allocation and the network send path to filesystem code that is effectively considered "innermost" filesystem code is going to have all sorts of problems for various filesystems. In the XFS case, we simply cannot execute this sort of function in the places where we update global space accounting. As it is, I think the basic concept of separate tracking of free space is fundamentally flawed. What I think needs to be done is that filesystems need access to the thresholds for events, and then the filesystems call fs_event_send_thresh() themselves from appropriate contexts (ie. without compromising locking, scalability, memory allocation recursion constraints, etc). e.g. instead of tracking every change in free space, a filesystem might execute this once every few seconds from a workqueue: event = fs_event_need_space_warning(sb, ) if (event) fs_event_send_thresh(sb, event); User still gets warnings about space usage, but there's no runtime overhead or problems with lock/memory allocation contexts, etc. Cheers, Dave. -- Dave Chinner david@fromorbit.com
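The periodic check suggested above could look roughly like the following. This is a hedged user-space sketch, not the proposed kernel interface: need_space_warning(), struct space_thresh and the SPACE_EV_* codes are hypothetical names, and only the lrange/urange ordering (lrange is reached first, urange is the tighter limit) is taken from the quoted patch. The caller passes in the free-block count it already maintains, so nothing is tracked per allocation.

```c
#include <stdint.h>

/* Hypothetical event codes for this sketch only. */
enum space_event {
	SPACE_EV_NONE = 0,
	SPACE_EV_BELOW_LOWER,  /* fell to or below the first limit (lrange) */
	SPACE_EV_BELOW_UPPER,  /* fell to or below the tighter limit (urange) */
	SPACE_EV_ABOVE_LOWER   /* recovered above the first limit */
};

struct space_thresh {
	uint64_t lrange;  /* first limit, in blocks */
	uint64_t urange;  /* tighter limit, in blocks; urange < lrange */
	int level;        /* 0: above both, 1: <= lrange, 2: <= urange */
};

/*
 * Compare the current free-block count against the limits and report
 * a transition at most once, so a periodic caller does not repeat the
 * same warning on every tick.
 */
enum space_event need_space_warning(struct space_thresh *t, uint64_t avail)
{
	int level = 0;
	int prev = t->level;

	if (avail <= t->urange)
		level = 2;
	else if (avail <= t->lrange)
		level = 1;

	t->level = level;

	if (level > prev)
		return level == 2 ? SPACE_EV_BELOW_UPPER : SPACE_EV_BELOW_LOWER;
	if (level == 0 && prev > 0)
		return SPACE_EV_ABOVE_LOWER;
	/* Moving from level 2 back to level 1 is deliberately silent here. */
	return SPACE_EV_NONE;
}
```

A worker running every few seconds would read the filesystem's own counters, call this function, and send a notification only when the result is not SPACE_EV_NONE, which keeps allocation and the netlink send out of the accounting path.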
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Thu, 18 Jun 2015 10:25:08 +0200 Message-ID: <55828064.5040301@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <20150617230605.GK10224@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Return-path: In-reply-to: <20150617230605.GK10224@dastard> Sender: linux-kernel-owner@vger.kernel.org To: Dave Chinner Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org Hi, On 06/18/2015 01:06 AM, Dave Chinner wrote: > On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: >> Introduce configurable generic interface for file >> system-wide event notifications, to provide file >> systems with a common way of reporting any potential >> issues as they emerge. >> >> The notifications are to be issued through generic >> netlink interface by newly introduced multicast group. >> >> Threshold notifications have been included, allowing >> triggering an event whenever the amount of free space drops >> below a certain level - or levels to be more precise as two >> of them are being supported: the lower and the upper range. >> The notifications work both ways: once the threshold level >> has been reached, an event shall be generated whenever >> the number of available blocks goes up again re-activating >> the threshold. >> >> The interface has been exposed through a vfs. 
Once mounted, >> it serves as an entry point for the set-up where one can >> register for particular file system events. >> >> Signed-off-by: Beata Michalska > > This has massive scalability problems: > >> + 4.3 Threshold notifications: >> + >> + #include >> + void fs_event_alloc_space(struct super_block *sb, u64 ncount); >> + void fs_event_free_space(struct super_block *sb, u64 ncount); >> + >> + Each filesystme supporting the threshold notifications should call >> + fs_event_alloc_space/fs_event_free_space respectively whenever the >> + amount of available blocks changes. >> + - sb: the filesystem's super block >> + - ncount: number of blocks being acquired/released > > ... here. > >> + Note that to properly handle the threshold notifications the fs events >> + interface needs to be kept up to date by the filesystems. Each should >> + register fs_trace_operations to enable querying the current number of >> + available blocks. > > Have you noticed that the filesystems have percpu counters for > tracking global space usage? There's good reason for that - taking a > spinlock in such a hot accounting path causes severe contention. 
> >> +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id) >> +{ >> + size_t size = nla_total_size(sizeof(u32)) * 2 + >> + nla_total_size(sizeof(u64)); >> + >> + fs_netlink_send_event(size, event_id, create_common_msg, en); >> +} >> + >> +static void fs_event_send_thresh(struct fs_trace_entry *en, >> + unsigned int event_id) >> +{ >> + size_t size = nla_total_size(sizeof(u32)) * 2 + >> + nla_total_size(sizeof(u64)) * 2; >> + >> + fs_netlink_send_event(size, event_id, create_thresh_msg, en); >> +} >> + >> +void fs_event_notify(struct super_block *sb, unsigned int event_id) >> +{ >> + struct fs_trace_entry *en; >> + >> + en = fs_trace_entry_get_rcu(sb); >> + if (!en) >> + return; >> + >> + spin_lock(&en->lock); >> + if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC)) >> + fs_event_send(en, event_id); >> + spin_unlock(&en->lock); >> + fs_trace_entry_put(en); >> +} >> +EXPORT_SYMBOL(fs_event_notify); >> + >> +void fs_event_alloc_space(struct super_block *sb, u64 ncount) >> +{ >> + struct fs_trace_entry *en; >> + s64 count; >> + >> + en = fs_trace_entry_get_rcu(sb); >> + if (!en) >> + return; > > Adds an atomic write to get the trace entry, > >> + spin_lock(&en->lock); > > a spin lock to lock the entry, > > >> + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) >> + goto leave; >> + /* >> + * we shouldn't drop below 0 here, >> + * unless there is a sync issue somewhere (?) >> + */ >> + count = en->th.avail_space - ncount; >> + en->th.avail_space = count < 0 ? 
0 : count; >> + >> + if (en->th.avail_space > en->th.lrange) >> + /* Not 'even' close - leave */ >> + goto leave; >> + >> + if (en->th.avail_space > en->th.urange) { >> + /* Close enough - the lower range has been reached */ >> + if (!(en->th.state & THRESH_LR_BEYOND)) { >> + /* Send notification */ >> + fs_event_send_thresh(en, FS_THR_LRBELOW); >> + en->th.state &= ~THRESH_LR_BELOW; >> + en->th.state |= THRESH_LR_BEYOND; >> + } >> + goto leave; > > Then puts the entire netlink send path inside this spinlock, which > includes memory allocation and all sorts of non-filesystem code > paths. And it may be inside critical filesystem locks as well.... > > Apart from the serialisation problem of the locking, adding > memory allocation and the network send path to filesystem code > that is effectively considered "innermost" filesystem code is going > to have all sorts of problems for various filesystems. In the XFS > case, we simply cannot execute this sort of function in the places > where we update global space accounting. > > As it is, I think the basic concept of separate tracking of free > space if fundamentally flawed. What I think needs to be done is that > filesystems need access to the thresholds for events, and then the > filesystems call fs_event_send_thresh() themselves from appropriate > contexts (ie. without compromising locking, scalability, memory > allocation recursion constraints, etc). > > e.g. instead of tracking every change in free space, a filesystem > might execute this once every few seconds from a workqueue: > > event = fs_event_need_space_warning(sb, ) > if (event) > fs_event_send_thresh(sb, event); > > User still gets warnings about space usage, but there's no runtime > overhead or problems with lock/memory allocation contexts, etc. > > Cheers, > > Dave. > Having fs to keep a firm hand on thresholds limits would indeed be far more sane approach though that would require each fs to add support for that and handle most of it on their own. 
Avoiding this was the main rationale behind this rfc.
If fs people agree to that, I'll be more than willing to drop this
in favour of the per-fs tracking solution.
Personally, I hope they will.

Best Regards
Beata

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kinglong Mee
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Thu, 18 Jun 2015 19:17:21 +0800
Message-ID: <5582A8C1.3000002@gmail.com>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
 <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
Sender: owner-linux-mm@kvack.org
To: Beata Michalska, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org
Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca,
 hughd@google.com, lczerner@redhat.com, hch@infradead.org,
 linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com,
 kmpark@infradead.org, kinglongmee@gmail.com
List-Id: linux-api@vger.kernel.org

On 6/16/2015 9:09 PM, Beata Michalska wrote:
> Introduce configurable generic interface for file
> system-wide event notifications, to provide file
> systems with a common way of reporting any potential
> issues as they emerge.
... snip ...
> +
> +Sample request could look like the following:
> +
> +	echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
> +
> +Multiple request might be specified provided they are separated with semicolon.
> +
> +The configuration itself might be modified at any time. One can add/remove
> +particular event types for given fielsystem, modify the threshold levels,
> +and remove single or all entries from the 'config' file.
> + > + - Adding new event type: > + > + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config > + > +(Note that is is enough to provide the event type to be enabled without Should be "Note that it is ... " here ? > +the already set ones.) > + > + - Removing event type: > + > + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config > + > + - Updating threshold limits: > + > + $ echo MOUNT T L1 L2 > /sys/fs/events/config > + > + - Removing single entry: > + > + $ echo '!MOUNT' > /sys/fs/events/config > + > + - Removing all entries: > + > + $ echo > /sys/fs/events/config > + > +Reading the file will list all registered entries with their current set-up > +along with some additional info like the filesystem type and the backing device > +name if available. > + > +Final, though a very important note on the configuration: when and if the > +actual events are being triggered falls way beyond the scope of the generic > +filesystem events interface. It is up to a particular filesystem > +implementation which events are to be supported - if any at all. So if > +given filesystem does not support the event notifications, an attempt to > +enable those through 'config' file will fail. > + > + > +3. The generic netlink interface support: > +========================================= > + > +Whenever an event notification is triggered (by given filesystem) the current > +configuration is being validated to decide whether a userpsace notification > +should be launched. If there has been no request (in a mean of 'config' file > +entry) for given event, one will be silently disregarded. If, on the other > +hand, someone is 'watching' given filesystem for specific events, a generic > +netlink message will be sent. A dedicated multicast group has been provided > +solely for this purpose so in order to receive such notifications, one should > +subscribe to this new multicast group. As for now only the init network > +namespace is being supported. 
> + > +3.1 Message format > + > +The FS_NL_C_EVENT shall be stored within the generic netlink message header > +as the command field. The message payload will provide more detailed info: > +the backing device major and minor numbers, the event code and the id of > +the process which action led to the event occurrence. In case of threshold > +notifications, the current number of available blocks will be included > +in the payload as well. > + > + > + 0 1 2 3 > + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | NETLINK MESSAGE HEADER | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | GENERIC NETLINK MESSAGE HEADER | > + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | cmd, not cdm. > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | Optional user specific message header | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | GENERIC MESSAGE PAYLOAD: | > + +---------------------------------------------------------------+ > + | FS_NL_A_EVENT_ID (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DEV_MAJOR (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DEV_MINOR (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_CAUSED_ID (NLA_U32) | Should be NLA_U64 ? The following uses as, + if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current)))) + return -EINVAL; Also, I'd like FS_NL_A_CAUSED_PID than FS_NL_A_CAUSED_ID. > + +---------------------------------------------------------------+ > + | FS_NL_A_DATA (NLA_U64) | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + > + > +The above figure is based on: > + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format > + > + ... snip... 
> +	seq_putc(m, ' ');
> +	if (sb->s_op->show_devname) {
> +		sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root);
> +	} else {
> +		seq_escape(m, r_mnt->mnt_devname ? r_mnt->mnt_devname : "none",
> +				" \t\n\\");
> +	}
> +	seq_puts(m, " (");
> +
> +	nmask = en->notify;
> +	for (match = fs_etypes; match->pattern; ++match) {
> +		if (match->token & nmask) {
> +			seq_puts(m, match->pattern);

Printing the thresholds here is better:

	if (match->token & FS_EVENT_THRESH)
		seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange);

> +			nmask &= ~match->token;
> +			if (nmask)
> +				seq_putc(m, ',');
> +		}
> +	}
> +	seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange);

Don't print lrange/urange (they are always zero) when FS_EVENT_THRESH is not set.

> +	seq_puts(m, ")\n");
> +	return 0;
> +}
> +
> +static const struct seq_operations fs_trace_seq_ops = {
> +	.start = fs_trace_seq_start,
> +	.next = fs_trace_seq_next,
> +	.stop = fs_trace_seq_stop,
> +	.show = fs_trace_seq_show,
> +};
> +
> +static int fs_trace_open(struct inode *inode, struct file *file)
> +{
> +	return seq_open(file, &fs_trace_seq_ops);
> +}
> +
> +static const struct file_operations fs_trace_fops = {
> +	.owner = THIS_MODULE,
> +	.open = fs_trace_open,
> +	.write = fs_trace_write,
> +	.read = seq_read,
> +	.llseek = seq_lseek,
> +	.release = seq_release,
> +};
> +
> +static int fs_trace_init(void)
> +{
> +	fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0);
> +	if (!fs_trace_cachep)
> +		return -EINVAL;
> +	init_waitqueue_head(&trace_wq);
> +	return 0;
> +}
> +
> +/* VFS support */
> +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen)
> +{
> +	int ret;
> +	static struct tree_descr desc[] = {
> +		[2] = {
> +			.name = "config",
> +			.ops = &fs_trace_fops,
> +			.mode = S_IWUSR | S_IRUGO,
> +		},
> +		{""},
> +	};
> +
> +	ret = simple_fill_super(sb, 0x7246332, desc);
> +	return !ret ?
fs_trace_init() : ret; > +} > + > +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, > + int ntype, const char *dev_name, void *data) > +{ > + return mount_single(fs_type, ntype, data, fs_trace_fill_super); > +} > + > +static void fs_trace_kill_super(struct super_block *sb) > +{ > + /* > + * The rcu_barrier here will/should make sure all call_rcu > + * callbacks are completed - still there might be some active > + * trace objects in use which can make calling the > + * kmem_cache_destroy unsafe. So we wait until all traces > + * are finally released. > + */ > + fs_remove_all_traces(); > + rcu_barrier(); > + wait_event(trace_wq, !atomic_read(&stray_traces)); > + > + kmem_cache_destroy(fs_trace_cachep); > + kill_litter_super(sb); > +} > + > +static struct kset *fs_trace_kset; > + > +static struct file_system_type fs_trace_fstype = { > + .name = "fstrace", > + .mount = fs_trace_do_mount, > + .kill_sb = fs_trace_kill_super, > +}; > + > +static void __init fs_trace_vfs_init(void) > +{ > + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); > + > + if (!fs_trace_kset) > + return; > + > + if (!register_filesystem(&fs_trace_fstype)) { > + if (!fs_event_netlink_register()) > + return; > + unregister_filesystem(&fs_trace_fstype); > + } > + kset_unregister(fs_trace_kset); > +} > + > +static int __init fs_trace_evens_init(void) > +{ > + fs_trace_vfs_init(); > + return 0; > +}; > +module_init(fs_trace_evens_init); > + > diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h > new file mode 100644 > index 0000000..23f24c8 > --- /dev/null > +++ b/fs/events/fs_event.h > @@ -0,0 +1,22 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. 
> + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > + > +#ifndef __GENERIC_FS_EVENTS_H > +#define __GENERIC_FS_EVENTS_H > + > +int fs_event_netlink_register(void); > +void fs_event_netlink_unregister(void); > + > +#endif /* __GENERIC_FS_EVENTS_H */ > diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c > new file mode 100644 > index 0000000..0c97eb7 > --- /dev/null > +++ b/fs/events/fs_event_netlink.c > @@ -0,0 +1,104 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. 
> + */
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include "fs_event.h"
> +
> +static const struct genl_multicast_group fs_event_mcgroups[] = {
> +	{ .name = FS_EVENTS_MCAST_GRP_NAME, },
> +};
> +
> +static struct genl_family fs_event_family = {
> +	.id = GENL_ID_GENERATE,
> +	.name = FS_EVENTS_FAMILY_NAME,
> +	.version = 1,
> +	.maxattr = FS_NL_A_MAX,
> +	.mcgrps = fs_event_mcgroups,
> +	.n_mcgrps = ARRAY_SIZE(fs_event_mcgroups),
> +};
> +
> +int fs_netlink_send_event(size_t size, unsigned int event_id,
> +			  int (*compose_msg)(struct sk_buff *skb, void *data),
> +			  void *cbdata)
> +{
> +	static atomic_t seq;
> +	struct sk_buff *skb;
> +	void *msg_head;
> +	int ret = 0;
> +
> +	if (!size || !compose_msg)
> +		return -EINVAL;
> +
> +	/* Skip if there are no listeners */
> +	if (!genl_has_listeners(&fs_event_family, &init_net, 0))
> +		return 0;
> +
> +	if (event_id != FS_EVENT_NONE)
> +		size += nla_total_size(sizeof(u32));
> +	size += nla_total_size(sizeof(u64));

What is this for?

thanks
Kinglong Mee

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Beata Michalska
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Thu, 18 Jun 2015 16:50:11 +0200
Message-ID: <5582DAA3.8080204@samsung.com>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
 <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
 <5582A8C1.3000002@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Return-path:
In-reply-to: <5582A8C1.3000002@gmail.com>
Sender: owner-linux-mm@kvack.org
To: Kinglong Mee
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu,
 adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com,
 hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org,
 kyungmin.park@samsung.com, kmpark@infradead.org
List-Id: linux-api@vger.kernel.org

Hi,

On 06/18/2015 01:17 PM, Kinglong Mee wrote:
> On 6/16/2015 9:09 PM, Beata Michalska wrote:
>> Introduce configurable generic interface for file
>> system-wide event notifications, to provide file
>> systems with a common way of reporting any potential
>> issues as they emerge.
> ... snip ...
>> +
>> +Sample request could look like the following:
>> +
>> +	echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
>> +
>> +Multiple request might be specified provided they are separated with semicolon.
>> +
>> +The configuration itself might be modified at any time. One can add/remove
>> +particular event types for given fielsystem, modify the threshold levels,
>> +and remove single or all entries from the 'config' file.
>> +
>> + - Adding new event type:
>> +
>> +	$ echo MOUNT EVENT_TYPE > /sys/fs/events/config
>> +
>> +(Note that is is enough to provide the event type to be enabled without
>
> Should be "Note that it is ... " here ?
Right
>
>> +the already set ones.)
>> + >> + - Removing event type: >> + >> + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config >> + >> + - Updating threshold limits: >> + >> + $ echo MOUNT T L1 L2 > /sys/fs/events/config >> + >> + - Removing single entry: >> + >> + $ echo '!MOUNT' > /sys/fs/events/config >> + >> + - Removing all entries: >> + >> + $ echo > /sys/fs/events/config >> + >> +Reading the file will list all registered entries with their current set-up >> +along with some additional info like the filesystem type and the backing device >> +name if available. >> + >> +Final, though a very important note on the configuration: when and if the >> +actual events are being triggered falls way beyond the scope of the generic >> +filesystem events interface. It is up to a particular filesystem >> +implementation which events are to be supported - if any at all. So if >> +given filesystem does not support the event notifications, an attempt to >> +enable those through 'config' file will fail. >> + >> + >> +3. The generic netlink interface support: >> +========================================= >> + >> +Whenever an event notification is triggered (by given filesystem) the current >> +configuration is being validated to decide whether a userpsace notification >> +should be launched. If there has been no request (in a mean of 'config' file >> +entry) for given event, one will be silently disregarded. If, on the other >> +hand, someone is 'watching' given filesystem for specific events, a generic >> +netlink message will be sent. A dedicated multicast group has been provided >> +solely for this purpose so in order to receive such notifications, one should >> +subscribe to this new multicast group. As for now only the init network >> +namespace is being supported. >> + >> +3.1 Message format >> + >> +The FS_NL_C_EVENT shall be stored within the generic netlink message header >> +as the command field. 
The message payload will provide more detailed info: >> +the backing device major and minor numbers, the event code and the id of >> +the process which action led to the event occurrence. In case of threshold >> +notifications, the current number of available blocks will be included >> +in the payload as well. >> + >> + >> + 0 1 2 3 >> + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | NETLINK MESSAGE HEADER | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | GENERIC NETLINK MESSAGE HEADER | >> + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | > > cmd, not cdm. ditto > >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | Optional user specific message header | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | GENERIC MESSAGE PAYLOAD: | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_EVENT_ID (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DEV_MAJOR (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DEV_MINOR (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_CAUSED_ID (NLA_U32) | > > Should be NLA_U64 ? The following uses as, > > + if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current)))) > + return -EINVAL; > Yes, or nla_put_u32 - either way my bad > Also, I'd like FS_NL_A_CAUSED_PID than FS_NL_A_CAUSED_ID. Alright > >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DATA (NLA_U64) | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + >> + >> +The above figure is based on: >> + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format >> + >> + > ... snip... 
>> + seq_putc(m, ' '); >> + if (sb->s_op->show_devname) { >> + sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root); >> + } else { >> + seq_escape(m, r_mnt->mnt_devname ? r_mnt->mnt_devname : "none", >> + " \t\n\\"); >> + } >> + seq_puts(m, " ("); >> + >> + nmask = en->notify; >> + for (match = fs_etypes; match->pattern; ++match) { >> + if (match->token & nmask) { >> + seq_puts(m, match->pattern); > > Print here is better. > > if (match->pattern & FS_EVENT_THRESH) > seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); > >> + nmask &= ~match->token; >> + if (nmask) >> + seq_putc(m, ','); >> + } >> + } >> + seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); > > Don't print the lrange/urange (always be zero) when without FS_EVENT_THRESH. > ditto >> + seq_puts(m, ")\n"); >> + return 0; >> +} >> + >> +static const struct seq_operations fs_trace_seq_ops = { >> + .start = fs_trace_seq_start, >> + .next = fs_trace_seq_next, >> + .stop = fs_trace_seq_stop, >> + .show = fs_trace_seq_show, >> +}; >> + >> +static int fs_trace_open(struct inode *inode, struct file *file) >> +{ >> + return seq_open(file, &fs_trace_seq_ops); >> +} >> + >> +static const struct file_operations fs_trace_fops = { >> + .owner = THIS_MODULE, >> + .open = fs_trace_open, >> + .write = fs_trace_write, >> + .read = seq_read, >> + .llseek = seq_lseek, >> + .release = seq_release, >> +}; >> + >> +static int fs_trace_init(void) >> +{ >> + fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0); >> + if (!fs_trace_cachep) >> + return -EINVAL; >> + init_waitqueue_head(&trace_wq); >> + return 0; >> +} >> + >> +/* VFS support */ >> +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen) >> +{ >> + int ret; >> + static struct tree_descr desc[] = { >> + [2] = { >> + .name = "config", >> + .ops = &fs_trace_fops, >> + .mode = S_IWUSR | S_IRUGO, >> + }, >> + {""}, >> + }; >> + >> + ret = simple_fill_super(sb, 0x7246332, desc); >> + return !ret ? 
fs_trace_init() : ret; >> +} >> + >> +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, >> + int ntype, const char *dev_name, void *data) >> +{ >> + return mount_single(fs_type, ntype, data, fs_trace_fill_super); >> +} >> + >> +static void fs_trace_kill_super(struct super_block *sb) >> +{ >> + /* >> + * The rcu_barrier here will/should make sure all call_rcu >> + * callbacks are completed - still there might be some active >> + * trace objects in use which can make calling the >> + * kmem_cache_destroy unsafe. So we wait until all traces >> + * are finally released. >> + */ >> + fs_remove_all_traces(); >> + rcu_barrier(); >> + wait_event(trace_wq, !atomic_read(&stray_traces)); >> + >> + kmem_cache_destroy(fs_trace_cachep); >> + kill_litter_super(sb); >> +} >> + >> +static struct kset *fs_trace_kset; >> + >> +static struct file_system_type fs_trace_fstype = { >> + .name = "fstrace", >> + .mount = fs_trace_do_mount, >> + .kill_sb = fs_trace_kill_super, >> +}; >> + >> +static void __init fs_trace_vfs_init(void) >> +{ >> + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); >> + >> + if (!fs_trace_kset) >> + return; >> + >> + if (!register_filesystem(&fs_trace_fstype)) { >> + if (!fs_event_netlink_register()) >> + return; >> + unregister_filesystem(&fs_trace_fstype); >> + } >> + kset_unregister(fs_trace_kset); >> +} >> + >> +static int __init fs_trace_evens_init(void) >> +{ >> + fs_trace_vfs_init(); >> + return 0; >> +}; >> +module_init(fs_trace_evens_init); >> + >> diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h >> new file mode 100644 >> index 0000000..23f24c8 >> --- /dev/null >> +++ b/fs/events/fs_event.h >> @@ -0,0 +1,22 @@ >> +/* >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. 
>> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. >> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + */ >> + >> +#ifndef __GENERIC_FS_EVENTS_H >> +#define __GENERIC_FS_EVENTS_H >> + >> +int fs_event_netlink_register(void); >> +void fs_event_netlink_unregister(void); >> + >> +#endif /* __GENERIC_FS_EVENTS_H */ >> diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c >> new file mode 100644 >> index 0000000..0c97eb7 >> --- /dev/null >> +++ b/fs/events/fs_event_netlink.c >> @@ -0,0 +1,104 @@ >> +/* >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. >> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. >> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. 
>> + */
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include "fs_event.h"
>> +
>> +static const struct genl_multicast_group fs_event_mcgroups[] = {
>> +	{ .name = FS_EVENTS_MCAST_GRP_NAME, },
>> +};
>> +
>> +static struct genl_family fs_event_family = {
>> +	.id = GENL_ID_GENERATE,
>> +	.name = FS_EVENTS_FAMILY_NAME,
>> +	.version = 1,
>> +	.maxattr = FS_NL_A_MAX,
>> +	.mcgrps = fs_event_mcgroups,
>> +	.n_mcgrps = ARRAY_SIZE(fs_event_mcgroups),
>> +};
>> +
>> +int fs_netlink_send_event(size_t size, unsigned int event_id,
>> +			  int (*compose_msg)(struct sk_buff *skb, void *data),
>> +			  void *cbdata)
>> +{
>> +	static atomic_t seq;
>> +	struct sk_buff *skb;
>> +	void *msg_head;
>> +	int ret = 0;
>> +
>> +	if (!size || !compose_msg)
>> +		return -EINVAL;
>> +
>> +	/* Skip if there are no listeners */
>> +	if (!genl_has_listeners(&fs_event_family, &init_net, 0))
>> +		return 0;
>> +
>> +	if (event_id != FS_EVENT_NONE)
>> +		size += nla_total_size(sizeof(u32));
>> +	size += nla_total_size(sizeof(u64));
>
> What is this for ?
>
This should actually get removed :)
> thanks
> Kinglong Mee
>

Thank You,
Best Regards
Beata

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dave Chinner
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Fri, 19 Jun 2015 10:03:41 +1000
Message-ID: <20150619000341.GM10224@dastard>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
 <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
 <20150617230605.GK10224@dastard>
 <55828064.5040301@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <55828064.5040301@samsung.com>
Sender: owner-linux-mm@kvack.org
To: Beata Michalska
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu,
 adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com,
 hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org,
 kyungmin.park@samsung.com, kmpark@infradead.org
List-Id: linux-api@vger.kernel.org

On Thu, Jun 18, 2015 at 10:25:08AM +0200, Beata Michalska wrote:
> On 06/18/2015 01:06 AM, Dave Chinner wrote:
> > On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote:
> >> Introduce configurable generic interface for file
> >> system-wide event notifications, to provide file
> >> systems with a common way of reporting any potential
> >> issues as they emerge.
> >>
> >> The notifications are to be issued through generic
> >> netlink interface by newly introduced multicast group.
> >>
> >> Threshold notifications have been included, allowing
> >> triggering an event whenever the amount of free space drops
> >> below a certain level - or levels to be more precise as two
> >> of them are being supported: the lower and the upper range.
> >> The notifications work both ways: once the threshold level
> >> has been reached, an event shall be generated whenever
> >> the number of available blocks goes up again re-activating
> >> the threshold.
> >> > >> The interface has been exposed through a vfs. Once mounted, > >> it serves as an entry point for the set-up where one can > >> register for particular file system events. > >> > >> Signed-off-by: Beata Michalska > > > > This has massive scalability problems: .... > > Have you noticed that the filesystems have percpu counters for > > tracking global space usage? There's good reason for that - taking a > > spinlock in such a hot accounting path causes severe contention. .... > > Then puts the entire netlink send path inside this spinlock, which > > includes memory allocation and all sorts of non-filesystem code > > paths. And it may be inside critical filesystem locks as well.... > > > > Apart from the serialisation problem of the locking, adding > > memory allocation and the network send path to filesystem code > > that is effectively considered "innermost" filesystem code is going > > to have all sorts of problems for various filesystems. In the XFS > > case, we simply cannot execute this sort of function in the places > > where we update global space accounting. > > > > As it is, I think the basic concept of separate tracking of free > > space if fundamentally flawed. What I think needs to be done is that > > filesystems need access to the thresholds for events, and then the > > filesystems call fs_event_send_thresh() themselves from appropriate > > contexts (ie. without compromising locking, scalability, memory > > allocation recursion constraints, etc). > > > > e.g. instead of tracking every change in free space, a filesystem > > might execute this once every few seconds from a workqueue: > > > > event = fs_event_need_space_warning(sb, ) > > if (event) > > fs_event_send_thresh(sb, event); > > > > User still gets warnings about space usage, but there's no runtime > > overhead or problems with lock/memory allocation contexts, etc. 
> > Having fs to keep a firm hand on thresholds limits would indeed be > far more sane approach though that would require each fs to > add support for that and handle most of it on their own. Avoiding >> this was the main rationale behind this rfc. > If fs people agree to that, I'll be more than willing to drop this > in favour of the per-fs tracking solution. > Personally, I hope they will. I was hoping that you'd think a little more about my suggestion and work out how to do background threshold event detection generically. I kind of left it as "an exercise for the reader" because it seems obvious to me. Hint: ->statfs allows you to get the total, free and used space from filesystems in a generic manner. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Fri, 19 Jun 2015 19:28:11 +0200 Message-ID: <5584512B.5020301@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <20150617230605.GK10224@dastard> <55828064.5040301@samsung.com> <20150619000341.GM10224@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Return-path: In-reply-to: <20150619000341.GM10224@dastard> Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org To: Dave Chinner Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, greg-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org, jack-AlSwsSmVLrQ@public.gmane.org, tytso-3s7WtUTddSA@public.gmane.org, adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org, 
hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, kyungmin.park-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org, kmpark-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org List-Id: linux-api@vger.kernel.org On 06/19/2015 02:03 AM, Dave Chinner wrote: > On Thu, Jun 18, 2015 at 10:25:08AM +0200, Beata Michalska wrote: >> On 06/18/2015 01:06 AM, Dave Chinner wrote: >>> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: >>>> Introduce configurable generic interface for file >>>> system-wide event notifications, to provide file >>>> systems with a common way of reporting any potential >>>> issues as they emerge. >>>> >>>> The notifications are to be issued through generic >>>> netlink interface by newly introduced multicast group. >>>> >>>> Threshold notifications have been included, allowing >>>> triggering an event whenever the amount of free space drops >>>> below a certain level - or levels to be more precise as two >>>> of them are being supported: the lower and the upper range. >>>> The notifications work both ways: once the threshold level >>>> has been reached, an event shall be generated whenever >>>> the number of available blocks goes up again re-activating >>>> the threshold. >>>> >>>> The interface has been exposed through a vfs. Once mounted, >>>> it serves as an entry point for the set-up where one can >>>> register for particular file system events. >>>> >>>> Signed-off-by: Beata Michalska >>> >>> This has massive scalability problems: > .... >>> Have you noticed that the filesystems have percpu counters for >>> tracking global space usage? There's good reason for that - taking a >>> spinlock in such a hot accounting path causes severe contention. > .... 
>>> Then puts the entire netlink send path inside this spinlock, which >>> includes memory allocation and all sorts of non-filesystem code >>> paths. And it may be inside critical filesystem locks as well.... >>> >>> Apart from the serialisation problem of the locking, adding >>> memory allocation and the network send path to filesystem code >>> that is effectively considered "innermost" filesystem code is going >>> to have all sorts of problems for various filesystems. In the XFS >>> case, we simply cannot execute this sort of function in the places >>> where we update global space accounting. >>> >>> As it is, I think the basic concept of separate tracking of free >>> space if fundamentally flawed. What I think needs to be done is that >>> filesystems need access to the thresholds for events, and then the >>> filesystems call fs_event_send_thresh() themselves from appropriate >>> contexts (ie. without compromising locking, scalability, memory >>> allocation recursion constraints, etc). >>> >>> e.g. instead of tracking every change in free space, a filesystem >>> might execute this once every few seconds from a workqueue: >>> >>> event = fs_event_need_space_warning(sb, ) >>> if (event) >>> fs_event_send_thresh(sb, event); >>> >>> User still gets warnings about space usage, but there's no runtime >>> overhead or problems with lock/memory allocation contexts, etc. >> >> Having fs to keep a firm hand on thresholds limits would indeed be >> far more sane approach though that would require each fs to >> add support for that and handle most of it on their own. Avoiding >>> this was the main rationale behind this rfc. >> If fs people agree to that, I'll be more than willing to drop this >> in favour of the per-fs tracking solution. >> Personally, I hope they will. > > I was hoping that you'd think a little more about my suggestion and > work out how to do background threshold event detection generically. 
> I kind of left it as "an exercise for the reader" because it seems > obvious to me. > > Hint: ->statfs allows you to get the total, free and used space > from filesystems in a generic manner. > > Cheers, > > Dave. > I haven't given up on that, so yes, I'm still working on a more suitable generic solution. Background detection is one of the options, though it needs some more thoughts. Giving up the sync approach means less accuracy for the threshold notifications, but I guess this could be fine-tuned to get an acceptable level. Another bump: how this tuning is supposed to be done (additional config option maybe)? The interface would have to keep it somehow sane - but what would 'sane' mean in this case? Also, I'm not sure whether a single approach would serve well here for all the potentially supported file systems, so this would have to be properly adjusted (taking the threshold levels into consideration as well). And still, it would require some form of synchronization with the tracked fs so that this 'detection' is not being unnecessarily performed (i.e. while the fs remains frozen). There is also an idea of using an interface resembling the stackable fs: a transparent file system layered on top of the tracked one (solely for tracking purposes). This would simplify handling the trace object's lifetime - no more list of registered traces. It would also give a way of tracking (to some extent) the changes in the amount of available space, which combined with a tweaked background check could give a solution with less performance overhead than the original one. I'll try this one and see how it goes. Thank you for your feedback so far - I really appreciate it.
Best Regards Beata From mboxrd@z Thu Jan 1 00:00:00 1970 From: Dave Chinner Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Sat, 20 Jun 2015 09:21:17 +1000 Message-ID: <20150619232117.GN10224@dastard> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <20150617230605.GK10224@dastard> <55828064.5040301@samsung.com> <20150619000341.GM10224@dastard> <5584512B.5020301@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Content-Disposition: inline In-Reply-To: <5584512B.5020301@samsung.com> Sender: owner-linux-mm@kvack.org To: Beata Michalska Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org On Fri, Jun 19, 2015 at 07:28:11PM +0200, Beata Michalska wrote: > On 06/19/2015 02:03 AM, Dave Chinner wrote: > > On Thu, Jun 18, 2015 at 10:25:08AM +0200, Beata Michalska wrote: > >> On 06/18/2015 01:06 AM, Dave Chinner wrote: > >>> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: > >>>> Introduce configurable generic interface for file > >>>> system-wide event notifications, to provide file > >>>> systems with a common way of reporting any potential > >>>> issues as they emerge. > >>>> > >>>> The notifications are to be issued through generic > >>>> netlink interface by newly introduced multicast group. > >>>> > >>>> Threshold notifications have been included, allowing > >>>> triggering an event whenever the amount of free space drops > >>>> below a certain level - or levels to be more precise as two > >>>> of them are being supported: the lower and the upper range. 
> >>>> The notifications work both ways: once the threshold level > >>>> has been reached, an event shall be generated whenever > >>>> the number of available blocks goes up again re-activating > >>>> the threshold. > >>>> > >>>> The interface has been exposed through a vfs. Once mounted, > >>>> it serves as an entry point for the set-up where one can > >>>> register for particular file system events. > >>>> > >>>> Signed-off-by: Beata Michalska > >>> > >>> This has massive scalability problems: > > .... > >>> Have you noticed that the filesystems have percpu counters for > >>> tracking global space usage? There's good reason for that - taking a > >>> spinlock in such a hot accounting path causes severe contention. > > .... > >>> Then puts the entire netlink send path inside this spinlock, which > >>> includes memory allocation and all sorts of non-filesystem code > >>> paths. And it may be inside critical filesystem locks as well.... > >>> > >>> Apart from the serialisation problem of the locking, adding > >>> memory allocation and the network send path to filesystem code > >>> that is effectively considered "innermost" filesystem code is going > >>> to have all sorts of problems for various filesystems. In the XFS > >>> case, we simply cannot execute this sort of function in the places > >>> where we update global space accounting. > >>> > >>> As it is, I think the basic concept of separate tracking of free > >>> space if fundamentally flawed. What I think needs to be done is that > >>> filesystems need access to the thresholds for events, and then the > >>> filesystems call fs_event_send_thresh() themselves from appropriate > >>> contexts (ie. without compromising locking, scalability, memory > >>> allocation recursion constraints, etc). > >>> > >>> e.g. 
instead of tracking every change in free space, a filesystem > >>> might execute this once every few seconds from a workqueue: > >>> > >>> event = fs_event_need_space_warning(sb, ) > >>> if (event) > >>> fs_event_send_thresh(sb, event); > >>> > >>> User still gets warnings about space usage, but there's no runtime > >>> overhead or problems with lock/memory allocation contexts, etc. > >> > >> Having fs to keep a firm hand on thresholds limits would indeed be > >> far more sane approach though that would require each fs to > >> add support for that and handle most of it on their own. Avoiding > >>> this was the main rationale behind this rfc. > >> If fs people agree to that, I'll be more than willing to drop this > >> in favour of the per-fs tracking solution. > >> Personally, I hope they will. > > > > I was hoping that you'd think a little more about my suggestion and > > work out how to do background threshold event detection generically. > > I kind of left it as "an exercise for the reader" because it seems > > obvious to me. > > > > Hint: ->statfs allows you to get the total, free and used space > > from filesystems in a generic manner. > > > > Cheers, > > > > Dave. > > > > I haven't given up on that, so yes, I'm still working on a more suitable > generic solution. > Background detection is one of the options, though it needs some more thoughts. > Giving up the sync approach means less accuracy for the threshold notifications, > but I guess this could be fine-tuned to get an acceptable level. Accuracy really doesn't matter for threshold notifications - by the time the event is delivered to userspace it can already be wrong. > Another bump: > how this tuning is supposed to be done (additional config option maybe)? Why would you need to tune it at all? You can't *stop* the operation that is triggering the threshold, so a few seconds delay on delivery isn't going to make any difference to anyone.... You're overthinking this massively. 
All this needs is a work item per superblock, and when the thresholds are turned on it queues a self-repeating delayed work that calls ->statfs, checks against the configured threshold, issues an event if necessary, and then queues itself again to run next period. When the threshold is turned off, the work is cancelled. Another option: a kernel thread that runs periodically and just calls iterate_supers() with a function that checks the sb for threshold events, and if configured runs ->statfs and does the work, otherwise skips the sb. That avoids all the lifetime issues with using workqueues, you don't need a struct work, etc. > There is also an idea of using an interface resembling the stackable fs: No. Just .... No. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Mon, 22 Jun 2015 17:46:27 +0200 Message-ID: <55882DD3.5040002@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <20150617230605.GK10224@dastard> <55828064.5040301@samsung.com> <20150619000341.GM10224@dastard> <5584512B.5020301@samsung.com> <20150619232117.GN10224@dastard> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Return-path: In-reply-to: <20150619232117.GN10224@dastard> Sender: owner-linux-mm@kvack.org To: Dave Chinner Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org 
List-Id: linux-api@vger.kernel.org On 06/20/2015 01:21 AM, Dave Chinner wrote: > On Fri, Jun 19, 2015 at 07:28:11PM +0200, Beata Michalska wrote: >> On 06/19/2015 02:03 AM, Dave Chinner wrote: >>> On Thu, Jun 18, 2015 at 10:25:08AM +0200, Beata Michalska wrote: >>>> On 06/18/2015 01:06 AM, Dave Chinner wrote: >>>>> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: >>>>>> Introduce configurable generic interface for file >>>>>> system-wide event notifications, to provide file >>>>>> systems with a common way of reporting any potential >>>>>> issues as they emerge. >>>>>> >>>>>> The notifications are to be issued through generic >>>>>> netlink interface by newly introduced multicast group. >>>>>> >>>>>> Threshold notifications have been included, allowing >>>>>> triggering an event whenever the amount of free space drops >>>>>> below a certain level - or levels to be more precise as two >>>>>> of them are being supported: the lower and the upper range. >>>>>> The notifications work both ways: once the threshold level >>>>>> has been reached, an event shall be generated whenever >>>>>> the number of available blocks goes up again re-activating >>>>>> the threshold. >>>>>> >>>>>> The interface has been exposed through a vfs. Once mounted, >>>>>> it serves as an entry point for the set-up where one can >>>>>> register for particular file system events. >>>>>> >>>>>> Signed-off-by: Beata Michalska >>>>> >>>>> This has massive scalability problems: >>> .... >>>>> Have you noticed that the filesystems have percpu counters for >>>>> tracking global space usage? There's good reason for that - taking a >>>>> spinlock in such a hot accounting path causes severe contention. >>> .... >>>>> Then puts the entire netlink send path inside this spinlock, which >>>>> includes memory allocation and all sorts of non-filesystem code >>>>> paths. And it may be inside critical filesystem locks as well.... 
>>>>> >>>>> Apart from the serialisation problem of the locking, adding >>>>> memory allocation and the network send path to filesystem code >>>>> that is effectively considered "innermost" filesystem code is going >>>>> to have all sorts of problems for various filesystems. In the XFS >>>>> case, we simply cannot execute this sort of function in the places >>>>> where we update global space accounting. >>>>> >>>>> As it is, I think the basic concept of separate tracking of free >>>>> space if fundamentally flawed. What I think needs to be done is that >>>>> filesystems need access to the thresholds for events, and then the >>>>> filesystems call fs_event_send_thresh() themselves from appropriate >>>>> contexts (ie. without compromising locking, scalability, memory >>>>> allocation recursion constraints, etc). >>>>> >>>>> e.g. instead of tracking every change in free space, a filesystem >>>>> might execute this once every few seconds from a workqueue: >>>>> >>>>> event = fs_event_need_space_warning(sb, ) >>>>> if (event) >>>>> fs_event_send_thresh(sb, event); >>>>> >>>>> User still gets warnings about space usage, but there's no runtime >>>>> overhead or problems with lock/memory allocation contexts, etc. >>>> >>>> Having fs to keep a firm hand on thresholds limits would indeed be >>>> far more sane approach though that would require each fs to >>>> add support for that and handle most of it on their own. Avoiding >>>>> this was the main rationale behind this rfc. >>>> If fs people agree to that, I'll be more than willing to drop this >>>> in favour of the per-fs tracking solution. >>>> Personally, I hope they will. >>> >>> I was hoping that you'd think a little more about my suggestion and >>> work out how to do background threshold event detection generically. >>> I kind of left it as "an exercise for the reader" because it seems >>> obvious to me. >>> >>> Hint: ->statfs allows you to get the total, free and used space >>> from filesystems in a generic manner. 
>>> >>> Cheers, >>> >>> Dave. >>> >> >> I haven't given up on that, so yes, I'm still working on a more suitable >> generic solution. >> Background detection is one of the options, though it needs some more thoughts. >> Giving up the sync approach means less accuracy for the threshold notifications, >> but I guess this could be fine-tuned to get an acceptable level. > > Accuracy really doesn't matter for threshold notifications - by the > time the event is delivered to userspace it can already be wrong. > >> Another bump: >> how this tuning is supposed to be done (additional config option maybe)? > > Why would you need to tune it at all? You can't *stop* the operation > that is triggering the threshold, so a few seconds delay on delivery > isn't going to make any difference to anyone.... > > You're overthinking this massively. All this needs is a work item > per superblock, and when the thresholds are turned on it queues a > self-repeating delayed work that calls ->statfs, checks against the > configured threshold, issues an event if necessary, and then queues > itself again to run next period. When the threshold is turned off, > the work is cancelled. > > Another option: a kernel thread that runs periodically and just > calls iterate_supers() with a function that checks the sb for > threshold events, and if configured runs ->statfs and does the work, > otherwise skips the sb. That avoids all the lifetime issues with > using workqueues, you don't need a struct work, etc. > >> There is also an idea of using an interface resembling the stackable fs: > > No. Just .... No. > > Cheers, > > Dave. > Alright, I'll make appropriate changes to move the threshold verification into the background and see how it works. Thanks, Best Regards Beata -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
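The background threshold check agreed on in this exchange can be illustrated from userspace. What follows is only a minimal sketch under stated assumptions: it uses POSIX statvfs() as a stand-in for the in-kernel ->statfs call, it takes two thresholds expressed in available blocks with L1 > L2 (the ordering described in the patch documentation), and every name in it (thresh_state, thresh_init, thresh_classify, thresh_update, thresh_poll) is hypothetical and not part of the patch set.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical names throughout; nothing here is taken from the patch. */
enum thresh_level {
    THRESH_OK = 0,    /* avail >= L1: above both thresholds */
    THRESH_LOWER = 1, /* L2 <= avail < L1: lower threshold crossed */
    THRESH_UPPER = 2  /* avail < L2: upper threshold crossed */
};

struct thresh_state {
    uint64_t l1;            /* lower threshold, in available blocks */
    uint64_t l2;            /* upper threshold, in available blocks (l2 < l1) */
    enum thresh_level last; /* level seen by the previous check */
};

void thresh_init(struct thresh_state *st, uint64_t l1, uint64_t l2)
{
    assert(l1 > l2); /* fewer available blocks means a more severe level */
    st->l1 = l1;
    st->l2 = l2;
    st->last = THRESH_OK;
}

enum thresh_level thresh_classify(const struct thresh_state *st, uint64_t avail)
{
    if (avail < st->l2)
        return THRESH_UPPER;
    if (avail < st->l1)
        return THRESH_LOWER;
    return THRESH_OK;
}

/*
 * Compare the current level with the one seen last time.
 * Returns 1 and stores the new level in *out when the level changed
 * (in either direction), 0 when there is nothing to report.
 */
int thresh_update(struct thresh_state *st, uint64_t avail,
                  enum thresh_level *out)
{
    enum thresh_level now = thresh_classify(st, avail);

    if (now == st->last)
        return 0;
    st->last = now;
    *out = now;
    return 1;
}

/*
 * One polling step for the filesystem mounted at mnt.
 * Returns -1 if statvfs() fails, otherwise the result of thresh_update().
 */
int thresh_poll(struct thresh_state *st, const char *mnt,
                enum thresh_level *out)
{
    struct statvfs vfs;

    if (statvfs(mnt, &vfs) != 0)
        return -1;
    return thresh_update(st, (uint64_t)vfs.f_bavail, out);
}
```

A caller would invoke thresh_poll() every few seconds and emit a notification whenever it returns 1. Because only level changes are reported, a recovery above a threshold is signalled as well, which mirrors the "works both ways" behaviour described in the patch. In the kernel, the same comparison would presumably sit in a self-requeueing delayed work item or a periodic kernel thread, as suggested in the thread.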
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dmitry Monakhov
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Wed, 24 Jun 2015 11:47:18 +0300
Message-ID: <87oak5ebmx.fsf@openvz.org>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
Mime-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha512; protocol="application/pgp-signature"
Return-path:
In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org>
Sender: linux-api-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Beata Michalska, linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: greg-U8xfFu+wG4EAvxtiuMwx3w@public.gmane.org, jack-AlSwsSmVLrQ@public.gmane.org, tytso-3s7WtUTddSA@public.gmane.org, adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org, hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org, lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org, linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org, kyungmin.park-Sze3O3UU22JBDgjK7y7TUQ@public.gmane.org, kmpark-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org
List-Id: linux-api@vger.kernel.org

Beata Michalska writes:

> Introduce configurable generic interface for file
> system-wide event notifications, to provide file
> systems with a common way of reporting any potential
> issues as they emerge.
>
> The notifications are to be issued through generic
> netlink interface by newly introduced multicast group.
>
> Threshold notifications have been included, allowing
> triggering an event whenever the amount of free space drops
> below a certain level - or levels to be more precise as two
> of them are being supported: the lower and the upper range.
> The notifications work both ways: once the threshold level
> has been reached, an event shall be generated whenever
> the number of available blocks goes up again re-activating
> the threshold.
>
> The interface has been exposed through a vfs. Once mounted,
> it serves as an entry point for the set-up where one can
> register for particular file system events.
>
> Signed-off-by: Beata Michalska
> ---
>  Documentation/filesystems/events.txt |  232 ++++++++++
>  fs/Kconfig                           |    2 +
>  fs/Makefile                          |    1 +
>  fs/events/Kconfig                    |    7 +
>  fs/events/Makefile                   |    5 +
>  fs/events/fs_event.c                 |  809 ++++++++++++++++++++++++++++++++++
>  fs/events/fs_event.h                 |   22 +
>  fs/events/fs_event_netlink.c         |  104 +++++
>  fs/namespace.c                       |    1 +
>  include/linux/fs.h                   |    6 +-
>  include/linux/fs_event.h             |   72 +++
>  include/uapi/linux/Kbuild            |    1 +
>  include/uapi/linux/fs_event.h        |   58 +++
>  13 files changed, 1319 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/filesystems/events.txt
>  create mode 100644 fs/events/Kconfig
>  create mode 100644 fs/events/Makefile
>  create mode 100644 fs/events/fs_event.c
>  create mode 100644 fs/events/fs_event.h
>  create mode 100644 fs/events/fs_event_netlink.c
>  create mode 100644 include/linux/fs_event.h
>  create mode 100644 include/uapi/linux/fs_event.h
>
> diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt
> new file mode 100644
> index 0000000..c2e6227
> --- /dev/null
> +++ b/Documentation/filesystems/events.txt
> @@ -0,0 +1,232 @@
> +
> +    Generic file system event notification interface
> +
> +Document created 23 April 2015 by Beata Michalska
> +
> +1. The reason behind:
> +=====================
> +
> +There are many corner cases when things might get messy with the filesystems.
> +And it is not always obvious what and when went wrong. Sometimes you might
> +get some subtle hints that there is something going on - but by the time
> +you realise it, it might be too late as you are already out-of-space
> +or the filesystem has been remounted as read-only (i.e.). The generic
> +interface for the filesystem events fills the gap by providing a rather
> +easy way of real-time notifications triggered whenever something interesting
> +happens, allowing filesystems to report events in a common way, as they occur.
> +
> +2. How does it work:
> +====================
> +
> +The interface itself has been exposed as fstrace-type Virtual File System,
> +primarily to ease the process of setting up the configuration for the
> +notifications. So for starters, it needs to get mounted (obviously):
> +
> +    mount -t fstrace none /sys/fs/events
> +
> +This will unveil the single fstrace filesystem entry - the 'config' file,
> +through which the notification are being set-up.
> +
> +Activating notifications for particular filesystem is as straightforward
> +as writing into the 'config' file. Note that by default all events, despite
> +the actual filesystem type, are being disregarded.
> +
> +Synopsis of config:
> +------------------
> +
> + MOUNT EVENT_TYPE [L1] [L2]
> +
> + MOUNT      : the filesystem's mount point
> + EVENT_TYPE : event types - currently two of them are being supported:
> +
> +    * generic events ("G") covering most common warnings
> +      and errors that might be reported by any filesystem;
> +      this option does not take any arguments;
> +
> +    * threshold notifications ("T") - events sent whenever
> +      the amount of available space drops below certain level;
> +      it is possible to specify two threshold levels though
> +      only one is required to properly setup the notifications;
> +      as those refer to the number of available blocks, the lower
> +      level [L1] needs to be higher than the upper one [L2]
> +
> +Sample request could look like the following:
> +
> +    echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
> +
> +Multiple request might be specified provided they are separated with semicolon.
> +
> +The configuration itself might be modified at any time. One can add/remove
> +particular event types for given fielsystem, modify the threshold levels,
> +and remove single or all entries from the 'config' file.
> +
> + - Adding new event type:
> +
> +    $ echo MOUNT EVENT_TYPE > /sys/fs/events/config
> +
> +(Note that is is enough to provide the event type to be enabled without
> +the already set ones.)
> +
> + - Removing event type:
> +
> +    $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config
> +
> + - Updating threshold limits:
> +
> +    $ echo MOUNT T L1 L2 > /sys/fs/events/config
> +
> + - Removing single entry:
> +
> +    $ echo '!MOUNT' > /sys/fs/events/config
> +
> + - Removing all entries:
> +
> +    $ echo > /sys/fs/events/config
> +
> +Reading the file will list all registered entries with their current set-up
> +along with some additional info like the filesystem type and the backing device
> +name if available.
> +
> +Final, though a very important note on the configuration: when and if the
> +actual events are being triggered falls way beyond the scope of the generic
> +filesystem events interface. It is up to a particular filesystem
> +implementation which events are to be supported - if any at all. So if
> +given filesystem does not support the event notifications, an attempt to
> +enable those through 'config' file will fail.
> +
> +
> +3. The generic netlink interface support:
> +=========================================
> +
> +Whenever an event notification is triggered (by given filesystem) the current
> +configuration is being validated to decide whether a userpsace notification
> +should be launched. If there has been no request (in a mean of 'config' file
> +entry) for given event, one will be silently disregarded. If, on the other
> +hand, someone is 'watching' given filesystem for specific events, a generic
> +netlink message will be sent. A dedicated multicast group has been provided
> +solely for this purpose so in order to receive such notifications, one should
> +subscribe to this new multicast group. As for now only the init network
> +namespace is being supported.
> +
> +3.1 Message format
> +
> +The FS_NL_C_EVENT shall be stored within the generic netlink message header
> +as the command field. The message payload will provide more detailed info:
> +the backing device major and minor numbers, the event code and the id of
> +the process which action led to the event occurrence. In case of threshold
> +notifications, the current number of available blocks will be included
> +in the payload as well.
> +
> +
> +  0                   1                   2                   3
> +  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + |                    NETLINK MESSAGE HEADER                     |
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + |                GENERIC NETLINK MESSAGE HEADER                 |
> + |         (with FS_NL_C_EVENT as genlmsghdr cdm field)         |
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + |             Optional user specific message header             |
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + |                   GENERIC MESSAGE PAYLOAD:                    |
> + +---------------------------------------------------------------+
> + |                  FS_NL_A_EVENT_ID (NLA_U32)                   |
> + +---------------------------------------------------------------+
> + |                  FS_NL_A_DEV_MAJOR (NLA_U32)                  |
> + +---------------------------------------------------------------+
> + |                  FS_NL_A_DEV_MINOR (NLA_U32)                  |
> + +---------------------------------------------------------------+
> + |                  FS_NL_A_CAUSED_ID (NLA_U32)                  |
> + +---------------------------------------------------------------+
> + |                    FS_NL_A_DATA (NLA_U64)                     |
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> +
> +
> +The above figure is based on:
> + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format
> +
> +
> +4. API Reference:
> +=================
> +
> + 4.1 Generic file system event interface data & operations
> +
> + #include
> +
> + struct fs_trace_info {
> +     void __rcu *e_priv /* READ ONLY */
> +     unsigned int events_cap_mask; /* Supported notifications */
> +     const struct fs_trace_operations *ops;
> + };
> +
> + struct fs_trace_operations {
> +     void (*query)(struct super_block *, u64 *);
> + };
> +
> + In order to get the fireworks and stuff, each filesystem needs to setup
> + the events_cap_mask field of the fs_trace_info structure, which has been
> + embedded within the super_block structure. This should reflect the type of
> + events the filesystem wants to support. In case of threshold notifications,
> + apart from setting the FS_EVENT_THRESH flag, the 'query' callback should
> + be provided as this enables the events interface to get the up-to-date
> + state of the number of available blocks whenever those notifications are
> + being requested.
> +
> + The 'e_priv' field of the fs_trace_info structure should be completely ignored
> + as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you
> + do not want to get yourself into some real trouble. If still, you are tempted
> + to do so - feel free, it's gonna be pure fun. Consider yourself warned.
> +
> +
> + 4.2 Event notification:
> +
> + #include
> + void fs_event_notify(struct super_block *sb, unsigned int event_id);
> +
> + Notify the generic FS event interface of an occurring event.
> + This shall be used by any file system that wishes to inform any potential
> + listeners/watchers of a particular event.
> + - sb: the filesystem's super block
> + - event_id: an event identifier
> +
> + 4.3 Threshold notifications:
> +
> + #include
> + void fs_event_alloc_space(struct super_block *sb, u64 ncount);
> + void fs_event_free_space(struct super_block *sb, u64 ncount);
> +
> + Each filesystme supporting the threshold notifications should call
> + fs_event_alloc_space/fs_event_free_space respectively whenever the
> + amount of available blocks changes.
> + - sb: the filesystem's super block
> + - ncount: number of blocks being acquired/released
> +
> + Note that to properly handle the threshold notifications the fs events
> + interface needs to be kept up to date by the filesystems. Each should
> + register fs_trace_operations to enable querying the current number of
> + available blocks.
> +
> + 4.4 Sending message through generic netlink interface
> +
> + #include
> +
> + int fs_netlink_send_event(size_t size, unsigned int event_id,
> +	int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata);
> +
> + Although the fs event interface is fully responsible for sending the messages
> + over the netlink, filesystems might use the FS_EVENT multicast group to send
> + their own custom messages.
> + - size: the size of the message payload
> + - event_id: the event identifier
> + - compose_msg: a callback responsible for filling-in the message payload
> + - cbdata: message custom data
> +
> + Calling fs_netlink_send_event will result in a message being sent by
> + the FS_EVENT multicast group. Note that the body of the message should be
> + prepared (set up) by the caller - through compose_msg callback. The message's
> + sk_buff will be allocated on behalf of the caller (thus the size parameter).
> + The compose_msg should only fill the payload with proper data. Unless
> + the event id is specified as FS_EVENT_NONE, its value shall be added
> + to the payload prior to calling the compose_msg.
> +
> +
> diff --git a/fs/Kconfig b/fs/Kconfig
> index ec35851..a89e678 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -69,6 +69,8 @@ config FILE_LOCKING
>  	  for filesystems like NFS and for the flock() system
>  	  call. Disabling this option saves about 11k.
>
> +source "fs/events/Kconfig"
> +
>  source "fs/notify/Kconfig"
>
>  source "fs/quota/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index a88ac48..bcb3048 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
>  obj-$(CONFIG_CEPH_FS) += ceph/
>  obj-$(CONFIG_PSTORE) += pstore/
>  obj-$(CONFIG_EFIVAR_FS) += efivarfs/
> +obj-$(CONFIG_FS_EVENTS) += events/
> diff --git a/fs/events/Kconfig b/fs/events/Kconfig
> new file mode 100644
> index 0000000..1c60195
> --- /dev/null
> +++ b/fs/events/Kconfig
> @@ -0,0 +1,7 @@
> +# Generic File System events interface
> +config FS_EVENTS
> +	bool "Generic filesystem events"
> +	select NET
> +	default y
> +	help
> +	  Enable generic filesystem events interface
> diff --git a/fs/events/Makefile b/fs/events/Makefile
> new file mode 100644
> index 0000000..9c98337
> --- /dev/null
> +++ b/fs/events/Makefile
> @@ -0,0 +1,5 @@
> +#
> +# Makefile for the Linux Generic File System Event Interface
> +#
> +
> +obj-y := fs_event.o fs_event_netlink.o
> diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c
> new file mode 100644
> index 0000000..1037311
> --- /dev/null
> +++ b/fs/events/fs_event.c
> @@ -0,0 +1,809 @@
> +/*
> + * Generic File System Events Interface
> + *
> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2.
> + *
> + * The full GNU General Public License is included in this distribution in the
> + * file called COPYING.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include "../pnode.h"
> +#include "fs_event.h"
> +
> +static LIST_HEAD(fs_trace_list);
> +static DEFINE_MUTEX(fs_trace_lock);
> +
> +static struct kmem_cache *fs_trace_cachep __read_mostly;
> +
> +static atomic_t stray_traces = ATOMIC_INIT(0);
> +static DECLARE_WAIT_QUEUE_HEAD(trace_wq);
> +/*
> + * Threshold notification state bits.
> + * Note the reverse as this refers to the number
> + * of available blocks.
> + */
> +#define THRESH_LR_BELOW	0x0001 /* Falling below the lower range */
> +#define THRESH_LR_BEYOND	0x0002
> +#define THRESH_UR_BELOW	0x0004
> +#define THRESH_UR_BEYOND	0x0008 /* Going beyond the upper range */
> +
> +#define THRESH_LR_ON	(THRESH_LR_BELOW | THRESH_LR_BEYOND)
> +#define THRESH_UR_ON	(THRESH_UR_BELOW | THRESH_UR_BEYOND)
> +
> +#define FS_TRACE_ADD	0x100000
> +
> +struct fs_trace_entry {
> +	struct kref count;
> +	atomic_t active;
> +	struct super_block *sb;
> +	unsigned int notify;
> +	struct path mnt_path;
> +	struct list_head node;
> +
> +	struct fs_event_thresh {
> +		u64 avail_space;
> +		u64 lrange;
> +		u64 urange;
> +		unsigned int state;
> +	} th;
> +	struct rcu_head rcu_head;
> +	spinlock_t lock;
> +};
> +
> +static const match_table_t fs_etypes = {
> +	{ FS_EVENT_GENERIC, "G" },
> +	{ FS_EVENT_THRESH, "T" },
> +	{ 0, NULL },
> +};
> +
> +static inline int fs_trace_query_data(struct super_block *sb,
> +				      struct fs_trace_entry *en)
> +{
> +	if (sb->s_etrace.ops && sb->s_etrace.ops->query) {
> +		sb->s_etrace.ops->query(sb, &en->th.avail_space);
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +static inline void fs_trace_entry_free(struct fs_trace_entry *en)
> +{
> +	kmem_cache_free(fs_trace_cachep, en);
> +}
> +
> +static void fs_destroy_trace_entry(struct kref *en_ref)
> +{
> +	struct fs_trace_entry *en = container_of(en_ref,
> +					struct fs_trace_entry, count);
> +
> +	/* Last reference has been dropped */
> +	fs_trace_entry_free(en);
> +	atomic_dec(&stray_traces);
> +}
> +
> +static void fs_trace_entry_put(struct fs_trace_entry *en)
> +{
> +	kref_put(&en->count, fs_destroy_trace_entry);
> +}
> +
> +static void fs_release_trace_entry(struct rcu_head *rcu_head)
> +{
> +	struct fs_trace_entry *en = container_of(rcu_head,
> +						 struct fs_trace_entry,
> +						 rcu_head);
> +	/*
> +	 * As opposed to typical reference drop, this one is being
> +	 * called from the rcu callback. This is to make sure all
> +	 * readers have managed to safely grab the reference before
> +	 * the change to rcu pointer is visible to all and before
> +	 * the reference is dropped here.
> +	 */
> +	fs_trace_entry_put(en);
> +}
> +
> +static void fs_drop_trace_entry(struct fs_trace_entry *en)
> +{
> +	struct super_block *sb;
> +
> +	lockdep_assert_held(&fs_trace_lock);
> +	/*
> +	 * The trace entry might have already been removed
> +	 * from the list of active traces with the proper
> +	 * ref drop, though it was still in use handling
> +	 * one of the fs events. This means that the object
> +	 * has been already scheduled for being released.
> +	 * So leave...
> +	 */
> +
> +	if (!atomic_add_unless(&en->active, -1, 0))
> +		return;
> +	/*
> +	 * At this point the trace entry is being marked as inactive
> +	 * so no new references will be allowed.
> +	 * Still it might be floating around somewhere
> +	 * so drop the reference when the rcu readers are done.
> +	 */
> +	spin_lock(&en->lock);
> +	list_del(&en->node);
> +	sb = en->sb;
> +	en->sb = NULL;
> +	spin_unlock(&en->lock);
> +
> +	rcu_assign_pointer(sb->s_etrace.e_priv, NULL);
> +	call_rcu(&en->rcu_head, fs_release_trace_entry);
> +	/* It's safe now to drop the reference to the super */
> +	deactivate_super(sb);
> +	atomic_inc(&stray_traces);
> +}
> +
> +static inline
> +struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en)
> +{
> +	if (en) {
> +		if (!kref_get_unless_zero(&en->count))
> +			return NULL;
> +		/* Don't allow referencing inactive object */
> +		if (!atomic_read(&en->active)) {
> +			fs_trace_entry_put(en);
> +			return NULL;
> +		}
> +	}
> +	return en;
> +}
> +
> +static struct fs_trace_entry *fs_trace_entry_get_rcu(struct super_block *sb)
> +{
> +	struct fs_trace_entry *en;
> +
> +	if (!sb)
> +		return NULL;
> +
> +	rcu_read_lock();
> +	en = rcu_dereference(sb->s_etrace.e_priv);
> +	en = fs_trace_entry_get(en);
> +	rcu_read_unlock();
> +
> +	return en;
> +}
> +
> +static int fs_remove_trace_entry(struct super_block *sb)
> +{
> +	struct fs_trace_entry *en;
> +
> +	en = fs_trace_entry_get_rcu(sb);
> +	if (!en)
> +		return -EINVAL;
> +
> +	mutex_lock(&fs_trace_lock);
> +	fs_drop_trace_entry(en);
> +	mutex_unlock(&fs_trace_lock);
> +	fs_trace_entry_put(en);
> +	return 0;
> +}
> +
> +static void fs_remove_all_traces(void)
> +{
> +	struct fs_trace_entry *en, *guard;
> +
> +	mutex_lock(&fs_trace_lock);
> +	list_for_each_entry_safe(en, guard, &fs_trace_list, node)
> +		fs_drop_trace_entry(en);
> +	mutex_unlock(&fs_trace_lock);
> +}
> +
> +static int create_common_msg(struct sk_buff *skb, void *data)
> +{
> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
> +	struct super_block *sb = en->sb;
> +
> +	if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
> +	|| nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
> +		return -EINVAL;

What about diskless (nfs, cifs, etc.) filesystems?
btrfs also has no valid sb->s_dev

> +
> +	if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current))))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int create_thresh_msg(struct sk_buff *skb, void *data)
> +{
> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
> +	int ret;
> +
> +	ret = create_common_msg(skb, data);
> +	if (!ret)
> +		ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space);
> +	return ret;
> +}
> +
> +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id)
> +{
> +	size_t size = nla_total_size(sizeof(u32)) * 2 +
> +		      nla_total_size(sizeof(u64));
> +
> +	fs_netlink_send_event(size, event_id, create_common_msg, en);
> +}
> +
> +static void fs_event_send_thresh(struct fs_trace_entry *en,
> +				 unsigned int event_id)
> +{
> +	size_t size = nla_total_size(sizeof(u32)) * 2 +
> +		      nla_total_size(sizeof(u64)) * 2;
> +
> +	fs_netlink_send_event(size, event_id, create_thresh_msg, en);
> +}
> +
> +void fs_event_notify(struct super_block *sb, unsigned int event_id)
> +{
> +	struct fs_trace_entry *en;
> +
> +	en = fs_trace_entry_get_rcu(sb);
> +	if (!en)
> +		return;
> +
> +	spin_lock(&en->lock);
> +	if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC))
> +		fs_event_send(en, event_id);
> +	spin_unlock(&en->lock);
> +	fs_trace_entry_put(en);
> +}
> +EXPORT_SYMBOL(fs_event_notify);
> +
> +void fs_event_alloc_space(struct super_block *sb, u64 ncount)
> +{
> +	struct fs_trace_entry *en;
> +	s64 count;
> +
> +	en = fs_trace_entry_get_rcu(sb);
> +	if (!en)
> +		return;
> +
> +	spin_lock(&en->lock);
> +
> +	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
> +		goto leave;
> +	/*
> +	 * we shouldn't drop below 0 here,
> +	 * unless there is a sync issue somewhere (?)
> +	 */
> +	count = en->th.avail_space - ncount;
> +	en->th.avail_space = count < 0 ? 0 : count;
> +
> +	if (en->th.avail_space > en->th.lrange)
> +		/* Not 'even' close - leave */
> +		goto leave;
> +
> +	if (en->th.avail_space > en->th.urange) {
> +		/* Close enough - the lower range has been reached */
> +		if (!(en->th.state & THRESH_LR_BEYOND)) {
> +			/* Send notification */
> +			fs_event_send_thresh(en, FS_THR_LRBELOW);
> +			en->th.state &= ~THRESH_LR_BELOW;
> +			en->th.state |= THRESH_LR_BEYOND;
> +		}
> +		goto leave;
> +	}
> +	if (!(en->th.state & THRESH_UR_BEYOND)) {
> +		fs_event_send_thresh(en, FS_THR_URBELOW);
> +		en->th.state &= ~THRESH_UR_BELOW;
> +		en->th.state |= THRESH_UR_BEYOND;
> +	}
> +
> +leave:
> +	spin_unlock(&en->lock);
> +	fs_trace_entry_put(en);
> +}
> +EXPORT_SYMBOL(fs_event_alloc_space);
> +
> +void fs_event_free_space(struct super_block *sb, u64 ncount)
> +{
> +	struct fs_trace_entry *en;
> +
> +	en = fs_trace_entry_get_rcu(sb);
> +	if (!en)
> +		return;
> +
> +	spin_lock(&en->lock);
> +
> +	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
> +		goto leave;
> +
> +	en->th.avail_space += ncount;
> +
> +	if (en->th.avail_space > en->th.lrange) {
> +		if (!(en->th.state & THRESH_LR_BELOW)
> +		&& en->th.state & THRESH_LR_BEYOND) {
> +			/* Send notification */
> +			fs_event_send_thresh(en, FS_THR_LRABOVE);
> +			en->th.state &= ~(THRESH_LR_BEYOND|THRESH_UR_BEYOND);
> +			en->th.state |= THRESH_LR_BELOW;
> +			goto leave;
> +		}
> +	}
> +	if (en->th.avail_space > en->th.urange) {
> +		if (!(en->th.state & THRESH_UR_BELOW)
> +		&& en->th.state & THRESH_UR_BEYOND) {
> +			/* Notify */
> +			fs_event_send_thresh(en, FS_THR_URABOVE);
> +			en->th.state &= ~THRESH_UR_BEYOND;
> +			en->th.state |= THRESH_UR_BELOW;
> +		}
> +	}
> +leave:
> +	spin_unlock(&en->lock);
> +	fs_trace_entry_put(en);
> +}
> +EXPORT_SYMBOL(fs_event_free_space);
> +
> +void fs_event_mount_dropped(struct vfsmount *mnt)
> +{
> +	/*
> +	 * The mount is dropped but the super might not get released
> +	 * at once so there is very small chance some notifications
> +	 * will come through.
> +	 * Note that the mount being dropped here might belong to a different
> +	 * namespace - if this is the case, just ignore it.
> +	 */
> +	struct fs_trace_entry *en = fs_trace_entry_get_rcu(mnt->mnt_sb);
> +	struct vfsmount *en_mnt;
> +
> +	if (!en || !atomic_read(&en->active))
> +		return;
> +	/*
> +	 * The entry once set, does not change the mountpoint it's being
> +	 * pinned to, so no need to take the lock here.
> +	 */
> +	en_mnt = en->mnt_path.mnt;
> +	if (!(real_mount(mnt)->mnt_ns != (real_mount(en_mnt))->mnt_ns))
> +		fs_remove_trace_entry(mnt->mnt_sb);
> +	fs_trace_entry_put(en);
> +}
> +
> +static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh,
> +			      unsigned int nmask)
> +{
> +	struct fs_trace_entry *en;
> +	struct super_block *sb;
> +	struct mount *r_mnt;
> +
> +	en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL);
> +	if (unlikely(!en))
> +		return -ENOMEM;
> +	/*
> +	 * Note that no reference is being taken here for the path as it would
> +	 * make the unmount unnecessarily puzzling (due to an extra 'valid'
> +	 * reference for the mnt).
> +	 * This is *rather* safe as the notification on mount being dropped
> +	 * will get called prior to releasing the super block - so right
> +	 * in time to perform appropriate clean-up
> +	 */
> +	r_mnt = real_mount(path->mnt);
> +
> +	en->mnt_path.dentry = r_mnt->mnt.mnt_root;
> +	en->mnt_path.mnt = &r_mnt->mnt;
> +
> +	sb = path->mnt->mnt_sb;
> +	en->sb = sb;
> +	/*
> +	 * Increase the refcount for sb to mark it's being relied on.
> +	 * Note that the reference to path is taken by the caller, so it
> +	 * is safe to assume there is at least single active reference
> +	 * to super as well.
> +	 */
> +	atomic_inc(&sb->s_active);
> +
> +	nmask &= sb->s_etrace.events_cap_mask;
> +	if (!nmask)
> +		goto leave;
> +
> +	spin_lock_init(&en->lock);
> +	INIT_LIST_HEAD(&en->node);
> +
> +	en->notify = nmask;
> +	memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state));
> +	if (nmask & FS_EVENT_THRESH)
> +		fs_trace_query_data(sb, en);
> +
> +	kref_init(&en->count);
> +
> +	if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) {
> +		struct fs_trace_entry *prev_en;
> +
> +		prev_en = fs_trace_entry_get_rcu(sb);
> +		if (prev_en) {
> +			WARN_ON(prev_en);
> +			fs_trace_entry_put(prev_en);
> +			goto leave;
> +		}
> +	}
> +	atomic_set(&en->active, 1);
> +
> +	mutex_lock(&fs_trace_lock);
> +	list_add(&en->node, &fs_trace_list);
> +	mutex_unlock(&fs_trace_lock);
> +
> +	rcu_assign_pointer(sb->s_etrace.e_priv, en);
> +	synchronize_rcu();
> +
> +	return 0;
> +leave:
> +	deactivate_super(sb);
> +	kmem_cache_free(fs_trace_cachep, en);
> +	return -EINVAL;
> +}
> +
> +static int fs_update_trace_entry(struct path *path,
> +				 struct fs_event_thresh *thresh,
> +				 unsigned int nmask)
> +{
> +	struct fs_trace_entry *en;
> +	struct super_block *sb;
> +	int extend = nmask & FS_TRACE_ADD;
> +	int ret = -EINVAL;
> +
> +	en = fs_trace_entry_get_rcu(path->mnt->mnt_sb);
> +	if (!en)
> +		return (extend) ? fs_new_trace_entry(path, thresh, nmask)
> +				: -EINVAL;
> +
> +	if (!atomic_read(&en->active))
> +		return -EINVAL;
> +
> +	nmask &= ~FS_TRACE_ADD;
> +
> +	spin_lock(&en->lock);
> +	sb = en->sb;
> +	if (!sb || !(nmask & sb->s_etrace.events_cap_mask))
> +		goto leave;
> +
> +	if (nmask & FS_EVENT_THRESH) {
> +		if (extend) {
> +			/* Get the current state */
> +			if (!(en->notify & FS_EVENT_THRESH))
> +				if (fs_trace_query_data(sb, en))
> +					goto leave;
> +
> +			if (thresh->state & THRESH_LR_ON) {
> +				en->th.lrange = thresh->lrange;
> +				en->th.state &= ~THRESH_LR_ON;
> +			}
> +
> +			if (thresh->state & THRESH_UR_ON) {
> +				en->th.urange = thresh->urange;
> +				en->th.state &= ~THRESH_UR_ON;
> +			}
> +		} else {
> +			memset(&en->th, 0, sizeof(en->th));
> +		}
> +	}
> +
> +	if (extend)
> +		en->notify |= nmask;
> +	else
> +		en->notify &= ~nmask;
> +	ret = 0;
> +leave:
> +	spin_unlock(&en->lock);
> +	fs_trace_entry_put(en);
> +	return ret;
> +}
> +
> +static int fs_parse_trace_request(int argc, char **argv)
> +{
> +	struct fs_event_thresh thresh = {0};
> +	struct path path;
> +	substring_t args[MAX_OPT_ARGS];
> +	unsigned int nmask = FS_TRACE_ADD;
> +	int token;
> +	char *s;
> +	int ret = -EINVAL;
> +
> +	if (!argc) {
> +		fs_remove_all_traces();
> +		return 0;
> +	}
> +
> +	s = *(argv);
> +	if (*s == '!') {
> +		/* Clear the trace entry */
> +		nmask &= ~FS_TRACE_ADD;
> +		++s;
> +	}
> +
> +	if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW))
> +		return -EINVAL;
> +
> +	if (!(--argc)) {
> +		if (!(nmask & FS_TRACE_ADD))
> +			ret = fs_remove_trace_entry(path.mnt->mnt_sb);
> +		goto leave;
> +	}
> +
> +repeat:
> +	args[0].to = args[0].from = NULL;
> +	token = match_token(*(++argv), fs_etypes, args);
> +	if (!token && !nmask)
> +		goto leave;
> +
> +	nmask |= token & FS_EVENTS_ALL;
> +	--argc;
> +	if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) {
> +		/*
> +		 * Get the threshold config data:
> +		 * lower range
> +		 * upper range
> +		 */
> +		if (!argc)
> +			goto leave;
> +
> +		ret = kstrtoull(*(++argv), 10, &thresh.lrange);
> +		if (ret)
> +			goto leave;
> +		thresh.state |= THRESH_LR_ON;
> +		if ((--argc)) {
> +			ret = kstrtoull(*(++argv), 10, &thresh.urange);
> +			if (ret)
> +				goto leave;
> +			thresh.state |= THRESH_UR_ON;
> +			--argc;
> +		}
> +		/* The thresholds are based on number of available blocks */
> +		if (thresh.lrange < thresh.urange) {
> +			ret = -EINVAL;
> +			goto leave;
> +		}
> +	}
> +	if (argc)
> +		goto repeat;
> +
> +	ret = fs_update_trace_entry(&path, &thresh, nmask);
> +leave:
> +	path_put(&path);
> +	return ret;
> +}
> +
> +#define DEFAULT_BUF_SIZE PAGE_SIZE
> +
> +static ssize_t fs_trace_write(struct file *file, const char __user *buffer,
> +			      size_t count, loff_t *ppos)
> +{
> +	char **argv;
> +	char *kern_buf, *next, *cfg;
> +	size_t size, dcount = 0;
> +	int argc;
> +
> +	if (!count)
> +		return 0;
> +
> +	kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL);
> +	if (!kern_buf)
> +		return -ENOMEM;
> +
> +	while (dcount < count) {
> +
> +		size = count - dcount;
> +		if (size >= DEFAULT_BUF_SIZE)
> +			size = DEFAULT_BUF_SIZE - 1;
> +		if (copy_from_user(kern_buf, buffer + dcount, size)) {
> +			dcount = -EINVAL;
> +			goto leave;
> +		}
> +
> +		kern_buf[size] = '\0';
> +
> +		next = cfg = kern_buf;
> +
> +		do {
> +			next = strchr(cfg, ';');
> +			if (next)
> +				*next = '\0';
> +
> +			argv = argv_split(GFP_KERNEL, cfg, &argc);
> +			if (!argv) {
> +				dcount = -ENOMEM;
> +				goto leave;
> +			}
> +
> +			if (fs_parse_trace_request(argc, argv)) {
> +				dcount = -EINVAL;
> +				argv_free(argv);
> +				goto leave;
> +			}
> +
> +			argv_free(argv);
> +			if (next)
> +				cfg = ++next;
> +
> +		} while (next);
> +		dcount += size;
> +	}
> +leave:
> +	kfree(kern_buf);
> +	return dcount;
> +}
> +
> +static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos)
> +{
> +	mutex_lock(&fs_trace_lock);
> +	return seq_list_start(&fs_trace_list, *pos);
> +}
> +
> +static void *fs_trace_seq_next(struct seq_file *m, void *v, loff_t *pos)
> +{
> +	return seq_list_next(v, &fs_trace_list, pos);
> +}
> +
> +static void fs_trace_seq_stop(struct seq_file *m, void *v)
> +{
> +	mutex_unlock(&fs_trace_lock);
> +}
> +
> +static int fs_trace_seq_show(struct seq_file *m, void *v)
> +{
> +	struct fs_trace_entry *en;
> +	struct super_block *sb;
> +	struct mount *r_mnt;
> +	const struct match_token *match;
> +	unsigned int nmask;
> +
> +	en = list_entry(v, struct fs_trace_entry, node);
> +	/* Do not show the entries outside current mount namespace */
> +	r_mnt = real_mount(en->mnt_path.mnt);
> +	if (r_mnt->mnt_ns != current->nsproxy->mnt_ns) {
> +		if (!__is_local_mountpoint(r_mnt->mnt_mountpoint))
> +			return 0;
> +	}
> +
> +	sb = en->sb;
> +
> +	seq_path(m, &en->mnt_path, "\t\n\\");
> +	seq_putc(m, ' ');
> +
> +	seq_escape(m, sb->s_type->name, " \t\n\\");
> +	if (sb->s_subtype && sb->s_subtype[0]) {
> +		seq_putc(m, '.');
> +		seq_escape(m, sb->s_subtype, " \t\n\\");
> +	}
> +
> +	seq_putc(m, ' ');
> +	if (sb->s_op->show_devname) {
> +		sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root);
> +	} else {
> +		seq_escape(m, r_mnt->mnt_devname ? r_mnt->mnt_devname : "none",
> +			   " \t\n\\");
> +	}
> +	seq_puts(m, " (");
> +
> +	nmask = en->notify;
> +	for (match = fs_etypes; match->pattern; ++match) {
> +		if (match->token & nmask) {
> +			seq_puts(m, match->pattern);
> +			nmask &= ~match->token;
> +			if (nmask)
> +				seq_putc(m, ',');
> +		}
> +	}
> +	seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange);
> +	seq_puts(m, ")\n");
> +	return 0;
> +}
> +
> +static const struct seq_operations fs_trace_seq_ops = {
> +	.start = fs_trace_seq_start,
> +	.next = fs_trace_seq_next,
> +	.stop = fs_trace_seq_stop,
> +	.show = fs_trace_seq_show,
> +};
> +
> +static int fs_trace_open(struct inode *inode, struct file *file)
> +{
> +	return seq_open(file, &fs_trace_seq_ops);
> +}
> +
> +static const struct file_operations fs_trace_fops = {
> +	.owner = THIS_MODULE,
> +	.open = fs_trace_open,
> +	.write = fs_trace_write,
> +	.read = seq_read,
> +	.llseek = seq_lseek,
> +	.release = seq_release,
> +};
> +
> +static int fs_trace_init(void)
> +{
> +	fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0);
> +	if (!fs_trace_cachep)
> +		return -EINVAL;
> +	init_waitqueue_head(&trace_wq);
> +	return 0;
> +}
> +
> +/* VFS support */
> +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen)
> +{
> +	int ret;
> +	static struct tree_descr desc[] = {
> +		[2] = {
> +			.name = "config",
> +			.ops = &fs_trace_fops,
> +			.mode = S_IWUSR | S_IRUGO,
> +		},
> +		{""},
> +	};
> +
> +	ret = simple_fill_super(sb, 0x7246332, desc);
> +	return !ret ? fs_trace_init() : ret;
> +}
> +
> +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type,
> +		int ntype, const char *dev_name, void *data)
> +{
> +	return mount_single(fs_type, ntype, data, fs_trace_fill_super);
> +}
> +
> +static void fs_trace_kill_super(struct super_block *sb)
> +{
> +	/*
> +	 * The rcu_barrier here will/should make sure all call_rcu
> +	 * callbacks are completed - still there might be some active
> +	 * trace objects in use which can make calling the
> +	 * kmem_cache_destroy unsafe. So we wait until all traces
> +	 * are finally released.
> +	 */
> +	fs_remove_all_traces();
> +	rcu_barrier();
> +	wait_event(trace_wq, !atomic_read(&stray_traces));
> +
> +	kmem_cache_destroy(fs_trace_cachep);
> +	kill_litter_super(sb);
> +}
> +
> +static struct kset *fs_trace_kset;
> +
> +static struct file_system_type fs_trace_fstype = {
> +	.name = "fstrace",
> +	.mount = fs_trace_do_mount,
> +	.kill_sb = fs_trace_kill_super,
> +};
> +
> +static void __init fs_trace_vfs_init(void)
> +{
> +	fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj);
> +
> +	if (!fs_trace_kset)
> +		return;
> +
> +	if (!register_filesystem(&fs_trace_fstype)) {
> +		if (!fs_event_netlink_register())
> +			return;
> +		unregister_filesystem(&fs_trace_fstype);
> +	}
> +	kset_unregister(fs_trace_kset);
> +}
> +
> +static int __init fs_trace_evens_init(void)
> +{
> +	fs_trace_vfs_init();
> +	return 0;
> +};
> +module_init(fs_trace_evens_init);
> +
> diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h
> new file mode 100644
> index 0000000..23f24c8
> --- /dev/null
> +++ b/fs/events/fs_event.h
> @@ -0,0 +1,22 @@
> +/*
> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2.
> + *
> + * The full GNU General Public License is included in this distribution in the
> + * file called COPYING.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef __GENERIC_FS_EVENTS_H
> +#define __GENERIC_FS_EVENTS_H
> +
> +int fs_event_netlink_register(void);
> +void fs_event_netlink_unregister(void);
> +
> +#endif /* __GENERIC_FS_EVENTS_H */
> diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c
> new file mode 100644
> index 0000000..0c97eb7
> --- /dev/null
> +++ b/fs/events/fs_event_netlink.c
> @@ -0,0 +1,104 @@
> +/*
> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2.
> + *
> + * The full GNU General Public License is included in this distribution in the
> + * file called COPYING.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include "fs_event.h"
> +
> +static const struct genl_multicast_group fs_event_mcgroups[] = {
> +	{ .name = FS_EVENTS_MCAST_GRP_NAME, },
> +};
> +
> +static struct genl_family fs_event_family = {
> +	.id = GENL_ID_GENERATE,
> +	.name = FS_EVENTS_FAMILY_NAME,
> +	.version = 1,
> +	.maxattr = FS_NL_A_MAX,
> +	.mcgrps = fs_event_mcgroups,
> +	.n_mcgrps = ARRAY_SIZE(fs_event_mcgroups),
> +};
> +
> +int fs_netlink_send_event(size_t size, unsigned int event_id,
> +			  int (*compose_msg)(struct sk_buff *skb, void *data),
> +			  void *cbdata)
> +{
> +	static atomic_t seq;
> +	struct sk_buff *skb;
> +	void *msg_head;
> +	int ret = 0;
> +
> +	if (!size || !compose_msg)
> +		return -EINVAL;
> +
> +	/* Skip if there are no listeners */
> +	if (!genl_has_listeners(&fs_event_family, &init_net, 0))
> +		return 0;
> +
> +	if (event_id != FS_EVENT_NONE)
> +		size += nla_total_size(sizeof(u32));
> +	size += nla_total_size(sizeof(u64));
> +	skb = genlmsg_new(size, GFP_NOWAIT);
> +
> +	if (!skb) {
> +		pr_debug("Failed to allocate new FS generic netlink message\n");
> +		return -ENOMEM;
> +	}
> +
> +	msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq),
> +			       &fs_event_family, 0, FS_NL_C_EVENT);
> +	if (!msg_head)
> +		goto cleanup;
> +
> +	if (event_id != FS_EVENT_NONE)
> +		if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id))
> +			goto cancel;
> +
> +	ret = compose_msg(skb, cbdata);
> +	if (ret)
> +		goto cancel;
> +
> +	genlmsg_end(skb, msg_head);
> +	ret = genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT);
> +	if (ret && ret != -ENOBUFS && ret != -ESRCH)
> +		goto cleanup;
> +
> +	return ret;
> +
> +cancel:
> +	genlmsg_cancel(skb, msg_head);
> +cleanup:
> +	nlmsg_free(skb);
> +	return ret;
> +}
> +EXPORT_SYMBOL(fs_netlink_send_event);
> +
> +int fs_event_netlink_register(void)
> +{
> +	int ret;
> +
> +	ret = genl_register_family(&fs_event_family);
> +	if (ret)
> +		pr_err("Failed to register FS netlink interface\n");
> +	return ret;
> +}
> +
> +void fs_event_netlink_unregister(void)
> +{
> +	genl_unregister_family(&fs_event_family);
> +}
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 82ef140..ec6e2ef 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt)
>  	if (unlikely(mnt->mnt_pins.first))
>  		mnt_pin_kill(mnt);
>  	fsnotify_vfsmount_delete(&mnt->mnt);
> +	fs_event_mount_dropped(&mnt->mnt);
>  	dput(mnt->mnt.mnt_root);
>  	deactivate_super(mnt->mnt.mnt_sb);
>  	mnt_free_id(mnt);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b4d71b5..b7dadd9 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -263,6 +263,10 @@ struct iattr {
>   * Includes for diskquotas.
>   */
>  #include
> +/*
> + * Include for Generic File System Events Interface
> + */
> +#include
>
>  /*
>   * Maximum number of layers of fs stack. Needs to be limited to
> @@ -1253,7 +1257,7 @@ struct super_block {
>  	struct hlist_node s_instances;
>  	unsigned int s_quota_types; /* Bitmask of supported quota types */
>  	struct quota_info s_dquot; /* Diskquota specific options */
> -
> +	struct fs_trace_info s_etrace;
>  	struct sb_writers s_writers;
>
>  	char s_id[32]; /* Informational name */
> diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h
> new file mode 100644
> index 0000000..83e22dd
> --- /dev/null
> +++ b/include/linux/fs_event.h
> @@ -0,0 +1,72 @@
> +/*
> + * Generic File System Events Interface
> + *
> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2.
> + *
> + * The full GNU General Public License is included in this distribution in the
> + * file called COPYING.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +#ifndef _LINUX_GENERIC_FS_EVETS_
> +#define _LINUX_GENERIC_FS_EVETS_
> +#include
> +#include
> +
> +/*
> + * Currently supported event types
> + */
> +#define FS_EVENT_GENERIC 0x001
> +#define FS_EVENT_THRESH 0x002
> +
> +#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH)
> +
> +struct fs_trace_operations {
> +	void (*query)(struct super_block *, u64 *);
> +};
> +
> +struct fs_trace_info {
> +	void __rcu *e_priv; /* READ ONLY */
> +	unsigned int events_cap_mask; /* Supported notifications */
> +	const struct fs_trace_operations *ops;
> +};
> +
> +#ifdef CONFIG_FS_EVENTS
> +
> +void fs_event_notify(struct super_block *sb, unsigned int event_id);
> +void fs_event_alloc_space(struct super_block *sb, u64 ncount);
> +void fs_event_free_space(struct super_block *sb, u64 ncount);
> +void fs_event_mount_dropped(struct vfsmount *mnt);
> +
> +int fs_netlink_send_event(size_t size, unsigned int event_id,
> +			  int (*compose_msg)(struct sk_buff *skb, void *data),
> +			  void *cbdata);
> +
> +#else /* CONFIG_FS_EVENTS */
> +
> +static inline
> +void fs_event_notify(struct super_block *sb, unsigned int event_id) {};
> +static inline
> +void fs_event_alloc_space(struct super_block *sb, u64 ncount) {};
> +static inline
> +void fs_event_free_space(struct super_block *sb, u64 ncount) {};
> +static inline
> +void fs_event_mount_dropped(struct vfsmount *mnt) {};
> +
> +static inline
> +int fs_netlink_send_event(size_t size, unsigned int event_id,
> +			  int (*compose_msig)(struct sk_buff *skb, void *data),
> +			  void *cbdata)
> +{
> +	return -ENOSYS;
> +}
> +#endif /* CONFIG_FS_EVENTS */
> +
> +#endif /* _LINUX_GENERIC_FS_EVENTS_ */
> +
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 68ceb97..dae0fab 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -129,6 +129,7 @@ header-y += firewire-constants.h
>  header-y += flat.h
>  header-y += fou.h
>  header-y += fs.h
> +header-y += fs_event.h
>  header-y += fsl_hypervisor.h
>  header-y += fuse.h
>  header-y += futex.h
> diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h
> new file mode 100644
> index 0000000..d8b07da
> --- /dev/null
> +++ b/include/uapi/linux/fs_event.h
> @@ -0,0 +1,58 @@
> +/*
> + * Generic netlink support for Generic File System Events Interface
> + *
> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2.
> + *
> + * The full GNU General Public License is included in this distribution in the
> + * file called COPYING.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_
> +#define _UAPI_LINUX_GENERIC_FS_EVENTS_
> +
> +#define FS_EVENTS_FAMILY_NAME "fs_event"
> +#define FS_EVENTS_MCAST_GRP_NAME "fs_event_mc_grp"
> +
> +/*
> + * Generic netlink attribute types
> + */
> +enum {
> +	FS_NL_A_NONE,
> +	FS_NL_A_EVENT_ID,
> +	FS_NL_A_DEV_MAJOR,
> +	FS_NL_A_DEV_MINOR,
> +	FS_NL_A_CAUSED_ID,
> +	FS_NL_A_DATA,
> +	__FS_NL_A_MAX,
> +};
> +#define FS_NL_A_MAX (__FS_NL_A_MAX - 1)
> +/*
> + * Generic netlink commands
> + */
> +#define FS_NL_C_EVENT 1
> +
> +/*
> + * Supported set of FS events
> + */
> +enum {
> +	FS_EVENT_NONE,
> +	FS_WARN_ENOSPC, /* No space left to reserve data blks */
> +	FS_WARN_ENOSPC_META, /* No space left for metadata */
> +	FS_THR_LRBELOW, /* The threshold lower range has been reached */
> +	FS_THR_LRABOVE, /* The threshold lower range re-activated */
> +	FS_THR_URBELOW,
> +	FS_THR_URABOVE,
> +	FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */
> +	FS_ERR_CORRUPTED /* Critical error - fs corrupted */
> +
> +};
> +
> +#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */
> +
> --
> 1.7.9.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Beata Michalska
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Wed, 24 Jun 2015 17:31:06 +0200
Message-ID: <558ACD3A.2020508@samsung.com>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <87oak5ebmx.fsf@openvz.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Return-path:
In-reply-to: <87oak5ebmx.fsf@openvz.org>
Sender: owner-linux-mm@kvack.org
To: Dmitry Monakhov
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org
List-Id: linux-api@vger.kernel.org

On 06/24/2015 10:47 AM, Dmitry Monakhov wrote:
> Beata Michalska writes:
>
>> Introduce configurable generic interface for file
>> system-wide event notifications, to provide file
>> systems with a common way of reporting any potential
>> issues as they emerge.
>>
>> The notifications are to be issued through generic
>> netlink interface by newly introduced multicast group.
>> >> Threshold notifications have been included, allowing >> triggering an event whenever the amount of free space drops >> below a certain level - or levels to be more precise as two >> of them are being supported: the lower and the upper range. >> The notifications work both ways: once the threshold level >> has been reached, an event shall be generated whenever >> the number of available blocks goes up again re-activating >> the threshold. >> >> The interface has been exposed through a vfs. Once mounted, >> it serves as an entry point for the set-up where one can >> register for particular file system events. >> >> Signed-off-by: Beata Michalska >> --- >> Documentation/filesystems/events.txt | 232 ++++++++++ >> fs/Kconfig | 2 + >> fs/Makefile | 1 + >> fs/events/Kconfig | 7 + >> fs/events/Makefile | 5 + >> fs/events/fs_event.c | 809 ++++++++++++++++++++++++++++++++++ >> fs/events/fs_event.h | 22 + >> fs/events/fs_event_netlink.c | 104 +++++ >> fs/namespace.c | 1 + >> include/linux/fs.h | 6 +- >> include/linux/fs_event.h | 72 +++ >> include/uapi/linux/Kbuild | 1 + >> include/uapi/linux/fs_event.h | 58 +++ >> 13 files changed, 1319 insertions(+), 1 deletion(-) >> create mode 100644 Documentation/filesystems/events.txt >> create mode 100644 fs/events/Kconfig >> create mode 100644 fs/events/Makefile >> create mode 100644 fs/events/fs_event.c >> create mode 100644 fs/events/fs_event.h >> create mode 100644 fs/events/fs_event_netlink.c >> create mode 100644 include/linux/fs_event.h >> create mode 100644 include/uapi/linux/fs_event.h >> >> diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt >> new file mode 100644 >> index 0000000..c2e6227 >> --- /dev/null >> +++ b/Documentation/filesystems/events.txt >> @@ -0,0 +1,232 @@ >> + >> + Generic file system event notification interface >> + >> +Document created 23 April 2015 by Beata Michalska >> + >> +1. 
The reason behind:
>> +=====================
>> +
>> +There are many corner cases when things might get messy with the filesystems.
>> +And it is not always obvious what went wrong and when. Sometimes you might
>> +get some subtle hints that there is something going on - but by the time
>> +you realise it, it might be too late as you are already out-of-space
>> +or the filesystem has been remounted as read-only (for example). The generic
>> +interface for the filesystem events fills the gap by providing a rather
>> +easy way of real-time notifications triggered whenever something interesting
>> +happens, allowing filesystems to report events in a common way, as they occur.
>> +
>> +2. How does it work:
>> +====================
>> +
>> +The interface itself has been exposed as fstrace-type Virtual File System,
>> +primarily to ease the process of setting up the configuration for the
>> +notifications. So for starters, it needs to get mounted (obviously):
>> +
>> +	mount -t fstrace none /sys/fs/events
>> +
>> +This will unveil the single fstrace filesystem entry - the 'config' file,
>> +through which the notifications are set up.
>> +
>> +Activating notifications for particular filesystem is as straightforward
>> +as writing into the 'config' file. Note that by default all events, regardless
>> +of the actual filesystem type, are disregarded.
>> +
>> +Synopsis of config:
>> +------------------
>> +
>> +	MOUNT EVENT_TYPE [L1] [L2]
>> +
>> +	MOUNT      : the filesystem's mount point
>> +	EVENT_TYPE : event types - currently two of them are being supported:
>> +
>> +	 * generic events ("G") covering most common warnings
>> +	   and errors that might be reported by any filesystem;
>> +	   this option does not take any arguments;
>> +
>> +	 * threshold notifications ("T") - events sent whenever
>> +	   the amount of available space drops below certain level;
>> +	   it is possible to specify two threshold levels though
>> +	   only one is required to properly set up the notifications;
>> +	   as those refer to the number of available blocks, the lower
>> +	   level [L1] needs to be higher than the upper one [L2]
>> +
>> +Sample request could look like the following:
>> +
>> +	echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
>> +
>> +Multiple requests might be specified provided they are separated with a semicolon.
>> +
>> +The configuration itself might be modified at any time. One can add/remove
>> +particular event types for given filesystem, modify the threshold levels,
>> +and remove single or all entries from the 'config' file.
>> +
>> + - Adding new event type:
>> +
>> +	$ echo MOUNT EVENT_TYPE > /sys/fs/events/config
>> +
>> +(Note that it is enough to provide the event type to be enabled, without
>> +the already set ones.)
>> +
>> + - Removing event type:
>> +
>> +	$ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config
>> +
>> + - Updating threshold limits:
>> +
>> +	$ echo MOUNT T L1 L2 > /sys/fs/events/config
>> +
>> + - Removing single entry:
>> +
>> +	$ echo '!MOUNT' > /sys/fs/events/config
>> +
>> + - Removing all entries:
>> +
>> +	$ echo > /sys/fs/events/config
>> +
>> +Reading the file will list all registered entries with their current set-up
>> +along with some additional info like the filesystem type and the backing device
>> +name if available.
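For anyone scripting against the 'config' file, the two-level threshold form from the synopsis above could be composed roughly as in the sketch below. This is only an illustration: fs_event_format_threshold() is a hypothetical helper that does not appear anywhere in the patch, and rejecting L1 == L2 is my reading of "needs to be higher", not something the documentation states outright. The single-level form is not covered.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): format a two-level
 * threshold request "MOUNT T L1 L2" for the 'config' file.
 *
 * Returns the number of characters written, or -1 when the levels are
 * inconsistent or the buffer is too small.
 */
static int fs_event_format_threshold(char *buf, size_t len, const char *mount,
				     unsigned long long l1,
				     unsigned long long l2)
{
	int n;

	/*
	 * Both levels count available blocks, so the lower range (L1)
	 * has to be the larger number. Equality is rejected here as an
	 * assumption.
	 */
	if (l1 <= l2)
		return -1;

	n = snprintf(buf, len, "%s T %llu %llu", mount, l1, l2);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}
```

The resulting string would then be written to /sys/fs/events/config in a single write(), the same way the echo examples do it.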
>> +
>> +A final, though very important, note on the configuration: when and if the
>> +actual events are being triggered falls way beyond the scope of the generic
>> +filesystem events interface. It is up to a particular filesystem
>> +implementation which events are to be supported - if any at all. So if
>> +given filesystem does not support the event notifications, an attempt to
>> +enable those through 'config' file will fail.
>> +
>> +
>> +3. The generic netlink interface support:
>> +=========================================
>> +
>> +Whenever an event notification is triggered (by given filesystem) the current
>> +configuration is being validated to decide whether a userspace notification
>> +should be launched. If there has been no request (by means of a 'config' file
>> +entry) for given event, it will be silently disregarded. If, on the other
>> +hand, someone is 'watching' given filesystem for specific events, a generic
>> +netlink message will be sent. A dedicated multicast group has been provided
>> +solely for this purpose so in order to receive such notifications, one should
>> +subscribe to this new multicast group. As for now only the init network
>> +namespace is being supported.
>> +
>> +3.1 Message format
>> +
>> +The FS_NL_C_EVENT shall be stored within the generic netlink message header
>> +as the command field. The message payload will provide more detailed info:
>> +the backing device major and minor numbers, the event code and the id of
>> +the process whose action led to the event occurrence. In case of threshold
>> +notifications, the current number of available blocks will be included
>> +in the payload as well.
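To make the payload description a bit more concrete, a userspace listener might pick the common attributes out of a received message along these lines. This is a sketch under several assumptions: the buffer is taken to start right after the generic netlink header, the attributes are taken to be plain host-order u32 values with no flag bits set in the type field, and struct fs_event_info, struct tlv_hdr and fs_event_parse() are made-up names for illustration only. A real listener would more likely rely on an existing netlink library than walk the attributes by hand.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attribute ids, copied from the uapi header proposed in this patch. */
enum {
	FS_NL_A_NONE,
	FS_NL_A_EVENT_ID,
	FS_NL_A_DEV_MAJOR,
	FS_NL_A_DEV_MINOR,
	FS_NL_A_CAUSED_ID,
	FS_NL_A_DATA,
};

/* Hypothetical userspace view of the common part of one event. */
struct fs_event_info {
	uint32_t event_id;
	uint32_t dev_major;
	uint32_t dev_minor;
};

/* Same layout as a netlink attribute header: length (header included), type. */
struct tlv_hdr {
	uint16_t len;
	uint16_t type;
};

/* Netlink attributes are padded to a 4-byte boundary. */
#define TLV_ALIGN(n) (((n) + 3u) & ~3u)

/*
 * Walk the attribute area and collect the three u32 attributes.
 * Unknown or differently sized attributes are skipped.
 * Returns 0 on success, -1 when an attribute header is malformed.
 */
static int fs_event_parse(const unsigned char *buf, size_t len,
			  struct fs_event_info *out)
{
	size_t off = 0;

	memset(out, 0, sizeof(*out));

	while (off < len && len - off >= sizeof(struct tlv_hdr)) {
		struct tlv_hdr h;
		uint32_t val;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > len - off)
			return -1;

		if (h.len == sizeof(h) + sizeof(val)) {
			memcpy(&val, buf + off + sizeof(h), sizeof(val));
			switch (h.type) {
			case FS_NL_A_EVENT_ID:
				out->event_id = val;
				break;
			case FS_NL_A_DEV_MAJOR:
				out->dev_major = val;
				break;
			case FS_NL_A_DEV_MINOR:
				out->dev_minor = val;
				break;
			default:
				break;
			}
		}

		off += TLV_ALIGN(h.len);
	}

	return 0;
}
```

The major:minor pair obtained this way can be matched against the third field of /proc/self/mountinfo to find the mount the event refers to.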
>> +
>> +
>> +  0                   1                   2                   3
>> +  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> + |                    NETLINK MESSAGE HEADER                     |
>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> + |                GENERIC NETLINK MESSAGE HEADER                 |
>> + |         (with FS_NL_C_EVENT as genlmsghdr cmd field)          |
>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> + |             Optional user specific message header             |
>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> + |                   GENERIC MESSAGE PAYLOAD:                    |
>> + +---------------------------------------------------------------+
>> + |                  FS_NL_A_EVENT_ID (NLA_U32)                   |
>> + +---------------------------------------------------------------+
>> + |                  FS_NL_A_DEV_MAJOR (NLA_U32)                  |
>> + +---------------------------------------------------------------+
>> + |                  FS_NL_A_DEV_MINOR (NLA_U32)                  |
> ...
>> +
>> +static int create_common_msg(struct sk_buff *skb, void *data)
>> +{
>> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
>> +	struct super_block *sb = en->sb;
>> +
>> +	if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
>> +	    || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
>> +		return -EINVAL;
> What about diskless (nfs, cifs, etc.) filesystems? btrfs also has no
> valid sb->s_dev

Those are using the anon ids, generated by get_anon_bdev (through
set_anon_super). This id will be visible in /proc/self/mountinfo
or through stat, e.g.:

30 22 0:21 / /root/fake_fs/btrfs rw,realtime - btrfs /dev/loop4 rw,nospace_cache

Best Regards
Beata

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 From: Steve French Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Wed, 24 Jun 2015 11:26:27 -0500 Message-ID: References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <87oak5ebmx.fsf@openvz.org> <558ACD3A.2020508@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Return-path: In-Reply-To: <558ACD3A.2020508@samsung.com> Sender: linux-fsdevel-owner@vger.kernel.org To: Beata Michalska Cc: Dmitry Monakhov , LKML , linux-fsdevel , "linux-api@vger.kernel.org" , Greg Kroah-Hartman , Jan Kara , Theodore Ts'o , adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, Christoph Hellwig , "linux-ext4@vger.kernel.org" , linux-mm , kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org On Wed, Jun 24, 2015 at 10:31 AM, Beata Michalska wrote: > On 06/24/2015 10:47 AM, Dmitry Monakhov wrote: >> Beata Michalska writes: >> >>> Introduce configurable generic interface for file >>> system-wide event notifications, to provide file >>> systems with a common way of reporting any potential >>> issues as they emerge. >>> >>> The notifications are to be issued through generic >>> netlink interface by newly introduced multicast group. >>> >>> Threshold notifications have been included, allowing >>> triggering an event whenever the amount of free space drops >>> below a certain level - or levels to be more precise as two >>> of them are being supported: the lower and the upper range. >>> The notifications work both ways: once the threshold level >>> has been reached, an event shall be generated whenever >>> the number of available blocks goes up again re-activating >>> the threshold. >>> >>> The interface has been exposed through a vfs. Once mounted, >>> it serves as an entry point for the set-up where one can >>> register for particular file system events. 
>>> ...
>
>>> +
>>> +static int create_common_msg(struct sk_buff *skb, void *data)
>>> +{
>>> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
>>> +	struct super_block *sb = en->sb;
>>> +
>>> +	if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
>>> +	    || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
>>> +		return -EINVAL;
>> What about diskless (nfs, cifs, etc.) filesystems? btrfs also has no
>> valid sb->s_dev

And note that filesystem notifications and also file/directory change
notifications are particularly useful in the case of a network file
system (and heavily used by Windows desktop, Mac etc.), since when a
file is shared a user may not necessarily know that a file (or file
system as a whole) changed via another client (or on the server, or on
the server via a different protocol, e.g. SMB3 vs NFSv4), but is more
likely to know about local changes to the same file. In some sense
the users of mounts on network file systems get more benefit from
notifications than users of a mount on a local file system would.
-- Thanks, Steve From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Fri, 26 Jun 2015 09:30:36 +0200 Message-ID: <558CFF9C.20700@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <87oak5ebmx.fsf@openvz.org> <558ACD3A.2020508@samsung.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Return-path: In-reply-to: Sender: owner-linux-mm@kvack.org To: Steve French Cc: Dmitry Monakhov , LKML , linux-fsdevel , "linux-api@vger.kernel.org" , Greg Kroah-Hartman , Jan Kara , Theodore Ts'o , adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, Christoph Hellwig , "linux-ext4@vger.kernel.org" , linux-mm , kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org On 06/24/2015 06:26 PM, Steve French wrote: > On Wed, Jun 24, 2015 at 10:31 AM, Beata Michalska > wrote: >> On 06/24/2015 10:47 AM, Dmitry Monakhov wrote: >>> Beata Michalska writes: >>> >>>> Introduce configurable generic interface for file >>>> system-wide event notifications, to provide file >>>> systems with a common way of reporting any potential >>>> issues as they emerge. >>>> >>>> The notifications are to be issued through generic >>>> netlink interface by newly introduced multicast group. >>>> >>>> Threshold notifications have been included, allowing >>>> triggering an event whenever the amount of free space drops >>>> below a certain level - or levels to be more precise as two >>>> of them are being supported: the lower and the upper range. >>>> The notifications work both ways: once the threshold level >>>> has been reached, an event shall be generated whenever >>>> the number of available blocks goes up again re-activating >>>> the threshold. >>>> >>>> The interface has been exposed through a vfs. 
>>>> ...
>>
>>>> +
>>>> +static int create_common_msg(struct sk_buff *skb, void *data)
>>>> +{
>>>> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
>>>> +	struct super_block *sb = en->sb;
>>>> +
>>>> +	if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
>>>> +	    || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
>>>> +		return -EINVAL;
>>> What about diskless (nfs, cifs, etc.) filesystems? btrfs also has no
>>> valid sb->s_dev
>
> And note that filesystem notifications and also file/directory change
> notifications are particularly useful in the case of a network file
> system (and heavily used by Windows desktop, Mac etc.), since when a
> file is shared a user may not necessarily know that a file (or file
> system as a whole) changed via another client (or on the server, or on
> the server via a different protocol, e.g. SMB3 vs NFSv4), but is more
> likely to know about local changes to the same file. In some sense
> the users of mounts on network file systems get more benefit from
> notifications than users of a mount on a local file system would.
>

As for the network file systems...
As has been pointed out, there are some serious scalability/performance
issues with the current version of the events interface. As has also been
suggested, I plan to modify the way the threshold notifications are handled
by shifting the responsibility for tracking the amount of available space
to the file systems, which would be queried for an update. Thus I'm wondering
whether this will result in yet another issue in the case of network file
systems, as for them handling such a query means asking the server for an
update (there is basically no caching on the client side).

Best Regards
Beata

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bartlomiej Zolnierkiewicz
Subject: Re: [RFC v3 2/4] ext4: Add helper function to mark group as corrupted
Date: Wed, 22 Jul 2015 12:40:03 +0200
Message-ID: <3417027.tdShitEpvE@amdc1976>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-3-git-send-email-b.michalska@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7Bit
Return-path:
In-reply-to: <1434460173-18427-3-git-send-email-b.michalska@samsung.com>
Sender: owner-linux-mm@kvack.org
To: Beata Michalska , tytso@mit.edu
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org
List-Id: linux-api@vger.kernel.org

Hi,

On Tuesday, June 16, 2015 03:09:31 PM Beata Michalska wrote:
> Add ext4_mark_group_corrupted helper function to
> simplify the code and to keep the logic in one place.
>
> Signed-off-by: Beata Michalska

This small cleanup patch is not really required for your notifications
framework to work and it seems to be a good change on its own. Maybe it
can be merged independently of other patches? Ted, what is your opinion
on it?

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics

> ---
> fs/ext4/balloc.c | 15 +++------------
> fs/ext4/ext4.h | 9 +++++++++
> fs/ext4/ialloc.c | 5 +----
> fs/ext4/mballoc.c | 11 ++---------
> 4 files changed, 15 insertions(+), 25 deletions(-)
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 83a6f49..e95b27a 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -193,10 +193,7 @@ static int ext4_init_block_bitmap(struct super_block *sb,
> 	 * essentially implementing a per-group read-only flag.
*/ > if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) { > grp = ext4_get_group_info(sb, block_group); > - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) > - percpu_counter_sub(&sbi->s_freeclusters_counter, > - grp->bb_free); > - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); > + ext4_mark_group_corrupted(sbi, grp); > if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) { > int count; > count = ext4_free_inodes_count(sb, gdp); > @@ -379,20 +376,14 @@ static void ext4_validate_block_bitmap(struct super_block *sb, > ext4_unlock_group(sb, block_group); > ext4_error(sb, "bg %u: block %llu: invalid block bitmap", > block_group, blk); > - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) > - percpu_counter_sub(&sbi->s_freeclusters_counter, > - grp->bb_free); > - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); > + ext4_mark_group_corrupted(sbi, grp); > return; > } > if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group, > desc, bh))) { > ext4_unlock_group(sb, block_group); > ext4_error(sb, "bg %u: bad block bitmap checksum", block_group); > - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) > - percpu_counter_sub(&sbi->s_freeclusters_counter, > - grp->bb_free); > - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); > + ext4_mark_group_corrupted(sbi, grp); > return; > } > set_buffer_verified(bh); > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index f63c3d5..163afe2 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -2535,6 +2535,15 @@ static inline spinlock_t *ext4_group_lock_ptr(struct super_block *sb, > return bgl_lock_ptr(EXT4_SB(sb)->s_blockgroup_lock, group); > } > > +static inline > +void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, > + struct ext4_group_info *grp) > +{ > + if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) > + percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); > + set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); > +} > + > /* > * Returns true if the filesystem is busy enough that attempts to > * access the 
block group locks has run into contention. > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c > index ac644c3..ebe0499 100644 > --- a/fs/ext4/ialloc.c > +++ b/fs/ext4/ialloc.c > @@ -79,10 +79,7 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb, > if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) { > ext4_error(sb, "Checksum bad for group %u", block_group); > grp = ext4_get_group_info(sb, block_group); > - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) > - percpu_counter_sub(&sbi->s_freeclusters_counter, > - grp->bb_free); > - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); > + ext4_mark_group_corrupted(sbi, grp); > if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) { > int count; > count = ext4_free_inodes_count(sb, gdp); > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index 8d1e602..24a4b6d 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -760,10 +760,7 @@ void ext4_mb_generate_buddy(struct super_block *sb, > * corrupt and update bb_free using bitmap value > */ > grp->bb_free = free; > - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) > - percpu_counter_sub(&sbi->s_freeclusters_counter, > - grp->bb_free); > - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); > + ext4_mark_group_corrupted(sbi, grp); > } > mb_set_largest_free_order(sb, grp); > > @@ -1448,12 +1445,8 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b, > "freeing already freed block " > "(bit %u); block bitmap corrupt.", > block); > - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)) > - percpu_counter_sub(&sbi->s_freeclusters_counter, > - e4b->bd_info->bb_free); > /* Mark the block group as corrupt. */ > - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, > - &e4b->bd_info->bb_state); > + ext4_mark_group_corrupted(sbi, e4b->bd_info); > mb_regenerate_buddy(e4b); > goto done; > } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bartlomiej Zolnierkiewicz
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Wed, 22 Jul 2015 17:55:34 +0200
Message-ID: <6913836.Rhse3j9PM4@amdc1976>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7Bit
Return-path:
In-reply-to: <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
Sender: owner-linux-mm@kvack.org
To: Beata Michalska
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org
List-Id: linux-api@vger.kernel.org

Hi,

Some comments below.

On Tuesday, June 16, 2015 03:09:30 PM Beata Michalska wrote:
> Introduce configurable generic interface for file
> system-wide event notifications, to provide file
> systems with a common way of reporting any potential
> issues as they emerge.
>
> The notifications are to be issued through generic
> netlink interface by newly introduced multicast group.
>
> Threshold notifications have been included, allowing
> triggering an event whenever the amount of free space drops
> below a certain level - or levels to be more precise as two
> of them are being supported: the lower and the upper range.
> The notifications work both ways: once the threshold level
> has been reached, an event shall be generated whenever
> the number of available blocks goes up again re-activating
> the threshold.
>
> The interface has been exposed through a vfs. Once mounted,
> it serves as an entry point for the set-up where one can
> register for particular file system events.
> > Signed-off-by: Beata Michalska > --- > Documentation/filesystems/events.txt | 232 ++++++++++ > fs/Kconfig | 2 + > fs/Makefile | 1 + > fs/events/Kconfig | 7 + > fs/events/Makefile | 5 + > fs/events/fs_event.c | 809 ++++++++++++++++++++++++++++++++++ > fs/events/fs_event.h | 22 + > fs/events/fs_event_netlink.c | 104 +++++ > fs/namespace.c | 1 + > include/linux/fs.h | 6 +- > include/linux/fs_event.h | 72 +++ > include/uapi/linux/Kbuild | 1 + > include/uapi/linux/fs_event.h | 58 +++ > 13 files changed, 1319 insertions(+), 1 deletion(-) > create mode 100644 Documentation/filesystems/events.txt > create mode 100644 fs/events/Kconfig > create mode 100644 fs/events/Makefile > create mode 100644 fs/events/fs_event.c > create mode 100644 fs/events/fs_event.h > create mode 100644 fs/events/fs_event_netlink.c > create mode 100644 include/linux/fs_event.h > create mode 100644 include/uapi/linux/fs_event.h > > diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt > new file mode 100644 > index 0000000..c2e6227 > --- /dev/null > +++ b/Documentation/filesystems/events.txt > @@ -0,0 +1,232 @@ > + > + Generic file system event notification interface > + > +Document created 23 April 2015 by Beata Michalska > + > +1. The reason behind: > +===================== > + > +There are many corner cases when things might get messy with the filesystems. > +And it is not always obvious what and when went wrong. Sometimes you might > +get some subtle hints that there is something going on - but by the time > +you realise it, it might be too late as you are already out-of-space > +or the filesystem has been remounted as read-only (i.e.). The generic > +interface for the filesystem events fills the gap by providing a rather > +easy way of real-time notifications triggered whenever something interesting > +happens, allowing filesystems to report events in a common way, as they occur. > + > +2. 
How does it work:
> +====================
> +
> +The interface itself has been exposed as fstrace-type Virtual File System,
> +primarily to ease the process of setting up the configuration for the
> +notifications. So for starters, it needs to get mounted (obviously):
> +
> +	mount -t fstrace none /sys/fs/events
> +
> +This will unveil the single fstrace filesystem entry - the 'config' file,
> +through which the notification are being set-up.

The patch creates a separate virtual filesystem for a single file; this is
overkill IMHO, and a new sysfs or debugfs entry should be sufficient.

> +
> +Activating notifications for particular filesystem is as straightforward
> +as writing into the 'config' file. Note that by default all events, despite
> +the actual filesystem type, are being disregarded.
> +
> +Synopsis of config:
> +------------------
> +
> +	MOUNT EVENT_TYPE [L1] [L2]

OTOH, why not use the advantages of having a separate virtual filesystem
and create separate directories for each mount point (+ maybe even extra
parent directories for mount namespaces) and put separate entries for each
event type in these directories? This would also allow usage of the
eventfd() notification interface on such files.

Please take a look at:

	tools/cgroup/cgroup_event_listener.c

and

	Documentation/cgroups/memcg_test.txt (point 9.10)

to see how much easier it is to observe memory usage thresholds on memory
cgroups compared to available blocks on filesystems using fs events.

Also while at it please add your example user-space code (posted on request
in some other mail) to tools/fs_events/ (preferably in a separate patch).
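For illustration only (not code from this thread): a minimal C sketch of
the eventfd() wait pattern suggested above. In the memcg case a listener
registers an eventfd against a control file and then blocks in read()
until the kernel signals it. No per-event files exist for fs events in
this series, so the sketch covers only the eventfd side; wait_for_event
and signal_event are hypothetical helper names, and signal_event merely
stands in for the kernel-side signal.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/*
 * Block until the eventfd is signalled and return the accumulated
 * counter value (reading resets it to zero). Returns 0 on error.
 */
uint64_t wait_for_event(int efd)
{
	uint64_t count = 0;

	assert(efd >= 0);
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}

/*
 * Stand-in for the kernel signalling the listener: adds n to the
 * eventfd counter. Returns 0 on success, -1 on error.
 */
int signal_event(int efd, uint64_t n)
{
	assert(efd >= 0);
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}
```

A listener would create the descriptor with eventfd(0, 0), hand it to the
kernel through whatever registration file the interface provides, and then
loop on wait_for_event(); several signals that arrive before the read are
collapsed into a single wake-up carrying their sum.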
> +
> + MOUNT : the filesystem's mount point
> + EVENT_TYPE : event types - currently two of them are being supported:
> +
> + * generic events ("G") covering most common warnings
> + and errors that might be reported by any filesystem;
> + this option does not take any arguments;

fs_event.h in the uapi dir allows the following events:

/*
 * Supported set of FS events
 */
enum {
	FS_EVENT_NONE,
	FS_WARN_ENOSPC,		/* No space left to reserve data blks */
	FS_WARN_ENOSPC_META,	/* No space left for metadata */
	FS_THR_LRBELOW,		/* The threshold lower range has been reached */
	FS_THR_LRABOVE,		/* The threshold lower range re-activcated*/
	FS_THR_URBELOW,
	FS_THR_URABOVE,
	FS_ERR_REMOUNT_RO,	/* The file system has been remounted as RO */
	FS_ERR_CORRUPTED	/* Critical error - fs corrupted */
};

For non-threshold related events the current interface only allows
either all or none of the events to be enabled, i.e. you cannot
selectively enable notification on FS_WARN_ENOSPC but not on
FS_ERR_REMOUNT_RO.

I also think that the configuration interface should be made to match
the notification interface when it comes to event types.

> +
> + * threshold notifications ("T") - events sent whenever
> + the amount of available space drops below certain level;
> + it is possible to specify two threshold levels though
> + only one is required to properly setup the notifications;
> + as those refer to the number of available blocks, the lower
> + level [L1] needs to be higher than the upper one [L2]

Why is there a limitation of only two thresholds? It should be
relatively easy to make the code support an unlimited number of
thresholds.

> +
> +Sample request could look like the following:
> +
> + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
> +
> +Multiple request might be specified provided they are separated with semicolon.

s/request/requests/

I think that allowing multiple event types and requests in one
configuration request is not a good idea.
Currently parsing code is relatively simple but once somebody decides to enhance the interface with new event types the parsing code may get complex & ugly. > + > +The configuration itself might be modified at any time. One can add/remove > +particular event types for given fielsystem, modify the threshold levels, s/fielsystem/filesystem/ > +and remove single or all entries from the 'config' file. > + > + - Adding new event type: > + > + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config > + > +(Note that is is enough to provide the event type to be enabled without s/is is/is/ > +the already set ones.) > + > + - Removing event type: > + > + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config > + > + - Updating threshold limits: > + > + $ echo MOUNT T L1 L2 > /sys/fs/events/config > + > + - Removing single entry: > + > + $ echo '!MOUNT' > /sys/fs/events/config > + > + - Removing all entries: > + > + $ echo > /sys/fs/events/config > + > +Reading the file will list all registered entries with their current set-up > +along with some additional info like the filesystem type and the backing device > +name if available. > + > +Final, though a very important note on the configuration: when and if the > +actual events are being triggered falls way beyond the scope of the generic > +filesystem events interface. It is up to a particular filesystem > +implementation which events are to be supported - if any at all. So if > +given filesystem does not support the event notifications, an attempt to > +enable those through 'config' file will fail. > + > + > +3. The generic netlink interface support: > +========================================= > + > +Whenever an event notification is triggered (by given filesystem) the current > +configuration is being validated to decide whether a userpsace notification s/userpsace/userspace/ > +should be launched. If there has been no request (in a mean of 'config' file > +entry) for given event, one will be silently disregarded. 
If, on the other > +hand, someone is 'watching' given filesystem for specific events, a generic > +netlink message will be sent. A dedicated multicast group has been provided > +solely for this purpose so in order to receive such notifications, one should > +subscribe to this new multicast group. As for now only the init network > +namespace is being supported. > + > +3.1 Message format > + > +The FS_NL_C_EVENT shall be stored within the generic netlink message header > +as the command field. The message payload will provide more detailed info: > +the backing device major and minor numbers, the event code and the id of > +the process which action led to the event occurrence. In case of threshold > +notifications, the current number of available blocks will be included > +in the payload as well. > + > + > + 0 1 2 3 > + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | NETLINK MESSAGE HEADER | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | GENERIC NETLINK MESSAGE HEADER | > + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | Optional user specific message header | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | GENERIC MESSAGE PAYLOAD: | > + +---------------------------------------------------------------+ > + | FS_NL_A_EVENT_ID (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DEV_MAJOR (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DEV_MINOR (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_CAUSED_ID (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DATA (NLA_U64) | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + > + > +The above figure 
is based on: > + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format > + > + > +4. API Reference: > +================= > + > + 4.1 Generic file system event interface data & operations > + > + #include > + > + struct fs_trace_info { > + void __rcu *e_priv /* READ ONLY */ It should be marked as private for fs events core code and not for use by filesystems' code. If possible it would be the best to move it out of this struct, > + unsigned int events_cap_mask; /* Supported notifications */ > + const struct fs_trace_operations *ops; > + }; > + > + struct fs_trace_operations { > + void (*query)(struct super_block *, u64 *); > + }; > + > + In order to get the fireworks and stuff, each filesystem needs to setup > + the events_cap_mask field of the fs_trace_info structure, which has been > + embedded within the super_block structure. This should reflect the type of > + events the filesystem wants to support. In case of threshold notifications, > + apart from setting the FS_EVENT_THRESH flag, the 'query' callback should > + be provided as this enables the events interface to get the up-to-date > + state of the number of available blocks whenever those notifications are > + being requested. > + > + The 'e_priv' field of the fs_trace_info structure should be completely ignored > + as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you > + do not want to get yourself into some real trouble. If still, you are tempted > + to do so - feel free, it's gonna be pure fun. Consider yourself warned. > + > + > + 4.2 Event notification: > + > + #include > + void fs_event_notify(struct super_block *sb, unsigned int event_id); > + > + Notify the generic FS event interface of an occurring event. > + This shall be used by any file system that wishes to inform any potential > + listeners/watchers of a particular event. 
> + - sb: the filesystem's super block > + - event_id: an event identifier > + > + 4.3 Threshold notifications: > + > + #include > + void fs_event_alloc_space(struct super_block *sb, u64 ncount); > + void fs_event_free_space(struct super_block *sb, u64 ncount); > + > + Each filesystme supporting the threshold notifications should call > + fs_event_alloc_space/fs_event_free_space respectively whenever the > + amount of available blocks changes. > + - sb: the filesystem's super block > + - ncount: number of blocks being acquired/released > + > + Note that to properly handle the threshold notifications the fs events > + interface needs to be kept up to date by the filesystems. Each should > + register fs_trace_operations to enable querying the current number of > + available blocks. > + > + 4.4 Sending message through generic netlink interface > + > + #include > + > + int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata); > + > + Although the fs event interface is fully responsible for sending the messages > + over the netlink, filesystems might use the FS_EVENT multicast group to send > + their own custom messages. > + - size: the size of the message payload > + - event_id: the event identifier > + - compose_msg: a callback responsible for filling-in the message payload > + - cbdata: message custom data > + > + Calling fs_netlink_send_event will result in a message being sent by > + the FS_EVENT multicast group. Note that the body of the message should be > + prepared (set-up )by the caller - through compose_msg callback. The message's (set-up) > + sk_buff will be allocated on behalf of the caller (thus the size parameter). > + The compose_msg should only fill the payload with proper data. Unless > + the event id is specified as FS_EVENT_NONE, it's value shall be added > + to the payload prior to calling the compose_msg. 
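For illustration only (not code from this thread): a small C sketch of how
the 'size' argument could be derived for the attribute layout documented in
section 3.1. It mirrors the usual netlink attribute arithmetic (4-byte
attribute header, payload padded to a 4-byte boundary) in user space. The
ATTR_* macros and both function names are hypothetical, and whether the
caller or fs_netlink_send_event accounts for FS_NL_A_EVENT_ID is an
assumption (the sketch counts it). Note that the sketch follows the
documented NLA_U32 type for FS_NL_A_CAUSED_ID, whereas the quoted
create_common_msg() emits that attribute with nla_put_u64().

```c
#include <assert.h>
#include <stddef.h>

/* Netlink attribute layout: u16 length + u16 type, 4-byte alignment. */
#define ATTR_ALIGNTO	((size_t)4)
#define ATTR_ALIGN(len)	(((len) + ATTR_ALIGNTO - 1) & ~(ATTR_ALIGNTO - 1))
#define ATTR_HDRLEN	ATTR_ALIGN((size_t)4)

/* Space taken by one attribute: header plus payload, padded. */
size_t attr_total_size(size_t payload)
{
	/* The attribute length field is only 16 bits wide. */
	assert(payload <= 65535 - ATTR_HDRLEN);
	return ATTR_ALIGN(ATTR_HDRLEN + payload);
}

/*
 * Payload size for the documented layout: four u32 attributes
 * (event id, dev major, dev minor, caused id) and, for threshold
 * notifications only, one u64 attribute (FS_NL_A_DATA).
 */
size_t fs_event_payload_size(int with_data)
{
	size_t size = 4 * attr_total_size(4);

	if (with_data)
		size += attr_total_size(8);
	return size;
}
```

Under these assumptions a generic event needs 32 bytes of attribute space
and a threshold event 44 bytes, on top of the netlink and generic netlink
headers.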
> + > + > diff --git a/fs/Kconfig b/fs/Kconfig > index ec35851..a89e678 100644 > --- a/fs/Kconfig > +++ b/fs/Kconfig > @@ -69,6 +69,8 @@ config FILE_LOCKING > for filesystems like NFS and for the flock() system > call. Disabling this option saves about 11k. > > +source "fs/events/Kconfig" > + > source "fs/notify/Kconfig" > > source "fs/quota/Kconfig" > diff --git a/fs/Makefile b/fs/Makefile > index a88ac48..bcb3048 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules > obj-$(CONFIG_CEPH_FS) += ceph/ > obj-$(CONFIG_PSTORE) += pstore/ > obj-$(CONFIG_EFIVAR_FS) += efivarfs/ > +obj-$(CONFIG_FS_EVENTS) += events/ > diff --git a/fs/events/Kconfig b/fs/events/Kconfig > new file mode 100644 > index 0000000..1c60195 > --- /dev/null > +++ b/fs/events/Kconfig > @@ -0,0 +1,7 @@ > +# Generic Files System events interface > +config FS_EVENTS > + bool "Generic filesystem events" > + select NET > + default y Do we really want to default to yes? [ If so then maybe we want to make the config option visible only when EXPERT mode is enabled? ] > + help > + Enable generic filesystem events interface Please enhance the help entry. > diff --git a/fs/events/Makefile b/fs/events/Makefile > new file mode 100644 > index 0000000..9c98337 > --- /dev/null > +++ b/fs/events/Makefile > @@ -0,0 +1,5 @@ > +# > +# Makefile for the Linux Generic File System Event Interface > +# > + > +obj-y := fs_event.o fs_event_netlink.o > diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c > new file mode 100644 > index 0000000..1037311 > --- /dev/null > +++ b/fs/events/fs_event.c > @@ -0,0 +1,809 @@ > +/* > + * Generic File System Evens Interface > + * > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. 
> + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "../pnode.h" > +#include "fs_event.h" > + > +static LIST_HEAD(fs_trace_list); > +static DEFINE_MUTEX(fs_trace_lock); > + > +static struct kmem_cache *fs_trace_cachep __read_mostly; > + > +static atomic_t stray_traces = ATOMIC_INIT(0); > +static DECLARE_WAIT_QUEUE_HEAD(trace_wq); > +/* > + * Threshold notification state bits. > + * Note the reverse as this refers to the number > + * of available blocks. > + */ > +#define THRESH_LR_BELOW 0x0001 /* Falling below the lower range */ > +#define THRESH_LR_BEYOND 0x0002 > +#define THRESH_UR_BELOW 0x0004 > +#define THRESH_UR_BEYOND 0x0008 /* Going beyond the upper range */ > + > +#define THRESH_LR_ON (THRESH_LR_BELOW | THRESH_LR_BEYOND) > +#define THRESH_UR_ON (THRESH_UR_BELOW | THRESH_UR_BEYOND) > + > +#define FS_TRACE_ADD 0x100000 > + > +struct fs_trace_entry { > + struct kref count; > + atomic_t active; > + struct super_block *sb; > + unsigned int notify; > + struct path mnt_path; > + struct list_head node; > + > + struct fs_event_thresh { > + u64 avail_space; > + u64 lrange; > + u64 urange; > + unsigned int state; > + } th; > + struct rcu_head rcu_head; > + spinlock_t lock; > +}; > + > +static const match_table_t fs_etypes = { > + { FS_EVENT_GENERIC, "G" }, > + { FS_EVENT_THRESH, "T" }, > + { 0, NULL }, > +}; > + > +static inline int fs_trace_query_data(struct super_block *sb, > + struct fs_trace_entry *en) > +{ > + if (sb->s_etrace.ops && sb->s_etrace.ops->query) { > + sb->s_etrace.ops->query(sb, 
&en->th.avail_space); > + return 0; > + } > + > + return -EINVAL; > +} > + > +static inline void fs_trace_entry_free(struct fs_trace_entry *en) I don't see a real need for this wrapper (it is used only once). > +{ > + kmem_cache_free(fs_trace_cachep, en); > +} > + > +static void fs_destroy_trace_entry(struct kref *en_ref) > +{ > + struct fs_trace_entry *en = container_of(en_ref, > + struct fs_trace_entry, count); > + > + /* Last reference has been dropped */ > + fs_trace_entry_free(en); > + atomic_dec(&stray_traces); > +} > + > +static void fs_trace_entry_put(struct fs_trace_entry *en) > +{ > + kref_put(&en->count, fs_destroy_trace_entry); > +} > + > +static void fs_release_trace_entry(struct rcu_head *rcu_head) > +{ > + struct fs_trace_entry *en = container_of(rcu_head, > + struct fs_trace_entry, > + rcu_head); > + /* > + * As opposed to typical reference drop, this one is being > + * called from the rcu callback. This is to make sure all > + * readers have managed to safely grab the reference before > + * the change to rcu pointer is visible to all and before > + * the reference is dropped here. > + */ > + fs_trace_entry_put(en); > +} > + > +static void fs_drop_trace_entry(struct fs_trace_entry *en) > +{ > + struct super_block *sb; > + > + lockdep_assert_held(&fs_trace_lock); > + /* > + * The trace entry might have already been removed > + * from the list of active traces with the proper > + * ref drop, though it was still in use handling > + * one of the fs events. This means that the object > + * has been already scheduled for being released. > + * So leave... > + */ > + > + if (!atomic_add_unless(&en->active, -1, 0)) > + return; > + /* > + * At this point the trace entry is being marked as inactive > + * so no new references will be allowed. > + * Still it might be floating around somewhere > + * so drop the reference when the rcu readers are done. 
> + */ > + spin_lock(&en->lock); > + list_del(&en->node); > + sb = en->sb; > + en->sb = NULL; > + spin_unlock(&en->lock); > + > + rcu_assign_pointer(sb->s_etrace.e_priv, NULL); > + call_rcu(&en->rcu_head, fs_release_trace_entry); > + /* It's safe now to drop the reference to the super */ > + deactivate_super(sb); > + atomic_inc(&stray_traces); > +} > + > +static inline > +struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en) > +{ > + if (en) { > + if (!kref_get_unless_zero(&en->count)) > + return NULL; > + /* Don't allow referencing inactive object */ > + if (!atomic_read(&en->active)) { > + fs_trace_entry_put(en); > + return NULL; > + } > + } > + return en; > +} > + > +static struct fs_trace_entry *fs_trace_entry_get_rcu(struct super_block *sb) > +{ > + struct fs_trace_entry *en; > + > + if (!sb) > + return NULL; > + > + rcu_read_lock(); > + en = rcu_dereference(sb->s_etrace.e_priv); > + en = fs_trace_entry_get(en); > + rcu_read_unlock(); > + > + return en; > +} > + > +static int fs_remove_trace_entry(struct super_block *sb) > +{ > + struct fs_trace_entry *en; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return -EINVAL; > + > + mutex_lock(&fs_trace_lock); > + fs_drop_trace_entry(en); > + mutex_unlock(&fs_trace_lock); > + fs_trace_entry_put(en); > + return 0; > +} > + > +static void fs_remove_all_traces(void) > +{ > + struct fs_trace_entry *en, *guard; > + > + mutex_lock(&fs_trace_lock); > + list_for_each_entry_safe(en, guard, &fs_trace_list, node) > + fs_drop_trace_entry(en); > + mutex_unlock(&fs_trace_lock); > +} > + > +static int create_common_msg(struct sk_buff *skb, void *data) > +{ > + struct fs_trace_entry *en = (struct fs_trace_entry *)data; > + struct super_block *sb = en->sb; > + > + if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev)) > + || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev))) > + return -EINVAL; > + > + if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current)))) > + return -EINVAL; > + > + 
return 0; > +} > + > +static int create_thresh_msg(struct sk_buff *skb, void *data) > +{ > + struct fs_trace_entry *en = (struct fs_trace_entry *)data; > + int ret; > + > + ret = create_common_msg(skb, data); > + if (!ret) > + ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space); > + return ret; > +} > + > +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id) > +{ > + size_t size = nla_total_size(sizeof(u32)) * 2 + > + nla_total_size(sizeof(u64)); > + > + fs_netlink_send_event(size, event_id, create_common_msg, en); > +} > + > +static void fs_event_send_thresh(struct fs_trace_entry *en, > + unsigned int event_id) > +{ > + size_t size = nla_total_size(sizeof(u32)) * 2 + > + nla_total_size(sizeof(u64)) * 2; > + > + fs_netlink_send_event(size, event_id, create_thresh_msg, en); > +} > + > +void fs_event_notify(struct super_block *sb, unsigned int event_id) > +{ > + struct fs_trace_entry *en; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; > + > + spin_lock(&en->lock); > + if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC)) > + fs_event_send(en, event_id); > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > +} > +EXPORT_SYMBOL(fs_event_notify); > + > +void fs_event_alloc_space(struct super_block *sb, u64 ncount) > +{ > + struct fs_trace_entry *en; > + s64 count; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; > + > + spin_lock(&en->lock); > + > + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) > + goto leave; > + /* > + * we shouldn't drop below 0 here, > + * unless there is a sync issue somewhere (?) > + */ > + count = en->th.avail_space - ncount; > + en->th.avail_space = count < 0 ? 
0 : count; > + > + if (en->th.avail_space > en->th.lrange) > + /* Not 'even' close - leave */ > + goto leave; > + > + if (en->th.avail_space > en->th.urange) { > + /* Close enough - the lower range has been reached */ > + if (!(en->th.state & THRESH_LR_BEYOND)) { > + /* Send notification */ > + fs_event_send_thresh(en, FS_THR_LRBELOW); > + en->th.state &= ~THRESH_LR_BELOW; > + en->th.state |= THRESH_LR_BEYOND; > + } > + goto leave; > + } > + if (!(en->th.state & THRESH_UR_BEYOND)) { > + fs_event_send_thresh(en, FS_THR_URBELOW); > + en->th.state &= ~THRESH_UR_BELOW; > + en->th.state |= THRESH_UR_BEYOND; > + } > + > +leave: > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > +} > +EXPORT_SYMBOL(fs_event_alloc_space); > + > +void fs_event_free_space(struct super_block *sb, u64 ncount) > +{ > + struct fs_trace_entry *en; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; > + > + spin_lock(&en->lock); > + > + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) > + goto leave; > + > + en->th.avail_space += ncount; > + > + if (en->th.avail_space > en->th.lrange) { > + if (!(en->th.state & THRESH_LR_BELOW) > + && en->th.state & THRESH_LR_BEYOND) { > + /* Send notification */ > + fs_event_send_thresh(en, FS_THR_LRABOVE); > + en->th.state &= ~(THRESH_LR_BEYOND|THRESH_UR_BEYOND); > + en->th.state |= THRESH_LR_BELOW; > + goto leave; > + } > + } > + if (en->th.avail_space > en->th.urange) { > + if (!(en->th.state & THRESH_UR_BELOW) > + && en->th.state & THRESH_UR_BEYOND) { > + /* Notify */ > + fs_event_send_thresh(en, FS_THR_URABOVE); > + en->th.state &= ~THRESH_UR_BEYOND; > + en->th.state |= THRESH_UR_BELOW; > + } > + } > +leave: > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > +} > +EXPORT_SYMBOL(fs_event_free_space); > + > +void fs_event_mount_dropped(struct vfsmount *mnt) > +{ > + /* > + * The mount is dropped but the super might not get released > + * at once so there is very small chance some notifications > + * will come 
through. > + * Note that the mount being dropped here might belong to a different > + * namespace - if this is the case, just ignore it. > + */ > + struct fs_trace_entry *en = fs_trace_entry_get_rcu(mnt->mnt_sb); > + struct vfsmount *en_mnt; > + > + if (!en || !atomic_read(&en->active)) > + return; > + /* > + * The entry once set, does not change the mountpoint it's being > + * pinned to, so no need to take the lock here. > + */ > + en_mnt = en->mnt_path.mnt; > + if (!(real_mount(mnt)->mnt_ns != (real_mount(en_mnt))->mnt_ns)) > + fs_remove_trace_entry(mnt->mnt_sb); > + fs_trace_entry_put(en); > +} > + > +static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh, > + unsigned int nmask) > +{ > + struct fs_trace_entry *en; > + struct super_block *sb; > + struct mount *r_mnt; > + > + en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL); > + if (unlikely(!en)) > + return -ENOMEM; > + /* > + * Note that no reference is being taken here for the path as it would > + * make the unmount unnecessarily puzzling (due to an extra 'valid' > + * reference for the mnt). > + * This is *rather* safe as the notification on mount being dropped > + * will get called prior to releasing the super block - so right > + * in time to perform appropriate clean-up > + */ > + r_mnt = real_mount(path->mnt); > + > + en->mnt_path.dentry = r_mnt->mnt.mnt_root; > + en->mnt_path.mnt = &r_mnt->mnt; > + > + sb = path->mnt->mnt_sb; > + en->sb = sb; > + /* > + * Increase the refcount for sb to mark it's being relied on. > + * Note that the reference to path is taken by the caller, so it > + * is safe to assume there is at least single active reference > + * to super as well. 
> + */ > + atomic_inc(&sb->s_active); > + > + nmask &= sb->s_etrace.events_cap_mask; > + if (!nmask) > + goto leave; > + > + spin_lock_init(&en->lock); > + INIT_LIST_HEAD(&en->node); > + > + en->notify = nmask; > + memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state)); > + if (nmask & FS_EVENT_THRESH) > + fs_trace_query_data(sb, en); > + > + kref_init(&en->count); > + > + if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) { > + struct fs_trace_entry *prev_en; > + > + prev_en = fs_trace_entry_get_rcu(sb); > + if (prev_en) { > + WARN_ON(prev_en); > + fs_trace_entry_put(prev_en); > + goto leave; > + } > + } > + atomic_set(&en->active, 1); > + > + mutex_lock(&fs_trace_lock); > + list_add(&en->node, &fs_trace_list); > + mutex_unlock(&fs_trace_lock); > + > + rcu_assign_pointer(sb->s_etrace.e_priv, en); > + synchronize_rcu(); > + > + return 0; > +leave: > + deactivate_super(sb); > + kmem_cache_free(fs_trace_cachep, en); > + return -EINVAL; > +} > + > +static int fs_update_trace_entry(struct path *path, > + struct fs_event_thresh *thresh, > + unsigned int nmask) > +{ > + struct fs_trace_entry *en; > + struct super_block *sb; > + int extend = nmask & FS_TRACE_ADD; > + int ret = -EINVAL; > + > + en = fs_trace_entry_get_rcu(path->mnt->mnt_sb); > + if (!en) > + return (extend) ? 
fs_new_trace_entry(path, thresh, nmask) > + : -EINVAL; > + > + if (!atomic_read(&en->active)) > + return -EINVAL; > + > + nmask &= ~FS_TRACE_ADD; > + > + spin_lock(&en->lock); > + sb = en->sb; > + if (!sb || !(nmask & sb->s_etrace.events_cap_mask)) > + goto leave; > + > + if (nmask & FS_EVENT_THRESH) { > + if (extend) { > + /* Get the current state */ > + if (!(en->notify & FS_EVENT_THRESH)) > + if (fs_trace_query_data(sb, en)) > + goto leave; > + > + if (thresh->state & THRESH_LR_ON) { > + en->th.lrange = thresh->lrange; > + en->th.state &= ~THRESH_LR_ON; > + } > + > + if (thresh->state & THRESH_UR_ON) { > + en->th.urange = thresh->urange; > + en->th.state &= ~THRESH_UR_ON; > + } > + } else { > + memset(&en->th, 0, sizeof(en->th)); > + } > + } > + > + if (extend) > + en->notify |= nmask; > + else > + en->notify &= ~nmask; > + ret = 0; > +leave: > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > + return ret; > +} > + > +static int fs_parse_trace_request(int argc, char **argv) > +{ > + struct fs_event_thresh thresh = {0}; > + struct path path; > + substring_t args[MAX_OPT_ARGS]; > + unsigned int nmask = FS_TRACE_ADD; > + int token; > + char *s; > + int ret = -EINVAL; > + > + if (!argc) { > + fs_remove_all_traces(); > + return 0; > + } > + > + s = *(argv); > + if (*s == '!') { > + /* Clear the trace entry */ > + nmask &= ~FS_TRACE_ADD; > + ++s; > + } > + > + if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW)) > + return -EINVAL; > + > + if (!(--argc)) { > + if (!(nmask & FS_TRACE_ADD)) > + ret = fs_remove_trace_entry(path.mnt->mnt_sb); > + goto leave; > + } > + > +repeat: > + args[0].to = args[0].from = NULL; > + token = match_token(*(++argv), fs_etypes, args); > + if (!token && !nmask) > + goto leave; > + > + nmask |= token & FS_EVENTS_ALL; > + --argc; > + if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) { > + /* > + * Get the threshold config data: > + * lower range > + * upper range > + */ > + if (!argc) > + goto leave; > + > + ret = 
kstrtoull(*(++argv), 10, &thresh.lrange); > + if (ret) > + goto leave; > + thresh.state |= THRESH_LR_ON; > + if ((--argc)) { > + ret = kstrtoull(*(++argv), 10, &thresh.urange); > + if (ret) > + goto leave; > + thresh.state |= THRESH_UR_ON; > + --argc; > + } > + /* The thresholds are based on number of available blocks */ > + if (thresh.lrange < thresh.urange) { > + ret = -EINVAL; > + goto leave; > + } > + } > + if (argc) > + goto repeat; > + > + ret = fs_update_trace_entry(&path, &thresh, nmask); > +leave: > + path_put(&path); > + return ret; > +} > + > +#define DEFAULT_BUF_SIZE PAGE_SIZE > + > +static ssize_t fs_trace_write(struct file *file, const char __user *buffer, > + size_t count, loff_t *ppos) > +{ > + char **argv; > + char *kern_buf, *next, *cfg; > + size_t size, dcount = 0; > + int argc; > + > + if (!count) > + return 0; > + > + kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL); > + if (!kern_buf) > + return -ENOMEM; > + > + while (dcount < count) { > + > + size = count - dcount; > + if (size >= DEFAULT_BUF_SIZE) > + size = DEFAULT_BUF_SIZE - 1; > + if (copy_from_user(kern_buf, buffer + dcount, size)) { > + dcount = -EINVAL; > + goto leave; > + } > + > + kern_buf[size] = '\0'; > + > + next = cfg = kern_buf; > + > + do { > + next = strchr(cfg, ';'); > + if (next) > + *next = '\0'; > + > + argv = argv_split(GFP_KERNEL, cfg, &argc); > + if (!argv) { > + dcount = -ENOMEM; > + goto leave; > + } > + > + if (fs_parse_trace_request(argc, argv)) { > + dcount = -EINVAL; > + argv_free(argv); > + goto leave; > + } > + > + argv_free(argv); > + if (next) > + cfg = ++next; > + > + } while (next); > + dcount += size; > + } > +leave: > + kfree(kern_buf); > + return dcount; > +} > + > +static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos) > +{ > + mutex_lock(&fs_trace_lock); > + return seq_list_start(&fs_trace_list, *pos); > +} > + > +static void *fs_trace_seq_next(struct seq_file *m, void *v, loff_t *pos) > +{ > + return seq_list_next(v, &fs_trace_list, pos); > 
+} > + > +static void fs_trace_seq_stop(struct seq_file *m, void *v) > +{ > + mutex_unlock(&fs_trace_lock); > +} > + > +static int fs_trace_seq_show(struct seq_file *m, void *v) > +{ > + struct fs_trace_entry *en; > + struct super_block *sb; > + struct mount *r_mnt; > + const struct match_token *match; > + unsigned int nmask; > + > + en = list_entry(v, struct fs_trace_entry, node); > + /* Do not show the entries outside current mount namespace */ > + r_mnt = real_mount(en->mnt_path.mnt); > + if (r_mnt->mnt_ns != current->nsproxy->mnt_ns) { > + if (!__is_local_mountpoint(r_mnt->mnt_mountpoint)) > + return 0; > + } > + > + sb = en->sb; > + > + seq_path(m, &en->mnt_path, "\t\n\\"); > + seq_putc(m, ' '); > + > + seq_escape(m, sb->s_type->name, " \t\n\\"); > + if (sb->s_subtype && sb->s_subtype[0]) { > + seq_putc(m, '.'); > + seq_escape(m, sb->s_subtype, " \t\n\\"); > + } > + > + seq_putc(m, ' '); > + if (sb->s_op->show_devname) { > + sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root); > + } else { > + seq_escape(m, r_mnt->mnt_devname ? 
r_mnt->mnt_devname : "none", > + " \t\n\\"); > + } > + seq_puts(m, " ("); > + > + nmask = en->notify; > + for (match = fs_etypes; match->pattern; ++match) { > + if (match->token & nmask) { > + seq_puts(m, match->pattern); > + nmask &= ~match->token; > + if (nmask) > + seq_putc(m, ','); > + } > + } > + seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); > + seq_puts(m, ")\n"); > + return 0; > +} > + > +static const struct seq_operations fs_trace_seq_ops = { > + .start = fs_trace_seq_start, > + .next = fs_trace_seq_next, > + .stop = fs_trace_seq_stop, > + .show = fs_trace_seq_show, > +}; > + > +static int fs_trace_open(struct inode *inode, struct file *file) > +{ > + return seq_open(file, &fs_trace_seq_ops); > +} > + > +static const struct file_operations fs_trace_fops = { > + .owner = THIS_MODULE, > + .open = fs_trace_open, > + .write = fs_trace_write, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = seq_release, > +}; > + > +static int fs_trace_init(void) > +{ > + fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0); > + if (!fs_trace_cachep) > + return -EINVAL; > + init_waitqueue_head(&trace_wq); > + return 0; > +} > + > +/* VFS support */ > +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen) > +{ > + int ret; > + static struct tree_descr desc[] = { > + [2] = { > + .name = "config", > + .ops = &fs_trace_fops, > + .mode = S_IWUSR | S_IRUGO, > + }, > + {""}, > + }; > + > + ret = simple_fill_super(sb, 0x7246332, desc); Please use a define for a magic number. > + return !ret ? 
fs_trace_init() : ret; > +} > + > +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, > + int ntype, const char *dev_name, void *data) > +{ > + return mount_single(fs_type, ntype, data, fs_trace_fill_super); > +} > + > +static void fs_trace_kill_super(struct super_block *sb) > +{ > + /* > + * The rcu_barrier here will/should make sure all call_rcu > + * callbacks are completed - still there might be some active > + * trace objects in use which can make calling the > + * kmem_cache_destroy unsafe. So we wait until all traces > + * are finally released. > + */ > + fs_remove_all_traces(); > + rcu_barrier(); > + wait_event(trace_wq, !atomic_read(&stray_traces)); > + > + kmem_cache_destroy(fs_trace_cachep); > + kill_litter_super(sb); > +} > + > +static struct kset *fs_trace_kset; > + > +static struct file_system_type fs_trace_fstype = { > + .name = "fstrace", > + .mount = fs_trace_do_mount, > + .kill_sb = fs_trace_kill_super, > +}; > + > +static void __init fs_trace_vfs_init(void) > +{ > + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); > + > + if (!fs_trace_kset) > + return; > + > + if (!register_filesystem(&fs_trace_fstype)) { > + if (!fs_event_netlink_register()) > + return; > + unregister_filesystem(&fs_trace_fstype); > + } > + kset_unregister(fs_trace_kset); > +} > + > +static int __init fs_trace_evens_init(void) > +{ > + fs_trace_vfs_init(); > + return 0; > +}; > +module_init(fs_trace_evens_init); > + > diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h > new file mode 100644 > index 0000000..23f24c8 > --- /dev/null > +++ b/fs/events/fs_event.h > @@ -0,0 +1,22 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. 
> + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > + > +#ifndef __GENERIC_FS_EVENTS_H > +#define __GENERIC_FS_EVENTS_H > + > +int fs_event_netlink_register(void); > +void fs_event_netlink_unregister(void); > + > +#endif /* __GENERIC_FS_EVENTS_H */ > diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c > new file mode 100644 > index 0000000..0c97eb7 > --- /dev/null > +++ b/fs/events/fs_event_netlink.c > @@ -0,0 +1,104 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. 
> + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "fs_event.h" > + > +static const struct genl_multicast_group fs_event_mcgroups[] = { > + { .name = FS_EVENTS_MCAST_GRP_NAME, }, > +}; > + > +static struct genl_family fs_event_family = { > + .id = GENL_ID_GENERATE, > + .name = FS_EVENTS_FAMILY_NAME, > + .version = 1, > + .maxattr = FS_NL_A_MAX, > + .mcgrps = fs_event_mcgroups, > + .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups), > +}; > + > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), > + void *cbdata) > +{ > + static atomic_t seq; > + struct sk_buff *skb; > + void *msg_head; > + int ret = 0; > + > + if (!size || !compose_msg) > + return -EINVAL; > + > + /* Skip if there are no listeners */ > + if (!genl_has_listeners(&fs_event_family, &init_net, 0)) > + return 0; > + > + if (event_id != FS_EVENT_NONE) > + size += nla_total_size(sizeof(u32)); > + size += nla_total_size(sizeof(u64)); > + skb = genlmsg_new(size, GFP_NOWAIT); > + > + if (!skb) { > + pr_debug("Failed to allocate new FS generic netlink message\n"); > + return -ENOMEM; > + } > + > + msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq), > + &fs_event_family, 0, FS_NL_C_EVENT); > + if (!msg_head) > + goto cleanup; > + > + if (event_id != FS_EVENT_NONE) > + if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id)) > + goto cancel; > + > + ret = compose_msg(skb, cbdata); > + if (ret) > + goto cancel; > + > + genlmsg_end(skb, msg_head); > + ret = genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT); > + if (ret && ret != -ENOBUFS && ret != -ESRCH) > + goto cleanup; > + > + return ret; > + > +cancel: > + genlmsg_cancel(skb, msg_head); > +cleanup: > + nlmsg_free(skb); > + return ret; > +} > +EXPORT_SYMBOL(fs_netlink_send_event); > + > +int fs_event_netlink_register(void) > +{ > + int ret; > + > + ret = genl_register_family(&fs_event_family); > + if (ret) > + pr_err("Failed to register FS 
netlink interface\n"); > + return ret; > +} > + > +void fs_event_netlink_unregister(void) > +{ > + genl_unregister_family(&fs_event_family); > +} > diff --git a/fs/namespace.c b/fs/namespace.c > index 82ef140..ec6e2ef 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt) > if (unlikely(mnt->mnt_pins.first)) > mnt_pin_kill(mnt); > fsnotify_vfsmount_delete(&mnt->mnt); > + fs_event_mount_dropped(&mnt->mnt); > dput(mnt->mnt.mnt_root); > deactivate_super(mnt->mnt.mnt_sb); > mnt_free_id(mnt); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index b4d71b5..b7dadd9 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -263,6 +263,10 @@ struct iattr { > * Includes for diskquotas. > */ > #include > +/* > + * Include for Generic File System Events Interface > + */ > +#include > > /* > * Maximum number of layers of fs stack. Needs to be limited to > @@ -1253,7 +1257,7 @@ struct super_block { > struct hlist_node s_instances; > unsigned int s_quota_types; /* Bitmask of supported quota types */ > struct quota_info s_dquot; /* Diskquota specific options */ > - > + struct fs_trace_info s_etrace; > struct sb_writers s_writers; > > char s_id[32]; /* Informational name */ > diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h > new file mode 100644 > index 0000000..83e22dd > --- /dev/null > +++ b/include/linux/fs_event.h > @@ -0,0 +1,72 @@ > +/* > + * Generic File System Events Interface > + * > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. 
> + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > +#ifndef _LINUX_GENERIC_FS_EVETS_ > +#define _LINUX_GENERIC_FS_EVETS_ EVETS? Also the define name usually corresponds to the header filename. > +#include > +#include > + > +/* > + * Currently supported event types > + */ > +#define FS_EVENT_GENERIC 0x001 > +#define FS_EVENT_THRESH 0x002 > + > +#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH) > + > +struct fs_trace_operations { > + void (*query)(struct super_block *, u64 *); > +}; > + > +struct fs_trace_info { > + void __rcu *e_priv; /* READ ONLY */ > + unsigned int events_cap_mask; /* Supported notifications */ > + const struct fs_trace_operations *ops; > +}; > + > +#ifdef CONFIG_FS_EVENTS > + > +void fs_event_notify(struct super_block *sb, unsigned int event_id); > +void fs_event_alloc_space(struct super_block *sb, u64 ncount); > +void fs_event_free_space(struct super_block *sb, u64 ncount); > +void fs_event_mount_dropped(struct vfsmount *mnt); > + > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), > + void *cbdata); > + > +#else /* CONFIG_FS_EVENTS */ > + > +static inline > +void fs_event_notify(struct super_block *sb, unsigned int event_id) {}; > +static inline > +void fs_event_alloc_space(struct super_block *sb, u64 ncount) {}; > +static inline > +void fs_event_free_space(struct super_block *sb, u64 ncount) {}; > +static inline > +void fs_event_mount_dropped(struct vfsmount *mnt) {}; > + > +static inline > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msig)(struct sk_buff *skb, void *data), > + void *cbdata) > +{ > + return -ENOSYS; > +} > +#endif /* CONFIG_FS_EVENTS */ > + > +#endif /* _LINUX_GENERIC_FS_EVENTS_ */ > + > diff 
--git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild > index 68ceb97..dae0fab 100644 > --- a/include/uapi/linux/Kbuild > +++ b/include/uapi/linux/Kbuild > @@ -129,6 +129,7 @@ header-y += firewire-constants.h > header-y += flat.h > header-y += fou.h > header-y += fs.h > +header-y += fs_event.h > header-y += fsl_hypervisor.h > header-y += fuse.h > header-y += futex.h > diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h > new file mode 100644 > index 0000000..d8b07da > --- /dev/null > +++ b/include/uapi/linux/fs_event.h > @@ -0,0 +1,58 @@ > +/* > + * Generic netlink support for Generic File System Events Interface > + * > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > +#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_ > +#define _UAPI_LINUX_GENERIC_FS_EVENTS_ Define name usually corresponds to the header filename. 
> +#define FS_EVENTS_FAMILY_NAME "fs_event" > +#define FS_EVENTS_MCAST_GRP_NAME "fs_event_mc_grp" > + > +/* > + * Generic netlink attribute types > + */ > +enum { > + FS_NL_A_NONE, > + FS_NL_A_EVENT_ID, > + FS_NL_A_DEV_MAJOR, > + FS_NL_A_DEV_MINOR, > + FS_NL_A_CAUSED_ID, > + FS_NL_A_DATA, > + __FS_NL_A_MAX, > +}; > +#define FS_NL_A_MAX (__FS_NL_A_MAX - 1) > +/* > + * Generic netlink commands > + */ > +#define FS_NL_C_EVENT 1 > + > +/* > + * Supported set of FS events > + */ > +enum { > + FS_EVENT_NONE, > + FS_WARN_ENOSPC, /* No space left to reserve data blks */ > + FS_WARN_ENOSPC_META, /* No space left for metadata */ > + FS_THR_LRBELOW, /* The threshold lower range has been reached */ > + FS_THR_LRABOVE, /* The threshold lower range re-activcated*/ > + FS_THR_URBELOW, > + FS_THR_URABOVE, > + FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */ > + FS_ERR_CORRUPTED /* Critical error - fs corrupted */ > + > +}; > + > +#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */ > + Best regards, -- Bartlomiej Zolnierkiewicz Samsung R&D Institute Poland Samsung Electronics
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Date: Thu, 30 Jul 2015 10:22:33 +0200 Message-ID: <55B9DEC9.8020506@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <6913836.Rhse3j9PM4@amdc1976> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Return-path: In-reply-to: <6913836.Rhse3j9PM4@amdc1976> Sender: owner-linux-mm@kvack.org To: Bartlomiej Zolnierkiewicz Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org List-Id: linux-api@vger.kernel.org On 07/22/2015 05:55 PM, Bartlomiej Zolnierkiewicz wrote: > > Hi, > > Some comments below. > > On Tuesday, June 16, 2015 03:09:30 PM Beata Michalska wrote: >> Introduce configurable generic interface for file >> system-wide event notifications, to provide file >> systems with a common way of reporting any potential >> issues as they emerge. >> >> The notifications are to be issued through generic >> netlink interface by newly introduced multicast group. >> >> Threshold notifications have been included, allowing >> triggering an event whenever the amount of free space drops >> below a certain level - or levels to be more precise as two >> of them are being supported: the lower and the upper range. >> The notifications work both ways: once the threshold level >> has been reached, an event shall be generated whenever >> the number of available blocks goes up again re-activating >> the threshold. >> >> The interface has been exposed through a vfs. 
Once mounted, >> it serves as an entry point for the set-up where one can >> register for particular file system events. >> >> Signed-off-by: Beata Michalska >> --- >> Documentation/filesystems/events.txt | 232 ++++++++++ >> fs/Kconfig | 2 + >> fs/Makefile | 1 + >> fs/events/Kconfig | 7 + >> fs/events/Makefile | 5 + >> fs/events/fs_event.c | 809 ++++++++++++++++++++++++++++++++++ >> fs/events/fs_event.h | 22 + >> fs/events/fs_event_netlink.c | 104 +++++ >> fs/namespace.c | 1 + >> include/linux/fs.h | 6 +- >> include/linux/fs_event.h | 72 +++ >> include/uapi/linux/Kbuild | 1 + >> include/uapi/linux/fs_event.h | 58 +++ >> 13 files changed, 1319 insertions(+), 1 deletion(-) >> create mode 100644 Documentation/filesystems/events.txt >> create mode 100644 fs/events/Kconfig >> create mode 100644 fs/events/Makefile >> create mode 100644 fs/events/fs_event.c >> create mode 100644 fs/events/fs_event.h >> create mode 100644 fs/events/fs_event_netlink.c >> create mode 100644 include/linux/fs_event.h >> create mode 100644 include/uapi/linux/fs_event.h >> >> diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt >> new file mode 100644 >> index 0000000..c2e6227 >> --- /dev/null >> +++ b/Documentation/filesystems/events.txt >> @@ -0,0 +1,232 @@ >> + >> + Generic file system event notification interface >> + >> +Document created 23 April 2015 by Beata Michalska >> + >> +1. The reason behind: >> +===================== >> + >> +There are many corner cases when things might get messy with the filesystems. >> +And it is not always obvious what and when went wrong. Sometimes you might >> +get some subtle hints that there is something going on - but by the time >> +you realise it, it might be too late as you are already out-of-space >> +or the filesystem has been remounted as read-only (i.e.). 
The generic >> +interface for the filesystem events fills the gap by providing a rather >> +easy way of real-time notifications triggered whenever something interesting >> +happens, allowing filesystems to report events in a common way, as they occur. >> + >> +2. How does it work: >> +==================== >> + >> +The interface itself has been exposed as fstrace-type Virtual File System, >> +primarily to ease the process of setting up the configuration for the >> +notifications. So for starters, it needs to get mounted (obviously): >> + >> + mount -t fstrace none /sys/fs/events >> + >> +This will unveil the single fstrace filesystem entry - the 'config' file, >> +through which the notification are being set-up. > > The patch creates a separate virtual filesystem for a single file; > this is overkill IMHO and a new sysfs or debugfs entry should > be sufficient. > >> + >> +Activating notifications for particular filesystem is as straightforward >> +as writing into the 'config' file. Note that by default all events, despite >> +the actual filesystem type, are being disregarded. >> + >> +Synopsis of config: >> +------------------ >> + >> + MOUNT EVENT_TYPE [L1] [L2] > > OTOH, why not use the advantages of having a separate virtual > filesystem and create separate directories for each mount point > (+ maybe even extra parent directories for mount namespaces) and > put separate entries for each event type in these directories? > > This would also allow use of the eventfd() notification interface > on such files. > > Please take a look at: > > tools/cgroup/cgroup_event_listener.c > > and > > Documentation/cgroups/memcg_test.txt (point 9.10) > > to see how much easier it is to observe memory usage thresholds > on memory cgroups compared to available blocks on filesystems > using fs events. 
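For readers unfamiliar with the eventfd() mechanism referred to above, here is a minimal user-space sketch of its counter semantics. It is a generic illustration only, not code from this patch set or from cgroup_event_listener.c. The helper names efd_signal() and efd_collect() are made up for the example, and how a per-event file would be bound to the descriptor is left out, since no such interface exists in the posted code.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: signal one event on an eventfd descriptor.
 * Returns 0 on success, -1 on error.
 */
static int efd_signal(int efd)
{
    uint64_t one = 1;

    /* An eventfd write is always 8 bytes; the value is added to the counter. */
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Hypothetical helper: collect the number of events signalled since the
 * previous read. Returns 0 on error, or when nothing is pending on a
 * descriptor opened with EFD_NONBLOCK.
 */
static uint64_t efd_collect(int efd)
{
    uint64_t count = 0;

    /* In the default (non-semaphore) mode a read resets the counter to 0. */
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

A listener would normally poll() or select() on the descriptor and call something like efd_collect() once it becomes readable.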
> I'll give it some thought, as the solution you are proposing eliminates some issues related to the generic netlink (mostly the one concerning the network namespaces), though I'd rather avoid creating numerous entries for each mount/mount namespace. I guess the best option is to meet halfway. > Also while at it please add your example user-space code (posted > on request in some other mail) to tools/fs_events/ (preferably > in a separate patch). > Will do, once there is an overall agreement on the form of the events interface. >> + >> + MOUNT : the filesystem's mount point >> + EVENT_TYPE : event types - currently two of them are being supported: >> + >> + * generic events ("G") covering most common warnings >> + and errors that might be reported by any filesystem; >> + this option does not take any arguments; > > fs_event.h in uapi dir allows the following events: > > /* > * Supported set of FS events > */ > enum { > FS_EVENT_NONE, > FS_WARN_ENOSPC, /* No space left to reserve data blks */ > FS_WARN_ENOSPC_META, /* No space left for metadata */ > FS_THR_LRBELOW, /* The threshold lower range has been reached */ > FS_THR_LRABOVE, /* The threshold lower range re-activcated*/ > FS_THR_URBELOW, > FS_THR_URABOVE, > FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */ > FS_ERR_CORRUPTED /* Critical error - fs corrupted */ > > }; > > For non-threshold related events the current interface allows > only all or none of the events to be enabled, i.e. > you cannot selectively enable notification on FS_WARN_ENOSPC > but not on FS_ERR_REMOUNT_RO. > > I also think that the configuration interface should be made to > match the notification interface when it comes to event types. > Will take it into consideration - thanks. 
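To make the limitation concrete, here is a minimal sketch, assuming a hypothetical per-event bitmask in place of the single FS_EVENT_GENERIC flag. The enum is copied from the uapi header quoted above; FS_EVENT_BIT() and fs_event_selected() are invented for illustration and are not part of the posted patch.

```c
#include <stdbool.h>

/* Event codes as listed in the quoted include/uapi/linux/fs_event.h */
enum {
    FS_EVENT_NONE,
    FS_WARN_ENOSPC,
    FS_WARN_ENOSPC_META,
    FS_THR_LRBELOW,
    FS_THR_LRABOVE,
    FS_THR_URBELOW,
    FS_THR_URABOVE,
    FS_ERR_REMOUNT_RO,
    FS_ERR_CORRUPTED
};

/* Hypothetical: one mask bit per event code. */
#define FS_EVENT_BIT(id) (1u << (id))

/* Hypothetical: true if the listener's mask selects this event code. */
static bool fs_event_selected(unsigned int mask, unsigned int event_id)
{
    /* Reject FS_EVENT_NONE and anything past the last known code. */
    if (event_id == FS_EVENT_NONE || event_id > FS_ERR_CORRUPTED)
        return false;
    return (mask & FS_EVENT_BIT(event_id)) != 0;
}
```

With such a mask a listener could ask for FS_WARN_ENOSPC alone, which the all-or-nothing "G" type cannot express.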
>> + >> + * threshold notifications ("T") - events sent whenever >> + the amount of available space drops below certain level; >> + it is possible to specify two threshold levels though >> + only one is required to properly setup the notifications; >> + as those refer to the number of available blocks, the lower >> + level [L1] needs to be higher than the upper one [L2] > > Why is there a limitation of only two thresholds? > > It should be relatively easy to make the code support > unlimited number of thresholds. > >> + >> +Sample request could look like the following: >> + >> + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config >> + >> +Multiple request might be specified provided they are separated with semicolon. > > s/request/requests/ > > I think that allowing multiple event types and requests in one > configuration request is not a good idea. Currently parsing > code is relatively simple but once somebody decides to enhance > the interface with new event types the parsing code may get > complex & ugly. > Noted. >> + >> +The configuration itself might be modified at any time. One can add/remove >> +particular event types for given fielsystem, modify the threshold levels, > > s/fielsystem/filesystem/ > >> +and remove single or all entries from the 'config' file. >> + >> + - Adding new event type: >> + >> + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config >> + >> +(Note that is is enough to provide the event type to be enabled without > > s/is is/is/ > >> +the already set ones.) 
>> +
>> + - Removing event type:
>> +
>> + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config
>> +
>> + - Updating threshold limits:
>> +
>> + $ echo MOUNT T L1 L2 > /sys/fs/events/config
>> +
>> + - Removing single entry:
>> +
>> + $ echo '!MOUNT' > /sys/fs/events/config
>> +
>> + - Removing all entries:
>> +
>> + $ echo > /sys/fs/events/config
>> +
>> +Reading the file will list all registered entries with their current set-up
>> +along with some additional info like the filesystem type and the backing device
>> +name if available.
>> +
>> +Final, though a very important note on the configuration: when and if the
>> +actual events are being triggered falls way beyond the scope of the generic
>> +filesystem events interface. It is up to a particular filesystem
>> +implementation which events are to be supported - if any at all. So if
>> +given filesystem does not support the event notifications, an attempt to
>> +enable those through 'config' file will fail.
>> +
>> +
>> +3. The generic netlink interface support:
>> +=========================================
>> +
>> +Whenever an event notification is triggered (by given filesystem) the current
>> +configuration is being validated to decide whether a userpsace notification
>
> s/userpsace/userspace/
>
>> +should be launched. If there has been no request (in a mean of 'config' file
>> +entry) for given event, one will be silently disregarded. If, on the other
>> +hand, someone is 'watching' given filesystem for specific events, a generic
>> +netlink message will be sent. A dedicated multicast group has been provided
>> +solely for this purpose so in order to receive such notifications, one should
>> +subscribe to this new multicast group. As for now only the init network
>> +namespace is being supported.
>> +
>> +3.1 Message format
>> +
>> +The FS_NL_C_EVENT shall be stored within the generic netlink message header
>> +as the command field. The message payload will provide more detailed info:
>> +the backing device major and minor numbers, the event code and the id of
>> +the process which action led to the event occurrence. In case of threshold
>> +notifications, the current number of available blocks will be included
>> +in the payload as well.
>> +
>> +
>> +   0                   1                   2                   3
>> +   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>> +  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> +  |                    NETLINK MESSAGE HEADER                     |
>> +  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> +  |                GENERIC NETLINK MESSAGE HEADER                 |
>> +  |         (with FS_NL_C_EVENT as genlmsghdr cdm field)          |
>> +  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> +  |             Optional user specific message header             |
>> +  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> +  |                   GENERIC MESSAGE PAYLOAD:                    |
>> +  +---------------------------------------------------------------+
>> +  |                  FS_NL_A_EVENT_ID (NLA_U32)                   |
>> +  +---------------------------------------------------------------+
>> +  |                  FS_NL_A_DEV_MAJOR (NLA_U32)                  |
>> +  +---------------------------------------------------------------+
>> +  |                  FS_NL_A_DEV_MINOR (NLA_U32)                  |
>> +  +---------------------------------------------------------------+
>> +  |                  FS_NL_A_CAUSED_ID (NLA_U32)                  |
>> +  +---------------------------------------------------------------+
>> +  |                    FS_NL_A_DATA (NLA_U64)                     |
>> +  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> +
>> +
>> +The above figure is based on:
>> + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format
>> +
>> +
>> +4. API Reference:
>> +=================
>> +
>> + 4.1 Generic file system event interface data & operations
>> +
>> + #include
>> +
>> + struct fs_trace_info {
>> +	void __rcu *e_priv /* READ ONLY */
>
> It should be marked as private for fs events core code and
> not for use by filesystems' code. If possible it would be
> the best to move it out of this struct,
>
>> +	unsigned int events_cap_mask; /* Supported notifications */
>> +	const struct fs_trace_operations *ops;
>> + };
>> +
>> + struct fs_trace_operations {
>> +	void (*query)(struct super_block *, u64 *);
>> + };
>> +
>> + In order to get the fireworks and stuff, each filesystem needs to setup
>> + the events_cap_mask field of the fs_trace_info structure, which has been
>> + embedded within the super_block structure. This should reflect the type of
>> + events the filesystem wants to support. In case of threshold notifications,
>> + apart from setting the FS_EVENT_THRESH flag, the 'query' callback should
>> + be provided as this enables the events interface to get the up-to-date
>> + state of the number of available blocks whenever those notifications are
>> + being requested.
>> +
>> + The 'e_priv' field of the fs_trace_info structure should be completely ignored
>> + as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you
>> + do not want to get yourself into some real trouble. If still, you are tempted
>> + to do so - feel free, it's gonna be pure fun. Consider yourself warned.
>> +
>> +
>> + 4.2 Event notification:
>> +
>> + #include
>> + void fs_event_notify(struct super_block *sb, unsigned int event_id);
>> +
>> + Notify the generic FS event interface of an occurring event.
>> + This shall be used by any file system that wishes to inform any potential
>> + listeners/watchers of a particular event.
>> + - sb: the filesystem's super block
>> + - event_id: an event identifier
>> +
>> + 4.3 Threshold notifications:
>> +
>> + #include
>> + void fs_event_alloc_space(struct super_block *sb, u64 ncount);
>> + void fs_event_free_space(struct super_block *sb, u64 ncount);
>> +
>> + Each filesystme supporting the threshold notifications should call
>> + fs_event_alloc_space/fs_event_free_space respectively whenever the
>> + amount of available blocks changes.
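To make the bookkeeping concrete, a minimal userspace model of a single threshold level, assuming the behaviour described here: the count of available blocks goes down on allocation, up on free, and a notification is raised once per crossing. The names thr_model, thr_alloc and thr_free are invented for this sketch and do not appear in the patch, which tracks two levels and sends netlink messages instead of returning a value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of one accounting step in this simplified model. */
enum thr_event { THR_NONE = 0, THR_BELOW, THR_ABOVE };

/* Hypothetical single-level model; not the patch's fs_trace_entry. */
struct thr_model {
	uint64_t avail;	/* currently available blocks */
	uint64_t level;	/* report once avail drops to or below this */
	int below;	/* nonzero after the "below" event has fired */
};

/* Account for n blocks being allocated. */
enum thr_event thr_alloc(struct thr_model *m, uint64_t n)
{
	assert(m != NULL);

	/* Clamp at zero rather than wrapping around. */
	m->avail = (n > m->avail) ? 0 : m->avail - n;

	if (m->avail <= m->level && !m->below) {
		m->below = 1;
		return THR_BELOW;
	}
	return THR_NONE;
}

/* Account for n blocks being released. */
enum thr_event thr_free(struct thr_model *m, uint64_t n)
{
	assert(m != NULL);

	m->avail += n;

	if (m->avail > m->level && m->below) {
		m->below = 0;
		return THR_ABOVE;
	}
	return THR_NONE;
}
```

The 'below' flag is what keeps the model from reporting the same crossing on every call while the count stays under the level; the patch uses a set of state bits for the same purpose.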
>> + - sb: the filesystem's super block
>> + - ncount: number of blocks being acquired/released
>> +
>> + Note that to properly handle the threshold notifications the fs events
>> + interface needs to be kept up to date by the filesystems. Each should
>> + register fs_trace_operations to enable querying the current number of
>> + available blocks.
>> +
>> + 4.4 Sending message through generic netlink interface
>> +
>> + #include
>> +
>> + int fs_netlink_send_event(size_t size, unsigned int event_id,
>> +	int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata);
>> +
>> + Although the fs event interface is fully responsible for sending the messages
>> + over the netlink, filesystems might use the FS_EVENT multicast group to send
>> + their own custom messages.
>> + - size: the size of the message payload
>> + - event_id: the event identifier
>> + - compose_msg: a callback responsible for filling-in the message payload
>> + - cbdata: message custom data
>> +
>> + Calling fs_netlink_send_event will result in a message being sent by
>> + the FS_EVENT multicast group. Note that the body of the message should be
>> + prepared (set-up )by the caller - through compose_msg callback. The message's
>
> (set-up)
>
>> + sk_buff will be allocated on behalf of the caller (thus the size parameter).
>> + The compose_msg should only fill the payload with proper data. Unless
>> + the event id is specified as FS_EVENT_NONE, it's value shall be added
>> + to the payload prior to calling the compose_msg.
>> +
>> +
>> diff --git a/fs/Kconfig b/fs/Kconfig
>> index ec35851..a89e678 100644
>> --- a/fs/Kconfig
>> +++ b/fs/Kconfig
>> @@ -69,6 +69,8 @@ config FILE_LOCKING
>>  	  for filesystems like NFS and for the flock() system
>>  	  call. Disabling this option saves about 11k.
>>
>> +source "fs/events/Kconfig"
>> +
>>  source "fs/notify/Kconfig"
>>
>>  source "fs/quota/Kconfig"
>> diff --git a/fs/Makefile b/fs/Makefile
>> index a88ac48..bcb3048 100644
>> --- a/fs/Makefile
>> +++ b/fs/Makefile
>> @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
>>  obj-$(CONFIG_CEPH_FS) += ceph/
>>  obj-$(CONFIG_PSTORE) += pstore/
>>  obj-$(CONFIG_EFIVAR_FS) += efivarfs/
>> +obj-$(CONFIG_FS_EVENTS) += events/
>> diff --git a/fs/events/Kconfig b/fs/events/Kconfig
>> new file mode 100644
>> index 0000000..1c60195
>> --- /dev/null
>> +++ b/fs/events/Kconfig
>> @@ -0,0 +1,7 @@
>> +# Generic Files System events interface
>> +config FS_EVENTS
>> +	bool "Generic filesystem events"
>> +	select NET
>> +	default y
>
> Do we really want to default to yes?
>
> [ If so then maybe we want to make the config option visible
> only when EXPERT mode is enabled? ]
>
>> +	help
>> +	  Enable generic filesystem events interface
>
> Please enhance the help entry.
>
>> diff --git a/fs/events/Makefile b/fs/events/Makefile
>> new file mode 100644
>> index 0000000..9c98337
>> --- /dev/null
>> +++ b/fs/events/Makefile
>> @@ -0,0 +1,5 @@
>> +#
>> +# Makefile for the Linux Generic File System Event Interface
>> +#
>> +
>> +obj-y := fs_event.o fs_event_netlink.o
>> diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c
>> new file mode 100644
>> index 0000000..1037311
>> --- /dev/null
>> +++ b/fs/events/fs_event.c
>> @@ -0,0 +1,809 @@
>> +/*
>> + * Generic File System Evens Interface
>> + *
>> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2.
>> + *
>> + * The full GNU General Public License is included in this distribution in the
>> + * file called COPYING.
>> + *
>> + * This program is distributed in the hope that it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
>> + * more details.
>> + */
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include "../pnode.h"
>> +#include "fs_event.h"
>> +
>> +static LIST_HEAD(fs_trace_list);
>> +static DEFINE_MUTEX(fs_trace_lock);
>> +
>> +static struct kmem_cache *fs_trace_cachep __read_mostly;
>> +
>> +static atomic_t stray_traces = ATOMIC_INIT(0);
>> +static DECLARE_WAIT_QUEUE_HEAD(trace_wq);
>> +/*
>> + * Threshold notification state bits.
>> + * Note the reverse as this refers to the number
>> + * of available blocks.
>> + */
>> +#define THRESH_LR_BELOW 0x0001 /* Falling below the lower range */
>> +#define THRESH_LR_BEYOND 0x0002
>> +#define THRESH_UR_BELOW 0x0004
>> +#define THRESH_UR_BEYOND 0x0008 /* Going beyond the upper range */
>> +
>> +#define THRESH_LR_ON (THRESH_LR_BELOW | THRESH_LR_BEYOND)
>> +#define THRESH_UR_ON (THRESH_UR_BELOW | THRESH_UR_BEYOND)
>> +
>> +#define FS_TRACE_ADD 0x100000
>> +
>> +struct fs_trace_entry {
>> +	struct kref count;
>> +	atomic_t active;
>> +	struct super_block *sb;
>> +	unsigned int notify;
>> +	struct path mnt_path;
>> +	struct list_head node;
>> +
>> +	struct fs_event_thresh {
>> +		u64 avail_space;
>> +		u64 lrange;
>> +		u64 urange;
>> +		unsigned int state;
>> +	} th;
>> +	struct rcu_head rcu_head;
>> +	spinlock_t lock;
>> +};
>> +
>> +static const match_table_t fs_etypes = {
>> +	{ FS_EVENT_GENERIC, "G" },
>> +	{ FS_EVENT_THRESH, "T" },
>> +	{ 0, NULL },
>> +};
>> +
>> +static inline int fs_trace_query_data(struct super_block *sb,
>> +				      struct fs_trace_entry *en)
>> +{
>> +	if (sb->s_etrace.ops && sb->s_etrace.ops->query) {
>> +		sb->s_etrace.ops->query(sb, &en->th.avail_space);
>> +		return 0;
>> +	}
>> +
>> +	return -EINVAL;
>> +}
>> +
>> +static inline void fs_trace_entry_free(struct fs_trace_entry *en)
>
> I don't see a real need for this wrapper (it is used only once).
>
>> +{
>> +	kmem_cache_free(fs_trace_cachep, en);
>> +}
>> +
>> +static void fs_destroy_trace_entry(struct kref *en_ref)
>> +{
>> +	struct fs_trace_entry *en = container_of(en_ref,
>> +				struct fs_trace_entry, count);
>> +
>> +	/* Last reference has been dropped */
>> +	fs_trace_entry_free(en);
>> +	atomic_dec(&stray_traces);
>> +}
>> +
>> +static void fs_trace_entry_put(struct fs_trace_entry *en)
>> +{
>> +	kref_put(&en->count, fs_destroy_trace_entry);
>> +}
>> +
>> +static void fs_release_trace_entry(struct rcu_head *rcu_head)
>> +{
>> +	struct fs_trace_entry *en = container_of(rcu_head,
>> +						 struct fs_trace_entry,
>> +						 rcu_head);
>> +	/*
>> +	 * As opposed to typical reference drop, this one is being
>> +	 * called from the rcu callback. This is to make sure all
>> +	 * readers have managed to safely grab the reference before
>> +	 * the change to rcu pointer is visible to all and before
>> +	 * the reference is dropped here.
>> +	 */
>> +	fs_trace_entry_put(en);
>> +}
>> +
>> +static void fs_drop_trace_entry(struct fs_trace_entry *en)
>> +{
>> +	struct super_block *sb;
>> +
>> +	lockdep_assert_held(&fs_trace_lock);
>> +	/*
>> +	 * The trace entry might have already been removed
>> +	 * from the list of active traces with the proper
>> +	 * ref drop, though it was still in use handling
>> +	 * one of the fs events. This means that the object
>> +	 * has been already scheduled for being released.
>> +	 * So leave...
>> +	 */
>> +
>> +	if (!atomic_add_unless(&en->active, -1, 0))
>> +		return;
>> +	/*
>> +	 * At this point the trace entry is being marked as inactive
>> +	 * so no new references will be allowed.
>> +	 * Still it might be floating around somewhere
>> +	 * so drop the reference when the rcu readers are done.
>> +	 */
>> +	spin_lock(&en->lock);
>> +	list_del(&en->node);
>> +	sb = en->sb;
>> +	en->sb = NULL;
>> +	spin_unlock(&en->lock);
>> +
>> +	rcu_assign_pointer(sb->s_etrace.e_priv, NULL);
>> +	call_rcu(&en->rcu_head, fs_release_trace_entry);
>> +	/* It's safe now to drop the reference to the super */
>> +	deactivate_super(sb);
>> +	atomic_inc(&stray_traces);
>> +}
>> +
>> +static inline
>> +struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en)
>> +{
>> +	if (en) {
>> +		if (!kref_get_unless_zero(&en->count))
>> +			return NULL;
>> +		/* Don't allow referencing inactive object */
>> +		if (!atomic_read(&en->active)) {
>> +			fs_trace_entry_put(en);
>> +			return NULL;
>> +		}
>> +	}
>> +	return en;
>> +}
>> +
>> +static struct fs_trace_entry *fs_trace_entry_get_rcu(struct super_block *sb)
>> +{
>> +	struct fs_trace_entry *en;
>> +
>> +	if (!sb)
>> +		return NULL;
>> +
>> +	rcu_read_lock();
>> +	en = rcu_dereference(sb->s_etrace.e_priv);
>> +	en = fs_trace_entry_get(en);
>> +	rcu_read_unlock();
>> +
>> +	return en;
>> +}
>> +
>> +static int fs_remove_trace_entry(struct super_block *sb)
>> +{
>> +	struct fs_trace_entry *en;
>> +
>> +	en = fs_trace_entry_get_rcu(sb);
>> +	if (!en)
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&fs_trace_lock);
>> +	fs_drop_trace_entry(en);
>> +	mutex_unlock(&fs_trace_lock);
>> +	fs_trace_entry_put(en);
>> +	return 0;
>> +}
>> +
>> +static void fs_remove_all_traces(void)
>> +{
>> +	struct fs_trace_entry *en, *guard;
>> +
>> +	mutex_lock(&fs_trace_lock);
>> +	list_for_each_entry_safe(en, guard, &fs_trace_list, node)
>> +		fs_drop_trace_entry(en);
>> +	mutex_unlock(&fs_trace_lock);
>> +}
>> +
>> +static int create_common_msg(struct sk_buff *skb, void *data)
>> +{
>> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
>> +	struct super_block *sb = en->sb;
>> +
>> +	if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
>> +	||  nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
>> +		return -EINVAL;
>> +
>> +	if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current))))
>> +		return -EINVAL;
>> +
>> +	return 0;
>> +}
>> +
>> +static int create_thresh_msg(struct sk_buff *skb, void *data)
>> +{
>> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
>> +	int ret;
>> +
>> +	ret = create_common_msg(skb, data);
>> +	if (!ret)
>> +		ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space);
>> +	return ret;
>> +}
>> +
>> +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id)
>> +{
>> +	size_t size = nla_total_size(sizeof(u32)) * 2 +
>> +		      nla_total_size(sizeof(u64));
>> +
>> +	fs_netlink_send_event(size, event_id, create_common_msg, en);
>> +}
>> +
>> +static void fs_event_send_thresh(struct fs_trace_entry *en,
>> +				  unsigned int event_id)
>> +{
>> +	size_t size = nla_total_size(sizeof(u32)) * 2 +
>> +		      nla_total_size(sizeof(u64)) * 2;
>> +
>> +	fs_netlink_send_event(size, event_id, create_thresh_msg, en);
>> +}
>> +
>> +void fs_event_notify(struct super_block *sb, unsigned int event_id)
>> +{
>> +	struct fs_trace_entry *en;
>> +
>> +	en = fs_trace_entry_get_rcu(sb);
>> +	if (!en)
>> +		return;
>> +
>> +	spin_lock(&en->lock);
>> +	if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC))
>> +		fs_event_send(en, event_id);
>> +	spin_unlock(&en->lock);
>> +	fs_trace_entry_put(en);
>> +}
>> +EXPORT_SYMBOL(fs_event_notify);
>> +
>> +void fs_event_alloc_space(struct super_block *sb, u64 ncount)
>> +{
>> +	struct fs_trace_entry *en;
>> +	s64 count;
>> +
>> +	en = fs_trace_entry_get_rcu(sb);
>> +	if (!en)
>> +		return;
>> +
>> +	spin_lock(&en->lock);
>> +
>> +	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
>> +		goto leave;
>> +	/*
>> +	 * we shouldn't drop below 0 here,
>> +	 * unless there is a sync issue somewhere (?)
>> +	 */
>> +	count = en->th.avail_space - ncount;
>> +	en->th.avail_space = count < 0 ? 0 : count;
>> +
>> +	if (en->th.avail_space > en->th.lrange)
>> +		/* Not 'even' close - leave */
>> +		goto leave;
>> +
>> +	if (en->th.avail_space > en->th.urange) {
>> +		/* Close enough - the lower range has been reached */
>> +		if (!(en->th.state & THRESH_LR_BEYOND)) {
>> +			/* Send notification */
>> +			fs_event_send_thresh(en, FS_THR_LRBELOW);
>> +			en->th.state &= ~THRESH_LR_BELOW;
>> +			en->th.state |= THRESH_LR_BEYOND;
>> +		}
>> +		goto leave;
>> +	}
>> +	if (!(en->th.state & THRESH_UR_BEYOND)) {
>> +		fs_event_send_thresh(en, FS_THR_URBELOW);
>> +		en->th.state &= ~THRESH_UR_BELOW;
>> +		en->th.state |= THRESH_UR_BEYOND;
>> +	}
>> +
>> +leave:
>> +	spin_unlock(&en->lock);
>> +	fs_trace_entry_put(en);
>> +}
>> +EXPORT_SYMBOL(fs_event_alloc_space);
>> +
>> +void fs_event_free_space(struct super_block *sb, u64 ncount)
>> +{
>> +	struct fs_trace_entry *en;
>> +
>> +	en = fs_trace_entry_get_rcu(sb);
>> +	if (!en)
>> +		return;
>> +
>> +	spin_lock(&en->lock);
>> +
>> +	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
>> +		goto leave;
>> +
>> +	en->th.avail_space += ncount;
>> +
>> +	if (en->th.avail_space > en->th.lrange) {
>> +		if (!(en->th.state & THRESH_LR_BELOW)
>> +		&& en->th.state & THRESH_LR_BEYOND) {
>> +			/* Send notification */
>> +			fs_event_send_thresh(en, FS_THR_LRABOVE);
>> +			en->th.state &= ~(THRESH_LR_BEYOND|THRESH_UR_BEYOND);
>> +			en->th.state |= THRESH_LR_BELOW;
>> +			goto leave;
>> +		}
>> +	}
>> +	if (en->th.avail_space > en->th.urange) {
>> +		if (!(en->th.state & THRESH_UR_BELOW)
>> +		&& en->th.state & THRESH_UR_BEYOND) {
>> +			/* Notify */
>> +			fs_event_send_thresh(en, FS_THR_URABOVE);
>> +			en->th.state &= ~THRESH_UR_BEYOND;
>> +			en->th.state |= THRESH_UR_BELOW;
>> +		}
>> +	}
>> +leave:
>> +	spin_unlock(&en->lock);
>> +	fs_trace_entry_put(en);
>> +}
>> +EXPORT_SYMBOL(fs_event_free_space);
>> +
>> +void fs_event_mount_dropped(struct vfsmount *mnt)
>> +{
>> +	/*
>> +	 * The mount is dropped but the super might not get released
>> +	 * at once so there is very small chance some notifications
>> +	 * will come through.
>> +	 * Note that the mount being dropped here might belong to a different
>> +	 * namespace - if this is the case, just ignore it.
>> +	 */
>> +	struct fs_trace_entry *en = fs_trace_entry_get_rcu(mnt->mnt_sb);
>> +	struct vfsmount *en_mnt;
>> +
>> +	if (!en || !atomic_read(&en->active))
>> +		return;
>> +	/*
>> +	 * The entry once set, does not change the mountpoint it's being
>> +	 * pinned to, so no need to take the lock here.
>> +	 */
>> +	en_mnt = en->mnt_path.mnt;
>> +	if (!(real_mount(mnt)->mnt_ns != (real_mount(en_mnt))->mnt_ns))
>> +		fs_remove_trace_entry(mnt->mnt_sb);
>> +	fs_trace_entry_put(en);
>> +}
>> +
>> +static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh,
>> +			      unsigned int nmask)
>> +{
>> +	struct fs_trace_entry *en;
>> +	struct super_block *sb;
>> +	struct mount *r_mnt;
>> +
>> +	en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL);
>> +	if (unlikely(!en))
>> +		return -ENOMEM;
>> +	/*
>> +	 * Note that no reference is being taken here for the path as it would
>> +	 * make the unmount unnecessarily puzzling (due to an extra 'valid'
>> +	 * reference for the mnt).
>> +	 * This is *rather* safe as the notification on mount being dropped
>> +	 * will get called prior to releasing the super block - so right
>> +	 * in time to perform appropriate clean-up
>> +	 */
>> +	r_mnt = real_mount(path->mnt);
>> +
>> +	en->mnt_path.dentry = r_mnt->mnt.mnt_root;
>> +	en->mnt_path.mnt = &r_mnt->mnt;
>> +
>> +	sb = path->mnt->mnt_sb;
>> +	en->sb = sb;
>> +	/*
>> +	 * Increase the refcount for sb to mark it's being relied on.
>> +	 * Note that the reference to path is taken by the caller, so it
>> +	 * is safe to assume there is at least single active reference
>> +	 * to super as well.
>> +	 */
>> +	atomic_inc(&sb->s_active);
>> +
>> +	nmask &= sb->s_etrace.events_cap_mask;
>> +	if (!nmask)
>> +		goto leave;
>> +
>> +	spin_lock_init(&en->lock);
>> +	INIT_LIST_HEAD(&en->node);
>> +
>> +	en->notify = nmask;
>> +	memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state));
>> +	if (nmask & FS_EVENT_THRESH)
>> +		fs_trace_query_data(sb, en);
>> +
>> +	kref_init(&en->count);
>> +
>> +	if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) {
>> +		struct fs_trace_entry *prev_en;
>> +
>> +		prev_en = fs_trace_entry_get_rcu(sb);
>> +		if (prev_en) {
>> +			WARN_ON(prev_en);
>> +			fs_trace_entry_put(prev_en);
>> +			goto leave;
>> +		}
>> +	}
>> +	atomic_set(&en->active, 1);
>> +
>> +	mutex_lock(&fs_trace_lock);
>> +	list_add(&en->node, &fs_trace_list);
>> +	mutex_unlock(&fs_trace_lock);
>> +
>> +	rcu_assign_pointer(sb->s_etrace.e_priv, en);
>> +	synchronize_rcu();
>> +
>> +	return 0;
>> +leave:
>> +	deactivate_super(sb);
>> +	kmem_cache_free(fs_trace_cachep, en);
>> +	return -EINVAL;
>> +}
>> +
>> +static int fs_update_trace_entry(struct path *path,
>> +				 struct fs_event_thresh *thresh,
>> +				 unsigned int nmask)
>> +{
>> +	struct fs_trace_entry *en;
>> +	struct super_block *sb;
>> +	int extend = nmask & FS_TRACE_ADD;
>> +	int ret = -EINVAL;
>> +
>> +	en = fs_trace_entry_get_rcu(path->mnt->mnt_sb);
>> +	if (!en)
>> +		return (extend) ? fs_new_trace_entry(path, thresh, nmask)
>> +				: -EINVAL;
>> +
>> +	if (!atomic_read(&en->active))
>> +		return -EINVAL;
>> +
>> +	nmask &= ~FS_TRACE_ADD;
>> +
>> +	spin_lock(&en->lock);
>> +	sb = en->sb;
>> +	if (!sb || !(nmask & sb->s_etrace.events_cap_mask))
>> +		goto leave;
>> +
>> +	if (nmask & FS_EVENT_THRESH) {
>> +		if (extend) {
>> +			/* Get the current state */
>> +			if (!(en->notify & FS_EVENT_THRESH))
>> +				if (fs_trace_query_data(sb, en))
>> +					goto leave;
>> +
>> +			if (thresh->state & THRESH_LR_ON) {
>> +				en->th.lrange = thresh->lrange;
>> +				en->th.state &= ~THRESH_LR_ON;
>> +			}
>> +
>> +			if (thresh->state & THRESH_UR_ON) {
>> +				en->th.urange = thresh->urange;
>> +				en->th.state &= ~THRESH_UR_ON;
>> +			}
>> +		} else {
>> +			memset(&en->th, 0, sizeof(en->th));
>> +		}
>> +	}
>> +
>> +	if (extend)
>> +		en->notify |= nmask;
>> +	else
>> +		en->notify &= ~nmask;
>> +	ret = 0;
>> +leave:
>> +	spin_unlock(&en->lock);
>> +	fs_trace_entry_put(en);
>> +	return ret;
>> +}
>> +
>> +static int fs_parse_trace_request(int argc, char **argv)
>> +{
>> +	struct fs_event_thresh thresh = {0};
>> +	struct path path;
>> +	substring_t args[MAX_OPT_ARGS];
>> +	unsigned int nmask = FS_TRACE_ADD;
>> +	int token;
>> +	char *s;
>> +	int ret = -EINVAL;
>> +
>> +	if (!argc) {
>> +		fs_remove_all_traces();
>> +		return 0;
>> +	}
>> +
>> +	s = *(argv);
>> +	if (*s == '!') {
>> +		/* Clear the trace entry */
>> +		nmask &= ~FS_TRACE_ADD;
>> +		++s;
>> +	}
>> +
>> +	if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW))
>> +		return -EINVAL;
>> +
>> +	if (!(--argc)) {
>> +		if (!(nmask & FS_TRACE_ADD))
>> +			ret = fs_remove_trace_entry(path.mnt->mnt_sb);
>> +		goto leave;
>> +	}
>> +
>> +repeat:
>> +	args[0].to = args[0].from = NULL;
>> +	token = match_token(*(++argv), fs_etypes, args);
>> +	if (!token && !nmask)
>> +		goto leave;
>> +
>> +	nmask |= token & FS_EVENTS_ALL;
>> +	--argc;
>> +	if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) {
>> +		/*
>> +		 * Get the threshold config data:
>> +		 * lower range
>> +		 * upper range
>> +		 */
>> +		if (!argc)
>> +			goto leave;
>> +
>> +		ret = kstrtoull(*(++argv), 10, &thresh.lrange);
>> +		if (ret)
>> +			goto leave;
>> +		thresh.state |= THRESH_LR_ON;
>> +		if ((--argc)) {
>> +			ret = kstrtoull(*(++argv), 10, &thresh.urange);
>> +			if (ret)
>> +				goto leave;
>> +			thresh.state |= THRESH_UR_ON;
>> +			--argc;
>> +		}
>> +		/* The thresholds are based on number of available blocks */
>> +		if (thresh.lrange < thresh.urange) {
>> +			ret = -EINVAL;
>> +			goto leave;
>> +		}
>> +	}
>> +	if (argc)
>> +		goto repeat;
>> +
>> +	ret = fs_update_trace_entry(&path, &thresh, nmask);
>> +leave:
>> +	path_put(&path);
>> +	return ret;
>> +}
>> +
>> +#define DEFAULT_BUF_SIZE PAGE_SIZE
>> +
>> +static ssize_t fs_trace_write(struct file *file, const char __user *buffer,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	char **argv;
>> +	char *kern_buf, *next, *cfg;
>> +	size_t size, dcount = 0;
>> +	int argc;
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL);
>> +	if (!kern_buf)
>> +		return -ENOMEM;
>> +
>> +	while (dcount < count) {
>> +
>> +		size = count - dcount;
>> +		if (size >= DEFAULT_BUF_SIZE)
>> +			size = DEFAULT_BUF_SIZE - 1;
>> +		if (copy_from_user(kern_buf, buffer + dcount, size)) {
>> +			dcount = -EINVAL;
>> +			goto leave;
>> +		}
>> +
>> +		kern_buf[size] = '\0';
>> +
>> +		next = cfg = kern_buf;
>> +
>> +		do {
>> +			next = strchr(cfg, ';');
>> +			if (next)
>> +				*next = '\0';
>> +
>> +			argv = argv_split(GFP_KERNEL, cfg, &argc);
>> +			if (!argv) {
>> +				dcount = -ENOMEM;
>> +				goto leave;
>> +			}
>> +
>> +			if (fs_parse_trace_request(argc, argv)) {
>> +				dcount = -EINVAL;
>> +				argv_free(argv);
>> +				goto leave;
>> +			}
>> +
>> +			argv_free(argv);
>> +			if (next)
>> +				cfg = ++next;
>> +
>> +		} while (next);
>> +		dcount += size;
>> +	}
>> +leave:
>> +	kfree(kern_buf);
>> +	return dcount;
>> +}
>> +
>> +static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos)
>> +{
>> +	mutex_lock(&fs_trace_lock);
>> +	return seq_list_start(&fs_trace_list, *pos);
>> +}
>> +
>> +static void *fs_trace_seq_next(struct seq_file *m, void *v, loff_t *pos)
>> +{
>> +	return seq_list_next(v, &fs_trace_list, pos);
>> +}
>> +
>> +static void fs_trace_seq_stop(struct seq_file *m, void *v)
>> +{
>> +	mutex_unlock(&fs_trace_lock);
>> +}
>> +
>> +static int fs_trace_seq_show(struct seq_file *m, void *v)
>> +{
>> +	struct fs_trace_entry *en;
>> +	struct super_block *sb;
>> +	struct mount *r_mnt;
>> +	const struct match_token *match;
>> +	unsigned int nmask;
>> +
>> +	en = list_entry(v, struct fs_trace_entry, node);
>> +	/* Do not show the entries outside current mount namespace */
>> +	r_mnt = real_mount(en->mnt_path.mnt);
>> +	if (r_mnt->mnt_ns != current->nsproxy->mnt_ns) {
>> +		if (!__is_local_mountpoint(r_mnt->mnt_mountpoint))
>> +			return 0;
>> +	}
>> +
>> +	sb = en->sb;
>> +
>> +	seq_path(m, &en->mnt_path, "\t\n\\");
>> +	seq_putc(m, ' ');
>> +
>> +	seq_escape(m, sb->s_type->name, " \t\n\\");
>> +	if (sb->s_subtype && sb->s_subtype[0]) {
>> +		seq_putc(m, '.');
>> +		seq_escape(m, sb->s_subtype, " \t\n\\");
>> +	}
>> +
>> +	seq_putc(m, ' ');
>> +	if (sb->s_op->show_devname) {
>> +		sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root);
>> +	} else {
>> +		seq_escape(m, r_mnt->mnt_devname ? r_mnt->mnt_devname : "none",
>> +				" \t\n\\");
>> +	}
>> +	seq_puts(m, " (");
>> +
>> +	nmask = en->notify;
>> +	for (match = fs_etypes; match->pattern; ++match) {
>> +		if (match->token & nmask) {
>> +			seq_puts(m, match->pattern);
>> +			nmask &= ~match->token;
>> +			if (nmask)
>> +				seq_putc(m, ',');
>> +		}
>> +	}
>> +	seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange);
>> +	seq_puts(m, ")\n");
>> +	return 0;
>> +}
>> +
>> +static const struct seq_operations fs_trace_seq_ops = {
>> +	.start = fs_trace_seq_start,
>> +	.next = fs_trace_seq_next,
>> +	.stop = fs_trace_seq_stop,
>> +	.show = fs_trace_seq_show,
>> +};
>> +
>> +static int fs_trace_open(struct inode *inode, struct file *file)
>> +{
>> +	return seq_open(file, &fs_trace_seq_ops);
>> +}
>> +
>> +static const struct file_operations fs_trace_fops = {
>> +	.owner = THIS_MODULE,
>> +	.open = fs_trace_open,
>> +	.write = fs_trace_write,
>> +	.read = seq_read,
>> +	.llseek = seq_lseek,
>> +	.release = seq_release,
>> +};
>> +
>> +static int fs_trace_init(void)
>> +{
>> +	fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0);
>> +	if (!fs_trace_cachep)
>> +		return -EINVAL;
>> +	init_waitqueue_head(&trace_wq);
>> +	return 0;
>> +}
>> +
>> +/* VFS support */
>> +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen)
>> +{
>> +	int ret;
>> +	static struct tree_descr desc[] = {
>> +		[2] = {
>> +			.name = "config",
>> +			.ops = &fs_trace_fops,
>> +			.mode = S_IWUSR | S_IRUGO,
>> +		},
>> +		{""},
>> +	};
>> +
>> +	ret = simple_fill_super(sb, 0x7246332, desc);
>
> Please use a define for a magic number.
>
>> +	return !ret ? fs_trace_init() : ret;
>> +}
>> +
>> +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type,
>> +		int ntype, const char *dev_name, void *data)
>> +{
>> +	return mount_single(fs_type, ntype, data, fs_trace_fill_super);
>> +}
>> +
>> +static void fs_trace_kill_super(struct super_block *sb)
>> +{
>> +	/*
>> +	 * The rcu_barrier here will/should make sure all call_rcu
>> +	 * callbacks are completed - still there might be some active
>> +	 * trace objects in use which can make calling the
>> +	 * kmem_cache_destroy unsafe. So we wait until all traces
>> +	 * are finally released.
>> +	 */
>> +	fs_remove_all_traces();
>> +	rcu_barrier();
>> +	wait_event(trace_wq, !atomic_read(&stray_traces));
>> +
>> +	kmem_cache_destroy(fs_trace_cachep);
>> +	kill_litter_super(sb);
>> +}
>> +
>> +static struct kset *fs_trace_kset;
>> +
>> +static struct file_system_type fs_trace_fstype = {
>> +	.name = "fstrace",
>> +	.mount = fs_trace_do_mount,
>> +	.kill_sb = fs_trace_kill_super,
>> +};
>> +
>> +static void __init fs_trace_vfs_init(void)
>> +{
>> +	fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj);
>> +
>> +	if (!fs_trace_kset)
>> +		return;
>> +
>> +	if (!register_filesystem(&fs_trace_fstype)) {
>> +		if (!fs_event_netlink_register())
>> +			return;
>> +		unregister_filesystem(&fs_trace_fstype);
>> +	}
>> +	kset_unregister(fs_trace_kset);
>> +}
>> +
>> +static int __init fs_trace_evens_init(void)
>> +{
>> +	fs_trace_vfs_init();
>> +	return 0;
>> +};
>> +module_init(fs_trace_evens_init);
>> +
>> diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h
>> new file mode 100644
>> index 0000000..23f24c8
>> --- /dev/null
>> +++ b/fs/events/fs_event.h
>> @@ -0,0 +1,22 @@
>> +/*
>> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2.
>> + *
>> + * The full GNU General Public License is included in this distribution in the
>> + * file called COPYING.
>> + *
>> + * This program is distributed in the hope that it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#ifndef __GENERIC_FS_EVENTS_H
>> +#define __GENERIC_FS_EVENTS_H
>> +
>> +int fs_event_netlink_register(void);
>> +void fs_event_netlink_unregister(void);
>> +
>> +#endif /* __GENERIC_FS_EVENTS_H */
>> diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c
>> new file mode 100644
>> index 0000000..0c97eb7
>> --- /dev/null
>> +++ b/fs/events/fs_event_netlink.c
>> @@ -0,0 +1,104 @@
>> +/*
>> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2.
>> + *
>> + * The full GNU General Public License is included in this distribution in the
>> + * file called COPYING.
>> + *
>> + * This program is distributed in the hope that it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
>> + * more details.
>> + */
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include
>> +#include "fs_event.h"
>> +
>> +static const struct genl_multicast_group fs_event_mcgroups[] = {
>> +	{ .name = FS_EVENTS_MCAST_GRP_NAME, },
>> +};
>> +
>> +static struct genl_family fs_event_family = {
>> +	.id = GENL_ID_GENERATE,
>> +	.name = FS_EVENTS_FAMILY_NAME,
>> +	.version = 1,
>> +	.maxattr = FS_NL_A_MAX,
>> +	.mcgrps = fs_event_mcgroups,
>> +	.n_mcgrps = ARRAY_SIZE(fs_event_mcgroups),
>> +};
>> +
>> +int fs_netlink_send_event(size_t size, unsigned int event_id,
>> +			  int (*compose_msg)(struct sk_buff *skb, void *data),
>> +			  void *cbdata)
>> +{
>> +	static atomic_t seq;
>> +	struct sk_buff *skb;
>> +	void *msg_head;
>> +	int ret = 0;
>> +
>> +	if (!size || !compose_msg)
>> +		return -EINVAL;
>> +
>> +	/* Skip if there are no listeners */
>> +	if (!genl_has_listeners(&fs_event_family, &init_net, 0))
>> +		return 0;
>> +
>> +	if (event_id != FS_EVENT_NONE)
>> +		size += nla_total_size(sizeof(u32));
>> +	size += nla_total_size(sizeof(u64));
>> +	skb = genlmsg_new(size, GFP_NOWAIT);
>> +
>> +	if (!skb) {
>> +		pr_debug("Failed to allocate new FS generic netlink message\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq),
>> +			       &fs_event_family, 0, FS_NL_C_EVENT);
>> +	if (!msg_head)
>> +		goto cleanup;
>> +
>> +	if (event_id != FS_EVENT_NONE)
>> +		if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id))
>> +			goto cancel;
>> +
>> +	ret = compose_msg(skb, cbdata);
>> +	if (ret)
>> +		goto cancel;
>> +
>> +	genlmsg_end(skb, msg_head);
>> +	ret = genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT);
>> +	if (ret && ret != -ENOBUFS && ret != -ESRCH)
>> +		goto cleanup;
>> +
>> +	return ret;
>> +
>> +cancel:
>> +	genlmsg_cancel(skb, msg_head);
>> +cleanup:
>> +	nlmsg_free(skb);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(fs_netlink_send_event);
>> +
>> +int fs_event_netlink_register(void)
>> +{
>> +	int ret;
>> +
>> +	ret = genl_register_family(&fs_event_family);
>> +	if (ret)
>> +		pr_err("Failed to register FS netlink interface\n");
>> +	return ret;
>> +}
>> +
>> +void fs_event_netlink_unregister(void)
>> +{
>> +	genl_unregister_family(&fs_event_family);
>> +}
>> diff --git a/fs/namespace.c b/fs/namespace.c
>> index 82ef140..ec6e2ef 100644
>> --- a/fs/namespace.c
>> +++ b/fs/namespace.c
>> @@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt)
>>  	if (unlikely(mnt->mnt_pins.first))
>>  		mnt_pin_kill(mnt);
>>  	fsnotify_vfsmount_delete(&mnt->mnt);
>> +	fs_event_mount_dropped(&mnt->mnt);
>>  	dput(mnt->mnt.mnt_root);
>>  	deactivate_super(mnt->mnt.mnt_sb);
>>  	mnt_free_id(mnt);
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index b4d71b5..b7dadd9 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -263,6 +263,10 @@ struct iattr {
>>   * Includes for diskquotas.
>>   */
>>  #include
>> +/*
>> + * Include for Generic File System Events Interface
>> + */
>> +#include
>>
>>  /*
>>   * Maximum number of layers of fs stack. Needs to be limited to
>> @@ -1253,7 +1257,7 @@ struct super_block {
>>  	struct hlist_node s_instances;
>>  	unsigned int s_quota_types; /* Bitmask of supported quota types */
>>  	struct quota_info s_dquot; /* Diskquota specific options */
>> -
>> +	struct fs_trace_info s_etrace;
>>  	struct sb_writers s_writers;
>>
>>  	char s_id[32]; /* Informational name */
>> diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h
>> new file mode 100644
>> index 0000000..83e22dd
>> --- /dev/null
>> +++ b/include/linux/fs_event.h
>> @@ -0,0 +1,72 @@
>> +/*
>> + * Generic File System Events Interface
>> + *
>> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2.
>> + *
>> + * The full GNU General Public License is included in this distribution in the
>> + * file called COPYING.
>> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + */ >> +#ifndef _LINUX_GENERIC_FS_EVETS_ >> +#define _LINUX_GENERIC_FS_EVETS_ > > EVETS? > > Also the define name usually corresponds to the header filename. > >> +#include >> +#include >> + >> +/* >> + * Currently supported event types >> + */ >> +#define FS_EVENT_GENERIC 0x001 >> +#define FS_EVENT_THRESH 0x002 >> + >> +#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH) >> + >> +struct fs_trace_operations { >> + void (*query)(struct super_block *, u64 *); >> +}; >> + >> +struct fs_trace_info { >> + void __rcu *e_priv; /* READ ONLY */ >> + unsigned int events_cap_mask; /* Supported notifications */ >> + const struct fs_trace_operations *ops; >> +}; >> + >> +#ifdef CONFIG_FS_EVENTS >> + >> +void fs_event_notify(struct super_block *sb, unsigned int event_id); >> +void fs_event_alloc_space(struct super_block *sb, u64 ncount); >> +void fs_event_free_space(struct super_block *sb, u64 ncount); >> +void fs_event_mount_dropped(struct vfsmount *mnt); >> + >> +int fs_netlink_send_event(size_t size, unsigned int event_id, >> + int (*compose_msg)(struct sk_buff *skb, void *data), >> + void *cbdata); >> + >> +#else /* CONFIG_FS_EVENTS */ >> + >> +static inline >> +void fs_event_notify(struct super_block *sb, unsigned int event_id) {}; >> +static inline >> +void fs_event_alloc_space(struct super_block *sb, u64 ncount) {}; >> +static inline >> +void fs_event_free_space(struct super_block *sb, u64 ncount) {}; >> +static inline >> +void fs_event_mount_dropped(struct vfsmount *mnt) {}; >> + >> +static inline >> +int fs_netlink_send_event(size_t size, unsigned int event_id, >> + int (*compose_msig)(struct sk_buff *skb, void *data), >> + void *cbdata) >> +{ >> + return -ENOSYS; >> +} >> +#endif /* 
CONFIG_FS_EVENTS */ >> + >> +#endif /* _LINUX_GENERIC_FS_EVENTS_ */ >> + >> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild >> index 68ceb97..dae0fab 100644 >> --- a/include/uapi/linux/Kbuild >> +++ b/include/uapi/linux/Kbuild >> @@ -129,6 +129,7 @@ header-y += firewire-constants.h >> header-y += flat.h >> header-y += fou.h >> header-y += fs.h >> +header-y += fs_event.h >> header-y += fsl_hypervisor.h >> header-y += fuse.h >> header-y += futex.h >> diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h >> new file mode 100644 >> index 0000000..d8b07da >> --- /dev/null >> +++ b/include/uapi/linux/fs_event.h >> @@ -0,0 +1,58 @@ >> +/* >> + * Generic netlink support for Generic File System Events Interface >> + * >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. >> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. >> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + */ >> +#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_ >> +#define _UAPI_LINUX_GENERIC_FS_EVENTS_ > > Define name usually corresponds to the header filename. 
> >> +#define FS_EVENTS_FAMILY_NAME "fs_event" >> +#define FS_EVENTS_MCAST_GRP_NAME "fs_event_mc_grp" >> + >> +/* >> + * Generic netlink attribute types >> + */ >> +enum { >> + FS_NL_A_NONE, >> + FS_NL_A_EVENT_ID, >> + FS_NL_A_DEV_MAJOR, >> + FS_NL_A_DEV_MINOR, >> + FS_NL_A_CAUSED_ID, >> + FS_NL_A_DATA, >> + __FS_NL_A_MAX, >> +}; >> +#define FS_NL_A_MAX (__FS_NL_A_MAX - 1) >> +/* >> + * Generic netlink commands >> + */ >> +#define FS_NL_C_EVENT 1 >> + >> +/* >> + * Supported set of FS events >> + */ >> +enum { >> + FS_EVENT_NONE, >> + FS_WARN_ENOSPC, /* No space left to reserve data blks */ >> + FS_WARN_ENOSPC_META, /* No space left for metadata */ >> + FS_THR_LRBELOW, /* The threshold lower range has been reached */ >> + FS_THR_LRABOVE, /* The threshold lower range re-activated */ >> + FS_THR_URBELOW, >> + FS_THR_URABOVE, >> + FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */ >> + FS_ERR_CORRUPTED /* Critical error - fs corrupted */ >> + >> +}; >> + >> +#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */ >> + > > Best regards, > -- > Bartlomiej Zolnierkiewicz > Samsung R&D Institute Poland > Samsung Electronics > > Thanks for Your comments. Best Regards Beata
From mboxrd@z Thu Jan 1 00:00:00 1970 From: Beata Michalska Subject: [RFC v3 3/4] ext4: Add support for generic FS events Date: Tue, 16 Jun 2015 15:09:32 +0200 Message-id: <1434460173-18427-4-git-send-email-b.michalska@samsung.com> In-reply-to: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> Sender: owner-linux-mm@kvack.org To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org Add support for generic FS events including threshold notifications, ENOSPC and remount as read-only warnings, along with generic internal warnings/errors.
Signed-off-by: Beata Michalska --- fs/ext4/balloc.c | 10 ++++++++-- fs/ext4/ext4.h | 1 + fs/ext4/inode.c | 2 +- fs/ext4/mballoc.c | 6 +++++- fs/ext4/resize.c | 1 + fs/ext4/super.c | 39 +++++++++++++++++++++++++++++++++++++++ 6 files changed, 55 insertions(+), 4 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index e95b27a..a48450f 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -569,6 +569,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi, { if (ext4_has_free_clusters(sbi, nclusters, flags)) { percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters); + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, nclusters)); return 0; } else return -ENOSPC; @@ -590,9 +591,10 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries) { if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) || (*retries)++ > 3 || - !EXT4_SB(sb)->s_journal) + !EXT4_SB(sb)->s_journal) { + fs_event_notify(sb, FS_WARN_ENOSPC); return 0; - + } jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id); return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal); @@ -637,6 +639,10 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode, dquot_alloc_block_nofail(inode, EXT4_C2B(EXT4_SB(inode->i_sb), ar.len)); } + + if (*errp == -ENOSPC) + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC_META); + return ret; } diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 163afe2..7d75ff9 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2542,6 +2542,7 @@ void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, grp->bb_free)); } /* diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5cb9a21..2a7af0f 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1238,7 +1238,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free) 
percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free); spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); - + fs_event_free_space(sbi->s_sb, to_free); dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free)); } diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 24a4b6d..c2df6f0 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4511,6 +4511,9 @@ out: kmem_cache_free(ext4_ac_cachep, ac); if (inquota && ar->len < inquota) dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len)); + if (reserv_clstrs && ar->len < reserv_clstrs) + fs_event_free_space(sbi->s_sb, + EXT4_C2B(sbi, reserv_clstrs - ar->len)); if (!ar->len) { if ((ar->flags & EXT4_MB_DELALLOC_RESERVED) == 0) /* release all the reserved blocks if non delalloc */ @@ -4848,7 +4851,7 @@ do_more: if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters); - + fs_event_free_space(sb, EXT4_C2B(sbi, count_clusters)); ext4_mb_unload_buddy(&e4b); /* We dirtied the bitmap block */ @@ -4982,6 +4985,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, ext4_unlock_group(sb, block_group); percpu_counter_add(&sbi->s_freeclusters_counter, EXT4_NUM_B2C(sbi, blocks_freed)); + fs_event_free_space(sb, blocks_freed); if (sbi->s_log_groups_per_flex) { ext4_group_t flex_group = ext4_flex_group(sbi, block_group); diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 8a8ec62..dbf08d6 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -1378,6 +1378,7 @@ static void ext4_update_super(struct super_block *sb, EXT4_NUM_B2C(sbi, free_blocks)); percpu_counter_add(&sbi->s_freeinodes_counter, EXT4_INODES_PER_GROUP(sb) * flex_gd->count); + fs_event_free_space(sb, free_blocks - reserved_blocks); ext4_debug("free blocks count %llu", percpu_counter_read(&sbi->s_freeclusters_counter)); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index e061e66..108b667 100644 --- a/fs/ext4/super.c 
+++ b/fs/ext4/super.c @@ -585,6 +585,8 @@ void __ext4_abort(struct super_block *sb, const char *function, if (EXT4_SB(sb)->s_journal) jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); save_error_info(sb, function, line); + fs_event_notify(sb, FS_ERR_REMOUNT_RO); + } if (test_opt(sb, ERRORS_PANIC)) panic("EXT4-fs panic from previous error\n"); @@ -1083,6 +1085,12 @@ static const struct quotactl_ops ext4_qctl_operations = { }; #endif +static void ext4_trace_query(struct super_block *sb, u64 *ncount); + +static const struct fs_trace_operations ext4_trace_ops = { + .query = ext4_trace_query, +}; + static const struct super_operations ext4_sops = { .alloc_inode = ext4_alloc_inode, .destroy_inode = ext4_destroy_inode, @@ -3398,11 +3406,20 @@ static int ext4_reserve_clusters(struct ext4_sb_info *sbi, ext4_fsblk_t count) { ext4_fsblk_t clusters = ext4_blocks_count(sbi->s_es) >> sbi->s_cluster_bits; + ext4_fsblk_t current_resv; if (count >= clusters) return -EINVAL; + current_resv = atomic64_read(&sbi->s_resv_clusters); atomic64_set(&sbi->s_resv_clusters, count); + + if (count > current_resv) + fs_event_alloc_space(sbi->s_sb, + EXT4_C2B(sbi, count - current_resv)); + else + fs_event_free_space(sbi->s_sb, + EXT4_C2B(sbi, current_resv - count)); return 0; } @@ -3966,6 +3983,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) sb->s_qcop = &ext4_qctl_operations; sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP; #endif + sb->s_etrace.ops = &ext4_trace_ops; + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; + memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid)); INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */ @@ -5438,6 +5458,25 @@ out: #endif +static void ext4_trace_query(struct super_block *sb, u64 *ncount) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_super_block *es = sbi->s_es; + ext4_fsblk_t rsv_blocks; + ext4_fsblk_t nblocks; + + nblocks = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) - + 
percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter); + nblocks = EXT4_C2B(sbi, nblocks); + rsv_blocks = ext4_r_blocks_count(es) + + EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters)); + if (nblocks < rsv_blocks) + nblocks = 0; + else + nblocks -= rsv_blocks; + *ncount = nblocks; +} + static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { -- 1.7.9.5
From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Tue, 16 Jun 2015 17:21:47 +0100 From: Al Viro Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Message-ID: <20150616162147.GA17109@ZenIV.linux.org.uk> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska@samsung.com> Sender: owner-linux-mm@kvack.org To: Beata Michalska Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org,
kyungmin.park@samsung.com, kmpark@infradead.org On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: > Introduce configurable generic interface for file > system-wide event notifications, to provide file > systems with a common way of reporting any potential > issues as they emerge. > > The notifications are to be issued through generic > netlink interface by newly introduced multicast group. > > Threshold notifications have been included, allowing > triggering an event whenever the amount of free space drops > below a certain level - or levels to be more precise as two > of them are being supported: the lower and the upper range. > The notifications work both ways: once the threshold level > has been reached, an event shall be generated whenever > the number of available blocks goes up again re-activating > the threshold. > > The interface has been exposed through a vfs. Once mounted, > it serves as an entry point for the set-up where one can > register for particular file system events. Hmm... 1) what happens if two processes write to that file at the same time, trying to create an entry for the same fs? WARN_ON() and fail for one of them if they race? 2) what happens if fs is mounted more than once (e.g. in different namespaces, or bound at different mountpoints, or just plain mounted several times in different places) and we add an event for each? More specifically, what should happen when one of those gets unmounted? 3) what's the meaning of ->active? Is that "fs_drop_trace_entry() hadn't been called yet" flag? Unless I'm misreading it, we can very well get explicit removal race with umount, resulting in cleanup_mnt() returning from fs_event_mount_dropped() before the first process (i.e. write asking to remove that entry) gets around to its deactivate_super(), ending up with umount(2) on a filesystem that isn't mounted anywhere else reporting success to userland before the actual fs shutdown, which is not a nice thing to do... 
4) test in fs_event_mount_dropped() looks very odd - by that point we are absolutely guaranteed to have ->mnt_ns == NULL. What's that supposed to do? Al, trying to figure out the lifetime rules in all of that...
From mboxrd@z Thu Jan 1 00:00:00 1970 MIME-Version: 1.0 In-Reply-To: <1434460173-18427-5-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-5-git-send-email-b.michalska@samsung.com> From: Leon Romanovsky Date: Wed, 17 Jun 2015 09:08:41 +0300 Subject: Re: [RFC v3 4/4] shmem: Add support for generic FS events Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Beata Michalska Cc: "linux-kernel@vger.kernel.org" , Linux-FSDevel , linux-api@vger.kernel.org, Greg Kroah , jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, Hugh Dickins , lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, Linux-MM , kyungmin.park@samsung.com, kmpark@infradead.org > } > - if (error == -ENOSPC && !once++) { > + if (error == -ENOSPC) { > + if (!once++) { > info = SHMEM_I(inode); > spin_lock(&info->lock); >
shmem_recalc_inode(inode); > spin_unlock(&info->lock); > goto repeat; > + } else { > + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC); > + } > } Very minor remark, please fix indentation.
From mboxrd@z Thu Jan 1 00:00:00 1970 Message-id: <55813C9C.1010608@samsung.com> Date: Wed, 17 Jun 2015 11:23:40 +0200 From: Beata Michalska MIME-version: 1.0 Subject: Re: [RFC v3 4/4] shmem: Add support for generic FS events References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-5-git-send-email-b.michalska@samsung.com> Content-type: text/plain; charset=UTF-8 Content-transfer-encoding: 7bit Sender: owner-linux-mm@kvack.org To: Leon Romanovsky Cc: "linux-kernel@vger.kernel.org" , Linux-FSDevel , linux-api@vger.kernel.org, Greg Kroah , jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, Hugh Dickins , lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, Linux-MM , kyungmin.park@samsung.com, kmpark@infradead.org On
06/17/2015 08:08 AM, Leon Romanovsky wrote: >> } >> - if (error == -ENOSPC && !once++) { >> + if (error == -ENOSPC) { >> + if (!once++) { >> info = SHMEM_I(inode); >> spin_lock(&info->lock); >> shmem_recalc_inode(inode); >> spin_unlock(&info->lock); >> goto repeat; >> + } else { >> + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC); >> + } >> } > > Very minor remark, please fix indentation. > I will, thank You. BR Beata
From mboxrd@z Thu Jan 1 00:00:00 1970 Message-id: <55828064.5040301@samsung.com> Date: Thu, 18 Jun 2015 10:25:08 +0200 From: Beata Michalska MIME-version: 1.0 Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <20150617230605.GK10224@dastard> In-reply-to: <20150617230605.GK10224@dastard> Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dave
Chinner Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org Hi, On 06/18/2015 01:06 AM, Dave Chinner wrote: > On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: >> Introduce configurable generic interface for file >> system-wide event notifications, to provide file >> systems with a common way of reporting any potential >> issues as they emerge. >> >> The notifications are to be issued through generic >> netlink interface by newly introduced multicast group. >> >> Threshold notifications have been included, allowing >> triggering an event whenever the amount of free space drops >> below a certain level - or levels to be more precise as two >> of them are being supported: the lower and the upper range. >> The notifications work both ways: once the threshold level >> has been reached, an event shall be generated whenever >> the number of available blocks goes up again re-activating >> the threshold. >> >> The interface has been exposed through a vfs. Once mounted, >> it serves as an entry point for the set-up where one can >> register for particular file system events. >> >> Signed-off-by: Beata Michalska > > This has massive scalability problems: > >> + 4.3 Threshold notifications: >> + >> + #include >> + void fs_event_alloc_space(struct super_block *sb, u64 ncount); >> + void fs_event_free_space(struct super_block *sb, u64 ncount); >> + >> + Each filesystem supporting the threshold notifications should call >> + fs_event_alloc_space/fs_event_free_space respectively whenever the >> + amount of available blocks changes. >> + - sb: the filesystem's super block >> + - ncount: number of blocks being acquired/released > > ... here.
> >> + Note that to properly handle the threshold notifications the fs events >> + interface needs to be kept up to date by the filesystems. Each should >> + register fs_trace_operations to enable querying the current number of >> + available blocks. > > Have you noticed that the filesystems have percpu counters for > tracking global space usage? There's good reason for that - taking a > spinlock in such a hot accounting path causes severe contention. > >> +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id) >> +{ >> + size_t size = nla_total_size(sizeof(u32)) * 2 + >> + nla_total_size(sizeof(u64)); >> + >> + fs_netlink_send_event(size, event_id, create_common_msg, en); >> +} >> + >> +static void fs_event_send_thresh(struct fs_trace_entry *en, >> + unsigned int event_id) >> +{ >> + size_t size = nla_total_size(sizeof(u32)) * 2 + >> + nla_total_size(sizeof(u64)) * 2; >> + >> + fs_netlink_send_event(size, event_id, create_thresh_msg, en); >> +} >> + >> +void fs_event_notify(struct super_block *sb, unsigned int event_id) >> +{ >> + struct fs_trace_entry *en; >> + >> + en = fs_trace_entry_get_rcu(sb); >> + if (!en) >> + return; >> + >> + spin_lock(&en->lock); >> + if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC)) >> + fs_event_send(en, event_id); >> + spin_unlock(&en->lock); >> + fs_trace_entry_put(en); >> +} >> +EXPORT_SYMBOL(fs_event_notify); >> + >> +void fs_event_alloc_space(struct super_block *sb, u64 ncount) >> +{ >> + struct fs_trace_entry *en; >> + s64 count; >> + >> + en = fs_trace_entry_get_rcu(sb); >> + if (!en) >> + return; > > Adds an atomic write to get the trace entry, > >> + spin_lock(&en->lock); > > a spin lock to lock the entry, > > >> + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) >> + goto leave; >> + /* >> + * we shouldn't drop below 0 here, >> + * unless there is a sync issue somewhere (?) >> + */ >> + count = en->th.avail_space - ncount; >> + en->th.avail_space = count < 0 ? 
0 : count; >> + >> + if (en->th.avail_space > en->th.lrange) >> + /* Not 'even' close - leave */ >> + goto leave; >> + >> + if (en->th.avail_space > en->th.urange) { >> + /* Close enough - the lower range has been reached */ >> + if (!(en->th.state & THRESH_LR_BEYOND)) { >> + /* Send notification */ >> + fs_event_send_thresh(en, FS_THR_LRBELOW); >> + en->th.state &= ~THRESH_LR_BELOW; >> + en->th.state |= THRESH_LR_BEYOND; >> + } >> + goto leave; > > Then puts the entire netlink send path inside this spinlock, which > includes memory allocation and all sorts of non-filesystem code > paths. And it may be inside critical filesystem locks as well.... > > Apart from the serialisation problem of the locking, adding > memory allocation and the network send path to filesystem code > that is effectively considered "innermost" filesystem code is going > to have all sorts of problems for various filesystems. In the XFS > case, we simply cannot execute this sort of function in the places > where we update global space accounting. > > As it is, I think the basic concept of separate tracking of free > space is fundamentally flawed. What I think needs to be done is that > filesystems need access to the thresholds for events, and then the > filesystems call fs_event_send_thresh() themselves from appropriate > contexts (ie. without compromising locking, scalability, memory > allocation recursion constraints, etc). > > e.g. instead of tracking every change in free space, a filesystem > might execute this once every few seconds from a workqueue: > > event = fs_event_need_space_warning(sb, ) > if (event) > fs_event_send_thresh(sb, event); > > User still gets warnings about space usage, but there's no runtime > overhead or problems with lock/memory allocation contexts, etc. > > Cheers, > > Dave. > Having the fs keep a firm hand on the threshold limits would indeed be a far saner approach, though that would require each fs to add support for it and handle most of it on its own.
Avoiding this was the main rationale behind this rfc. If fs people agree to that, I'll be more than willing to drop this in favour of the per-fs tracking solution. Personally, I hope they will. Best Regards Beata
From mboxrd@z Thu Jan 1 00:00:00 1970 Message-id: <5584512B.5020301@samsung.com> Date: Fri, 19 Jun 2015 19:28:11 +0200 From: Beata Michalska MIME-version: 1.0 Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <20150617230605.GK10224@dastard> <55828064.5040301@samsung.com> <20150619000341.GM10224@dastard> In-reply-to: <20150619000341.GM10224@dastard> Content-type: text/plain; charset=ISO-8859-1 Content-transfer-encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dave Chinner Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu,
	adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org

On 06/19/2015 02:03 AM, Dave Chinner wrote:
> On Thu, Jun 18, 2015 at 10:25:08AM +0200, Beata Michalska wrote:
>> On 06/18/2015 01:06 AM, Dave Chinner wrote:
>>> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote:
>>>> Introduce configurable generic interface for file
>>>> system-wide event notifications, to provide file
>>>> systems with a common way of reporting any potential
>>>> issues as they emerge.
>>>>
>>>> The notifications are to be issued through generic
>>>> netlink interface by newly introduced multicast group.
>>>>
>>>> Threshold notifications have been included, allowing
>>>> triggering an event whenever the amount of free space drops
>>>> below a certain level - or levels to be more precise as two
>>>> of them are being supported: the lower and the upper range.
>>>> The notifications work both ways: once the threshold level
>>>> has been reached, an event shall be generated whenever
>>>> the number of available blocks goes up again re-activating
>>>> the threshold.
>>>>
>>>> The interface has been exposed through a vfs. Once mounted,
>>>> it serves as an entry point for the set-up where one can
>>>> register for particular file system events.
>>>>
>>>> Signed-off-by: Beata Michalska
>>>
>>> This has massive scalability problems:
> ....
>>> Have you noticed that the filesystems have percpu counters for
>>> tracking global space usage? There's good reason for that - taking a
>>> spinlock in such a hot accounting path causes severe contention.
> ....
>>> Then puts the entire netlink send path inside this spinlock, which
>>> includes memory allocation and all sorts of non-filesystem code
>>> paths. And it may be inside critical filesystem locks as well....
>>>
>>> Apart from the serialisation problem of the locking, adding
>>> memory allocation and the network send path to filesystem code
>>> that is effectively considered "innermost" filesystem code is going
>>> to have all sorts of problems for various filesystems. In the XFS
>>> case, we simply cannot execute this sort of function in the places
>>> where we update global space accounting.
>>>
>>> As it is, I think the basic concept of separate tracking of free
>>> space if fundamentally flawed. What I think needs to be done is that
>>> filesystems need access to the thresholds for events, and then the
>>> filesystems call fs_event_send_thresh() themselves from appropriate
>>> contexts (ie. without compromising locking, scalability, memory
>>> allocation recursion constraints, etc).
>>>
>>> e.g. instead of tracking every change in free space, a filesystem
>>> might execute this once every few seconds from a workqueue:
>>>
>>>	event = fs_event_need_space_warning(sb, )
>>>	if (event)
>>>		fs_event_send_thresh(sb, event);
>>>
>>> User still gets warnings about space usage, but there's no runtime
>>> overhead or problems with lock/memory allocation contexts, etc.
>>
>> Having fs to keep a firm hand on thresholds limits would indeed be
>> far more sane approach though that would require each fs to
>> add support for that and handle most of it on their own. Avoiding
>> this was the main rationale behind this rfc.
>> If fs people agree to that, I'll be more than willing to drop this
>> in favour of the per-fs tracking solution.
>> Personally, I hope they will.
>
> I was hoping that you'd think a little more about my suggestion and
> work out how to do background threshold event detection generically.
> I kind of left it as "an exercise for the reader" because it seems
> obvious to me.
>
> Hint: ->statfs allows you to get the total, free and used space
> from filesystems in a generic manner.
>
> Cheers,
>
> Dave.
>

I haven't given up on that, so yes, I'm still working on a more suitable
generic solution.
Background detection is one of the options, though it needs some more
thought. Giving up the sync approach means less accuracy for the threshold
notifications, but I guess this could be fine-tuned to get an acceptable
level. Another bump: how is this tuning supposed to be done (additional
config option maybe)? The interface would have to keep it somehow sane -
but what would 'sane' mean in this case (?)
Also, I'm not sure whether a single approach would serve well here for all
the potentially supported file systems, so this would have to be properly
adjusted (taking the threshold levels into consideration as well).
And still, it would require some form of synchronization with the tracked
fs so that this 'detection' is not being unnecessarily performed
(e.g. while the fs remains frozen).

There is also an idea of using an interface resembling the stackable fs:
a transparent file system layered on top of the tracked one (solely for
the tracking purposes). This would simplify handling the trace object's
lifetime - no more list of registered traces. It would also give a way of
tracking (to some extent) the changes in the amount of available space,
which combined with a tweaked background check could give a solution with
less performance overhead than the original one.
I'll try this one and see how it goes.

Thank You for your feedback so far - I really appreciate it.

Best Regards
Beata
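For illustration only, the background-detection idea discussed above (sample the available space periodically instead of tracking every allocation) could be sketched in userspace roughly as below. This is a sketch under assumptions, not the patch's implementation: the thr_* names and the event codes are hypothetical, statvfs() stands in for the in-kernel ->statfs call, and the caller is expected to invoke thr_poll() from a timer. As in the patch, the thresholds refer to available blocks, so the lower range is the larger number.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical event codes; the patch's FS_THR_* values are not assumed. */
enum thr_event { THR_NONE = 0, THR_LR_BELOW, THR_UR_BELOW,
		 THR_LR_ABOVE, THR_UR_ABOVE };

/* Zone the last sample fell into, ordered by severity. */
enum thr_zone { ZONE_OK = 0, ZONE_LOWER, ZONE_UPPER };

struct thr_watch {
	uint64_t lrange;	/* lower range: larger block count */
	uint64_t urange;	/* upper range: smaller block count */
	enum thr_zone zone;	/* zone seen by the previous poll */
};

/* Map an available-block count onto a zone. */
enum thr_zone thr_classify(const struct thr_watch *w, uint64_t avail)
{
	assert(w->lrange >= w->urange);
	if (avail > w->lrange)
		return ZONE_OK;
	if (avail > w->urange)
		return ZONE_LOWER;
	return ZONE_UPPER;
}

/* Report an event only when the zone changes between two samples. */
enum thr_event thr_update(struct thr_watch *w, uint64_t avail)
{
	enum thr_zone now = thr_classify(w, avail);
	enum thr_zone prev = w->zone;

	w->zone = now;
	if (now == prev)
		return THR_NONE;
	if (now > prev)		/* space dropped into a worse zone */
		return now == ZONE_UPPER ? THR_UR_BELOW : THR_LR_BELOW;
	/* space recovered into a better zone */
	return now == ZONE_OK ? THR_LR_ABOVE : THR_UR_ABOVE;
}

/* One polling step; returns -1 if statvfs() fails. */
int thr_poll(struct thr_watch *w, const char *mount, enum thr_event *ev)
{
	struct statvfs st;

	if (statvfs(mount, &st) != 0)
		return -1;
	*ev = thr_update(w, (uint64_t)st.f_bavail);
	return 0;
}
```

The accuracy trade-off mentioned above shows up directly here: a change that crosses a threshold and crosses back between two polls is never reported, so the polling interval would be the tuning knob.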

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-lb0-f176.google.com (mail-lb0-f176.google.com [209.85.217.176])
	by kanga.kvack.org (Postfix) with ESMTP id 136F26B006E
	for ; Wed, 24 Jun 2015 04:50:12 -0400 (EDT)
Received: by lbbpo10 with SMTP id po10so22059319lbb.3
	for ; Wed, 24 Jun 2015 01:50:11 -0700 (PDT)
Received: from mail-lb0-x22f.google.com (mail-lb0-x22f.google.com. [2a00:1450:4010:c04::22f])
	by mx.google.com with ESMTPS id o5si21445551lao.159.2015.06.24.01.50.08
	for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Wed, 24 Jun 2015 01:50:08 -0700 (PDT)
Received: by lbbpo10 with SMTP id po10so22058595lbb.3
	for ; Wed, 24 Jun 2015 01:50:08 -0700 (PDT)
From: Dmitry Monakhov
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
	<1434460173-18427-2-git-send-email-b.michalska@samsung.com>
Date: Wed, 24 Jun 2015 11:47:18 +0300
Message-ID: <87oak5ebmx.fsf@openvz.org>
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha512; protocol="application/pgp-signature"
Sender: owner-linux-mm@kvack.org
List-ID:
To: Beata Michalska , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org
Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org

--=-=-=
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

Beata Michalska writes:

> Introduce configurable generic interface for file
> system-wide event notifications, to provide file
> systems with a common way of reporting any potential
> issues as they emerge.
>
> The notifications are to be issued through generic
> netlink interface by newly introduced multicast group.
>
> Threshold notifications have been included, allowing
> triggering an event whenever the amount of free space drops
> below a certain level - or levels to be more precise as two
> of them are being supported: the lower and the upper range.
> The notifications work both ways: once the threshold level
> has been reached, an event shall be generated whenever
> the number of available blocks goes up again re-activating
> the threshold.
>
> The interface has been exposed through a vfs. Once mounted,
> it serves as an entry point for the set-up where one can
> register for particular file system events.
>
> Signed-off-by: Beata Michalska
> ---
>  Documentation/filesystems/events.txt |  232 ++++++++++
>  fs/Kconfig                           |    2 +
>  fs/Makefile                          |    1 +
>  fs/events/Kconfig                    |    7 +
>  fs/events/Makefile                   |    5 +
>  fs/events/fs_event.c                 |  809 ++++++++++++++++++++++++++++++++++
>  fs/events/fs_event.h                 |   22 +
>  fs/events/fs_event_netlink.c         |  104 +++++
>  fs/namespace.c                       |    1 +
>  include/linux/fs.h                   |    6 +-
>  include/linux/fs_event.h             |   72 +++
>  include/uapi/linux/Kbuild            |    1 +
>  include/uapi/linux/fs_event.h        |   58 +++
>  13 files changed, 1319 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/filesystems/events.txt
>  create mode 100644 fs/events/Kconfig
>  create mode 100644 fs/events/Makefile
>  create mode 100644 fs/events/fs_event.c
>  create mode 100644 fs/events/fs_event.h
>  create mode 100644 fs/events/fs_event_netlink.c
>  create mode 100644 include/linux/fs_event.h
>  create mode 100644 include/uapi/linux/fs_event.h
>
> diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt
> new file mode 100644
> index 0000000..c2e6227
> --- /dev/null
> +++ b/Documentation/filesystems/events.txt
> @@ -0,0 +1,232 @@
> +
> +	Generic file system event notification interface
> +
> +Document created 23 April 2015 by Beata Michalska
> +
> +1.
The reason behind:
> +=====================
> +
> +There are many corner cases when things might get messy with the filesystems.
> +And it is not always obvious what and when went wrong. Sometimes you might
> +get some subtle hints that there is something going on - but by the time
> +you realise it, it might be too late as you are already out-of-space
> +or the filesystem has been remounted as read-only (i.e.). The generic
> +interface for the filesystem events fills the gap by providing a rather
> +easy way of real-time notifications triggered whenever something interesting
> +happens, allowing filesystems to report events in a common way, as they occur.
> +
> +2. How does it work:
> +====================
> +
> +The interface itself has been exposed as fstrace-type Virtual File System,
> +primarily to ease the process of setting up the configuration for the
> +notifications. So for starters, it needs to get mounted (obviously):
> +
> +	mount -t fstrace none /sys/fs/events
> +
> +This will unveil the single fstrace filesystem entry - the 'config' file,
> +through which the notification are being set-up.
> +
> +Activating notifications for particular filesystem is as straightforward
> +as writing into the 'config' file. Note that by default all events, despite
> +the actual filesystem type, are being disregarded.
> +
> +Synopsis of config:
> +------------------
> +
> +	MOUNT EVENT_TYPE [L1] [L2]
> +
> + MOUNT      : the filesystem's mount point
> + EVENT_TYPE : event types - currently two of them are being supported:
> +
> +	* generic events ("G") covering most common warnings
> +	  and errors that might be reported by any filesystem;
> +	  this option does not take any arguments;
> +
> +	* threshold notifications ("T") - events sent whenever
> +	  the amount of available space drops below certain level;
> +	  it is possible to specify two threshold levels though
> +	  only one is required to properly setup the notifications;
> +	  as those refer to the number of available blocks, the lower
> +	  level [L1] needs to be higher than the upper one [L2]
> +
> +Sample request could look like the following:
> +
> +	echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
> +
> +Multiple request might be specified provided they are separated with semicolon.
> +
> +The configuration itself might be modified at any time. One can add/remove
> +particular event types for given fielsystem, modify the threshold levels,
> +and remove single or all entries from the 'config' file.
> +
> + - Adding new event type:
> +
> +	$ echo MOUNT EVENT_TYPE > /sys/fs/events/config
> +
> +(Note that is is enough to provide the event type to be enabled without
> +the already set ones.)
> +
> + - Removing event type:
> +
> +	$ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config
> +
> + - Updating threshold limits:
> +
> +	$ echo MOUNT T L1 L2 > /sys/fs/events/config
> +
> + - Removing single entry:
> +
> +	$ echo '!MOUNT' > /sys/fs/events/config
> +
> + - Removing all entries:
> +
> +	$ echo > /sys/fs/events/config
> +
> +Reading the file will list all registered entries with their current set-up
> +along with some additional info like the filesystem type and the backing device
> +name if available.
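As a small illustration of the request syntax described above, a threshold request line could be assembled as in the sketch below. The helper name fs_events_format_thresh() is hypothetical and not part of the patch; it only builds the "MOUNT T L1 L2" string and leaves writing it to /sys/fs/events/config to the caller. It also insists on both levels being given, which is stricter than the interface itself (only one level is required there).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a threshold request "MOUNT T L1 L2".
 * Thresholds count available blocks, so the lower level L1 has to be
 * the larger number. Returns the string length, or -1 when L1 <= L2
 * or when the buffer is too small.
 */
int fs_events_format_thresh(char *buf, size_t len, const char *mount,
			    uint64_t l1, uint64_t l2)
{
	int n;

	assert(buf != NULL && mount != NULL);

	if (l1 <= l2)
		return -1;

	n = snprintf(buf, len, "%s T %" PRIu64 " %" PRIu64, mount, l1, l2);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}
```

With the values from the sample request, the call fs_events_format_thresh(buf, sizeof(buf), "/sample/mount/point", 710000, 500000) fills buf with "/sample/mount/point T 710000 500000".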
> +
> +Final, though a very important note on the configuration: when and if the
> +actual events are being triggered falls way beyond the scope of the generic
> +filesystem events interface. It is up to a particular filesystem
> +implementation which events are to be supported - if any at all. So if
> +given filesystem does not support the event notifications, an attempt to
> +enable those through 'config' file will fail.
> +
> +
> +3. The generic netlink interface support:
> +=========================================
> +
> +Whenever an event notification is triggered (by given filesystem) the current
> +configuration is being validated to decide whether a userpsace notification
> +should be launched. If there has been no request (in a mean of 'config' file
> +entry) for given event, one will be silently disregarded. If, on the other
> +hand, someone is 'watching' given filesystem for specific events, a generic
> +netlink message will be sent. A dedicated multicast group has been provided
> +solely for this purpose so in order to receive such notifications, one should
> +subscribe to this new multicast group. As for now only the init network
> +namespace is being supported.
> +
> +3.1 Message format
> +
> +The FS_NL_C_EVENT shall be stored within the generic netlink message header
> +as the command field. The message payload will provide more detailed info:
> +the backing device major and minor numbers, the event code and the id of
> +the process which action led to the event occurrence. In case of threshold
> +notifications, the current number of available blocks will be included
> +in the payload as well.
> +
> +
> +  0                   1                   2                   3
> +  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + |                    NETLINK MESSAGE HEADER                     |
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + |                GENERIC NETLINK MESSAGE HEADER                 |
> + |         (with FS_NL_C_EVENT as genlmsghdr cdm field)          |
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + |             Optional user specific message header             |
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> + |                   GENERIC MESSAGE PAYLOAD:                    |
> + +---------------------------------------------------------------+
> + |                  FS_NL_A_EVENT_ID (NLA_U32)                   |
> + +---------------------------------------------------------------+
> + |                  FS_NL_A_DEV_MAJOR (NLA_U32)                  |
> + +---------------------------------------------------------------+
> + |                  FS_NL_A_DEV_MINOR (NLA_U32)                  |
> + +---------------------------------------------------------------+
> + |                  FS_NL_A_CAUSED_ID (NLA_U32)                  |
> + +---------------------------------------------------------------+
> + |                    FS_NL_A_DATA (NLA_U64)                     |
> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> +
> +
> +The above figure is based on:
> + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format
> +
> +
> +4. API Reference:
> +=================
> +
> + 4.1 Generic file system event interface data & operations
> +
> + #include
> +
> + struct fs_trace_info {
> +	void __rcu *e_priv /* READ ONLY */
> +	unsigned int events_cap_mask; /* Supported notifications */
> +	const struct fs_trace_operations *ops;
> + };
> +
> + struct fs_trace_operations {
> +	void (*query)(struct super_block *, u64 *);
> + };
> +
> + In order to get the fireworks and stuff, each filesystem needs to setup
> + the events_cap_mask field of the fs_trace_info structure, which has been
> + embedded within the super_block structure.
This should reflect the type of
> + events the filesystem wants to support. In case of threshold notifications,
> + apart from setting the FS_EVENT_THRESH flag, the 'query' callback should
> + be provided as this enables the events interface to get the up-to-date
> + state of the number of available blocks whenever those notifications are
> + being requested.
> +
> + The 'e_priv' field of the fs_trace_info structure should be completely ignored
> + as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you
> + do not want to get yourself into some real trouble. If still, you are tempted
> + to do so - feel free, it's gonna be pure fun. Consider yourself warned.
> +
> +
> + 4.2 Event notification:
> +
> + #include
> + void fs_event_notify(struct super_block *sb, unsigned int event_id);
> +
> + Notify the generic FS event interface of an occurring event.
> + This shall be used by any file system that wishes to inform any potential
> + listeners/watchers of a particular event.
> + - sb: the filesystem's super block
> + - event_id: an event identifier
> +
> + 4.3 Threshold notifications:
> +
> + #include
> + void fs_event_alloc_space(struct super_block *sb, u64 ncount);
> + void fs_event_free_space(struct super_block *sb, u64 ncount);
> +
> + Each filesystme supporting the threshold notifications should call
> + fs_event_alloc_space/fs_event_free_space respectively whenever the
> + amount of available blocks changes.
> + - sb: the filesystem's super block
> + - ncount: number of blocks being acquired/released
> +
> + Note that to properly handle the threshold notifications the fs events
> + interface needs to be kept up to date by the filesystems. Each should
> + register fs_trace_operations to enable querying the current number of
> + available blocks.
> +
> + 4.4 Sending message through generic netlink interface
> +
> + #include
> +
> + int fs_netlink_send_event(size_t size, unsigned int event_id,
> +	int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata);
> +
> + Although the fs event interface is fully responsible for sending the messages
> + over the netlink, filesystems might use the FS_EVENT multicast group to send
> + their own custom messages.
> + - size: the size of the message payload
> + - event_id: the event identifier
> + - compose_msg: a callback responsible for filling-in the message payload
> + - cbdata: message custom data
> +
> + Calling fs_netlink_send_event will result in a message being sent by
> + the FS_EVENT multicast group. Note that the body of the message should be
> + prepared (set-up )by the caller - through compose_msg callback. The message's
> + sk_buff will be allocated on behalf of the caller (thus the size parameter).
> + The compose_msg should only fill the payload with proper data. Unless
> + the event id is specified as FS_EVENT_NONE, it's value shall be added
> + to the payload prior to calling the compose_msg.
> +
> +
> diff --git a/fs/Kconfig b/fs/Kconfig
> index ec35851..a89e678 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -69,6 +69,8 @@ config FILE_LOCKING
>  	  for filesystems like NFS and for the flock() system
>  	  call. Disabling this option saves about 11k.
>
> +source "fs/events/Kconfig"
> +
>  source "fs/notify/Kconfig"
>
>  source "fs/quota/Kconfig"
> diff --git a/fs/Makefile b/fs/Makefile
> index a88ac48..bcb3048 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
>  obj-$(CONFIG_CEPH_FS) += ceph/
>  obj-$(CONFIG_PSTORE) += pstore/
>  obj-$(CONFIG_EFIVAR_FS) += efivarfs/
> +obj-$(CONFIG_FS_EVENTS) += events/
> diff --git a/fs/events/Kconfig b/fs/events/Kconfig
> new file mode 100644
> index 0000000..1c60195
> --- /dev/null
> +++ b/fs/events/Kconfig
> @@ -0,0 +1,7 @@
> +# Generic Files System events interface
> +config FS_EVENTS
> +	bool "Generic filesystem events"
> +	select NET
> +	default y
> +	help
> +	  Enable generic filesystem events interface
> diff --git a/fs/events/Makefile b/fs/events/Makefile
> new file mode 100644
> index 0000000..9c98337
> --- /dev/null
> +++ b/fs/events/Makefile
> @@ -0,0 +1,5 @@
> +#
> +# Makefile for the Linux Generic File System Event Interface
> +#
> +
> +obj-y := fs_event.o fs_event_netlink.o
> diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c
> new file mode 100644
> index 0000000..1037311
> --- /dev/null
> +++ b/fs/events/fs_event.c
> @@ -0,0 +1,809 @@
> +/*
> + * Generic File System Evens Interface
> + *
> + * Copyright(c) 2015 Samsung Electronics. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2.
> + *
> + * The full GNU General Public License is included in this distribution in the
> + * file called COPYING.
> + *
> + * This program is distributed in the hope that it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include "../pnode.h"
> +#include "fs_event.h"
> +
> +static LIST_HEAD(fs_trace_list);
> +static DEFINE_MUTEX(fs_trace_lock);
> +
> +static struct kmem_cache *fs_trace_cachep __read_mostly;
> +
> +static atomic_t stray_traces = ATOMIC_INIT(0);
> +static DECLARE_WAIT_QUEUE_HEAD(trace_wq);
> +/*
> + * Threshold notification state bits.
> + * Note the reverse as this refers to the number
> + * of available blocks.
> + */
> +#define THRESH_LR_BELOW		0x0001 /* Falling below the lower range */
> +#define THRESH_LR_BEYOND	0x0002
> +#define THRESH_UR_BELOW		0x0004
> +#define THRESH_UR_BEYOND	0x0008 /* Going beyond the upper range */
> +
> +#define THRESH_LR_ON	(THRESH_LR_BELOW | THRESH_LR_BEYOND)
> +#define THRESH_UR_ON	(THRESH_UR_BELOW | THRESH_UR_BEYOND)
> +
> +#define FS_TRACE_ADD	0x100000
> +
> +struct fs_trace_entry {
> +	struct kref count;
> +	atomic_t active;
> +	struct super_block *sb;
> +	unsigned int notify;
> +	struct path mnt_path;
> +	struct list_head node;
> +
> +	struct fs_event_thresh {
> +		u64 avail_space;
> +		u64 lrange;
> +		u64 urange;
> +		unsigned int state;
> +	} th;
> +	struct rcu_head rcu_head;
> +	spinlock_t lock;
> +};
> +
> +static const match_table_t fs_etypes = {
> +	{ FS_EVENT_GENERIC, "G" },
> +	{ FS_EVENT_THRESH, "T" },
> +	{ 0, NULL },
> +};
> +
> +static inline int fs_trace_query_data(struct super_block *sb,
> +				      struct fs_trace_entry *en)
> +{
> +	if (sb->s_etrace.ops && sb->s_etrace.ops->query) {
> +		sb->s_etrace.ops->query(sb, &en->th.avail_space);
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +static inline void fs_trace_entry_free(struct fs_trace_entry *en)
> +{
> +	kmem_cache_free(fs_trace_cachep, en);
> +}
> +
> +static void fs_destroy_trace_entry(struct kref *en_ref)
> +{
> +	struct fs_trace_entry *en = container_of(en_ref,
> +					struct fs_trace_entry, count);
> +
> +	/* Last reference has been
dropped */
> +	fs_trace_entry_free(en);
> +	atomic_dec(&stray_traces);
> +}
> +
> +static void fs_trace_entry_put(struct fs_trace_entry *en)
> +{
> +	kref_put(&en->count, fs_destroy_trace_entry);
> +}
> +
> +static void fs_release_trace_entry(struct rcu_head *rcu_head)
> +{
> +	struct fs_trace_entry *en = container_of(rcu_head,
> +						 struct fs_trace_entry,
> +						 rcu_head);
> +	/*
> +	 * As opposed to typical reference drop, this one is being
> +	 * called from the rcu callback. This is to make sure all
> +	 * readers have managed to safely grab the reference before
> +	 * the change to rcu pointer is visible to all and before
> +	 * the reference is dropped here.
> +	 */
> +	fs_trace_entry_put(en);
> +}
> +
> +static void fs_drop_trace_entry(struct fs_trace_entry *en)
> +{
> +	struct super_block *sb;
> +
> +	lockdep_assert_held(&fs_trace_lock);
> +	/*
> +	 * The trace entry might have already been removed
> +	 * from the list of active traces with the proper
> +	 * ref drop, though it was still in use handling
> +	 * one of the fs events. This means that the object
> +	 * has been already scheduled for being released.
> +	 * So leave...
> +	 */
> +
> +	if (!atomic_add_unless(&en->active, -1, 0))
> +		return;
> +	/*
> +	 * At this point the trace entry is being marked as inactive
> +	 * so no new references will be allowed.
> +	 * Still it might be floating around somewhere
> +	 * so drop the reference when the rcu readers are done.
> +	 */
> +	spin_lock(&en->lock);
> +	list_del(&en->node);
> +	sb = en->sb;
> +	en->sb = NULL;
> +	spin_unlock(&en->lock);
> +
> +	rcu_assign_pointer(sb->s_etrace.e_priv, NULL);
> +	call_rcu(&en->rcu_head, fs_release_trace_entry);
> +	/* It's safe now to drop the reference to the super */
> +	deactivate_super(sb);
> +	atomic_inc(&stray_traces);
> +}
> +
> +static inline
> +struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en)
> +{
> +	if (en) {
> +		if (!kref_get_unless_zero(&en->count))
> +			return NULL;
> +		/* Don't allow referencing inactive object */
> +		if (!atomic_read(&en->active)) {
> +			fs_trace_entry_put(en);
> +			return NULL;
> +		}
> +	}
> +	return en;
> +}
> +
> +static struct fs_trace_entry *fs_trace_entry_get_rcu(struct super_block *sb)
> +{
> +	struct fs_trace_entry *en;
> +
> +	if (!sb)
> +		return NULL;
> +
> +	rcu_read_lock();
> +	en = rcu_dereference(sb->s_etrace.e_priv);
> +	en = fs_trace_entry_get(en);
> +	rcu_read_unlock();
> +
> +	return en;
> +}
> +
> +static int fs_remove_trace_entry(struct super_block *sb)
> +{
> +	struct fs_trace_entry *en;
> +
> +	en = fs_trace_entry_get_rcu(sb);
> +	if (!en)
> +		return -EINVAL;
> +
> +	mutex_lock(&fs_trace_lock);
> +	fs_drop_trace_entry(en);
> +	mutex_unlock(&fs_trace_lock);
> +	fs_trace_entry_put(en);
> +	return 0;
> +}
> +
> +static void fs_remove_all_traces(void)
> +{
> +	struct fs_trace_entry *en, *guard;
> +
> +	mutex_lock(&fs_trace_lock);
> +	list_for_each_entry_safe(en, guard, &fs_trace_list, node)
> +		fs_drop_trace_entry(en);
> +	mutex_unlock(&fs_trace_lock);
> +}
> +
> +static int create_common_msg(struct sk_buff *skb, void *data)
> +{
> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
> +	struct super_block *sb = en->sb;
> +
> +	if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
> +	    || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
> +		return -EINVAL;

What about diskless (nfs, cifs, etc.) filesystems?
btrfs also has no valid sb->s_dev

> +
> +	if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current))))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int create_thresh_msg(struct sk_buff *skb, void *data)
> +{
> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
> +	int ret;
> +
> +	ret = create_common_msg(skb, data);
> +	if (!ret)
> +		ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space);
> +	return ret;
> +}
> +
> +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id)
> +{
> +	size_t size = nla_total_size(sizeof(u32)) * 2 +
> +		      nla_total_size(sizeof(u64));
> +
> +	fs_netlink_send_event(size, event_id, create_common_msg, en);
> +}
> +
> +static void fs_event_send_thresh(struct fs_trace_entry *en,
> +				 unsigned int event_id)
> +{
> +	size_t size = nla_total_size(sizeof(u32)) * 2 +
> +		      nla_total_size(sizeof(u64)) * 2;
> +
> +	fs_netlink_send_event(size, event_id, create_thresh_msg, en);
> +}
> +
> +void fs_event_notify(struct super_block *sb, unsigned int event_id)
> +{
> +	struct fs_trace_entry *en;
> +
> +	en = fs_trace_entry_get_rcu(sb);
> +	if (!en)
> +		return;
> +
> +	spin_lock(&en->lock);
> +	if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC))
> +		fs_event_send(en, event_id);
> +	spin_unlock(&en->lock);
> +	fs_trace_entry_put(en);
> +}
> +EXPORT_SYMBOL(fs_event_notify);
> +
> +void fs_event_alloc_space(struct super_block *sb, u64 ncount)
> +{
> +	struct fs_trace_entry *en;
> +	s64 count;
> +
> +	en = fs_trace_entry_get_rcu(sb);
> +	if (!en)
> +		return;
> +
> +	spin_lock(&en->lock);
> +
> +	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
> +		goto leave;
> +	/*
> +	 * we shouldn't drop below 0 here,
> +	 * unless there is a sync issue somewhere (?)
> +	 */
> +	count = en->th.avail_space - ncount;
> +	en->th.avail_space = count < 0 ?
0 : count;
> +
> +	if (en->th.avail_space > en->th.lrange)
> +		/* Not 'even' close - leave */
> +		goto leave;
> +
> +	if (en->th.avail_space > en->th.urange) {
> +		/* Close enough - the lower range has been reached */
> +		if (!(en->th.state & THRESH_LR_BEYOND)) {
> +			/* Send notification */
> +			fs_event_send_thresh(en, FS_THR_LRBELOW);
> +			en->th.state &= ~THRESH_LR_BELOW;
> +			en->th.state |= THRESH_LR_BEYOND;
> +		}
> +		goto leave;
> +	}
> +	if (!(en->th.state & THRESH_UR_BEYOND)) {
> +		fs_event_send_thresh(en, FS_THR_URBELOW);
> +		en->th.state &= ~THRESH_UR_BELOW;
> +		en->th.state |= THRESH_UR_BEYOND;
> +	}
> +
> +leave:
> +	spin_unlock(&en->lock);
> +	fs_trace_entry_put(en);
> +}
> +EXPORT_SYMBOL(fs_event_alloc_space);
> +
> +void fs_event_free_space(struct super_block *sb, u64 ncount)
> +{
> +	struct fs_trace_entry *en;
> +
> +	en = fs_trace_entry_get_rcu(sb);
> +	if (!en)
> +		return;
> +
> +	spin_lock(&en->lock);
> +
> +	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
> +		goto leave;
> +
> +	en->th.avail_space += ncount;
> +
> +	if (en->th.avail_space > en->th.lrange) {
> +		if (!(en->th.state & THRESH_LR_BELOW)
> +		    && en->th.state & THRESH_LR_BEYOND) {
> +			/* Send notification */
> +			fs_event_send_thresh(en, FS_THR_LRABOVE);
> +			en->th.state &= ~(THRESH_LR_BEYOND|THRESH_UR_BEYOND);
> +			en->th.state |= THRESH_LR_BELOW;
> +			goto leave;
> +		}
> +	}
> +	if (en->th.avail_space > en->th.urange) {
> +		if (!(en->th.state & THRESH_UR_BELOW)
> +		    && en->th.state & THRESH_UR_BEYOND) {
> +			/* Notify */
> +			fs_event_send_thresh(en, FS_THR_URABOVE);
> +			en->th.state &= ~THRESH_UR_BEYOND;
> +			en->th.state |= THRESH_UR_BELOW;
> +		}
> +	}
> +leave:
> +	spin_unlock(&en->lock);
> +	fs_trace_entry_put(en);
> +}
> +EXPORT_SYMBOL(fs_event_free_space);
> +
> +void fs_event_mount_dropped(struct vfsmount *mnt)
> +{
> +	/*
> +	 * The mount is dropped but the super might not get released
> +	 * at once so there is very small chance some notifications
> +	 * will come through.
> +	 * Note that the mount being dropped here might belong to a different
> +	 * namespace - if this is the case, just ignore it.
> +	 */
> +	struct fs_trace_entry *en = fs_trace_entry_get_rcu(mnt->mnt_sb);
> +	struct vfsmount *en_mnt;
> +
> +	if (!en || !atomic_read(&en->active))
> +		return;
> +	/*
> +	 * The entry once set, does not change the mountpoint it's being
> +	 * pinned to, so no need to take the lock here.
> +	 */
> +	en_mnt = en->mnt_path.mnt;
> +	if (!(real_mount(mnt)->mnt_ns != (real_mount(en_mnt))->mnt_ns))
> +		fs_remove_trace_entry(mnt->mnt_sb);
> +	fs_trace_entry_put(en);
> +}
> +
> +static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh,
> +			      unsigned int nmask)
> +{
> +	struct fs_trace_entry *en;
> +	struct super_block *sb;
> +	struct mount *r_mnt;
> +
> +	en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL);
> +	if (unlikely(!en))
> +		return -ENOMEM;
> +	/*
> +	 * Note that no reference is being taken here for the path as it would
> +	 * make the unmount unnecessarily puzzling (due to an extra 'valid'
> +	 * reference for the mnt).
> +	 * This is *rather* safe as the notification on mount being dropped
> +	 * will get called prior to releasing the super block - so right
> +	 * in time to perform appropriate clean-up
> +	 */
> +	r_mnt = real_mount(path->mnt);
> +
> +	en->mnt_path.dentry = r_mnt->mnt.mnt_root;
> +	en->mnt_path.mnt = &r_mnt->mnt;
> +
> +	sb = path->mnt->mnt_sb;
> +	en->sb = sb;
> +	/*
> +	 * Increase the refcount for sb to mark it's being relied on.
> +	 * Note that the reference to path is taken by the caller, so it
> +	 * is safe to assume there is at least single active reference
> +	 * to super as well.
> + */ > + atomic_inc(&sb->s_active); > + > + nmask &= sb->s_etrace.events_cap_mask; > + if (!nmask) > + goto leave; > + > + spin_lock_init(&en->lock); > + INIT_LIST_HEAD(&en->node); > + > + en->notify = nmask; > + memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state)); > + if (nmask & FS_EVENT_THRESH) > + fs_trace_query_data(sb, en); > + > + kref_init(&en->count); > + > + if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) { > + struct fs_trace_entry *prev_en; > + > + prev_en = fs_trace_entry_get_rcu(sb); > + if (prev_en) { > + WARN_ON(prev_en); > + fs_trace_entry_put(prev_en); > + goto leave; > + } > + } > + atomic_set(&en->active, 1); > + > + mutex_lock(&fs_trace_lock); > + list_add(&en->node, &fs_trace_list); > + mutex_unlock(&fs_trace_lock); > + > + rcu_assign_pointer(sb->s_etrace.e_priv, en); > + synchronize_rcu(); > + > + return 0; > +leave: > + deactivate_super(sb); > + kmem_cache_free(fs_trace_cachep, en); > + return -EINVAL; > +} > + > +static int fs_update_trace_entry(struct path *path, > + struct fs_event_thresh *thresh, > + unsigned int nmask) > +{ > + struct fs_trace_entry *en; > + struct super_block *sb; > + int extend = nmask & FS_TRACE_ADD; > + int ret = -EINVAL; > + > + en = fs_trace_entry_get_rcu(path->mnt->mnt_sb); > + if (!en) > + return (extend) ?
fs_new_trace_entry(path, thresh, nmask) > + : -EINVAL; > + > + if (!atomic_read(&en->active)) > + return -EINVAL; > + > + nmask &= ~FS_TRACE_ADD; > + > + spin_lock(&en->lock); > + sb = en->sb; > + if (!sb || !(nmask & sb->s_etrace.events_cap_mask)) > + goto leave; > + > + if (nmask & FS_EVENT_THRESH) { > + if (extend) { > + /* Get the current state */ > + if (!(en->notify & FS_EVENT_THRESH)) > + if (fs_trace_query_data(sb, en)) > + goto leave; > + > + if (thresh->state & THRESH_LR_ON) { > + en->th.lrange = thresh->lrange; > + en->th.state &= ~THRESH_LR_ON; > + } > + > + if (thresh->state & THRESH_UR_ON) { > + en->th.urange = thresh->urange; > + en->th.state &= ~THRESH_UR_ON; > + } > + } else { > + memset(&en->th, 0, sizeof(en->th)); > + } > + } > + > + if (extend) > + en->notify |= nmask; > + else > + en->notify &= ~nmask; > + ret = 0; > +leave: > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > + return ret; > +} > + > +static int fs_parse_trace_request(int argc, char **argv) > +{ > + struct fs_event_thresh thresh = {0}; > + struct path path; > + substring_t args[MAX_OPT_ARGS]; > + unsigned int nmask = FS_TRACE_ADD; > + int token; > + char *s; > + int ret = -EINVAL; > + > + if (!argc) { > + fs_remove_all_traces(); > + return 0; > + } > + > + s = *(argv); > + if (*s == '!') { > + /* Clear the trace entry */ > + nmask &= ~FS_TRACE_ADD; > + ++s; > + } > + > + if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW)) > + return -EINVAL; > + > + if (!(--argc)) { > + if (!(nmask & FS_TRACE_ADD)) > + ret = fs_remove_trace_entry(path.mnt->mnt_sb); > + goto leave; > + } > + > +repeat: > + args[0].to = args[0].from = NULL; > + token = match_token(*(++argv), fs_etypes, args); > + if (!token && !nmask) > + goto leave; > + > + nmask |= token & FS_EVENTS_ALL; > + --argc; > + if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) { > + /* > + * Get the threshold config data: > + * lower range > + * upper range > + */ > + if
(!argc) > + goto leave; > + > + ret = kstrtoull(*(++argv), 10, &thresh.lrange); > + if (ret) > + goto leave; > + thresh.state |= THRESH_LR_ON; > + if ((--argc)) { > + ret = kstrtoull(*(++argv), 10, &thresh.urange); > + if (ret) > + goto leave; > + thresh.state |= THRESH_UR_ON; > + --argc; > + } > + /* The thresholds are based on number of available blocks */ > + if (thresh.lrange < thresh.urange) { > + ret = -EINVAL; > + goto leave; > + } > + } > + if (argc) > + goto repeat; > + > + ret = fs_update_trace_entry(&path, &thresh, nmask); > +leave: > + path_put(&path); > + return ret; > +} > + > +#define DEFAULT_BUF_SIZE PAGE_SIZE > + > +static ssize_t fs_trace_write(struct file *file, const char __user *buffer, > + size_t count, loff_t *ppos) > +{ > + char **argv; > + char *kern_buf, *next, *cfg; > + size_t size, dcount = 0; > + int argc; > + > + if (!count) > + return 0; > + > + kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL); > + if (!kern_buf) > + return -ENOMEM; > + > + while (dcount < count) { > + > + size = count - dcount; > + if (size >= DEFAULT_BUF_SIZE) > + size = DEFAULT_BUF_SIZE - 1; > + if (copy_from_user(kern_buf, buffer + dcount, size)) { > + dcount = -EINVAL; > + goto leave; > + } > + > + kern_buf[size] = '\0'; > + > + next = cfg = kern_buf; > + > + do { > + next = strchr(cfg, ';'); > + if (next) > + *next = '\0'; > + > + argv = argv_split(GFP_KERNEL, cfg, &argc); > + if (!argv) { > + dcount = -ENOMEM; > + goto leave; > + } > + > + if (fs_parse_trace_request(argc, argv)) { > + dcount = -EINVAL; > + argv_free(argv); > + goto leave; > + } > + > + argv_free(argv); > + if (next) > + cfg = ++next; > + > + } while (next); > + dcount += size; > + } > +leave: > + kfree(kern_buf); > + return dcount; > +} > + > +static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos) > +{ > + mutex_lock(&fs_trace_lock); > + return seq_list_start(&fs_trace_list, *pos); > +} > + > +static void *fs_trace_seq_next(struct seq_file
*m, void *v, loff_t *pos) > +{ > + return seq_list_next(v, &fs_trace_list, pos); > +} > + > +static void fs_trace_seq_stop(struct seq_file *m, void *v) > +{ > + mutex_unlock(&fs_trace_lock); > +} > + > +static int fs_trace_seq_show(struct seq_file *m, void *v) > +{ > + struct fs_trace_entry *en; > + struct super_block *sb; > + struct mount *r_mnt; > + const struct match_token *match; > + unsigned int nmask; > + > + en = list_entry(v, struct fs_trace_entry, node); > + /* Do not show the entries outside current mount namespace */ > + r_mnt = real_mount(en->mnt_path.mnt); > + if (r_mnt->mnt_ns != current->nsproxy->mnt_ns) { > + if (!__is_local_mountpoint(r_mnt->mnt_mountpoint)) > + return 0; > + } > + > + sb = en->sb; > + > + seq_path(m, &en->mnt_path, "\t\n\\"); > + seq_putc(m, ' '); > + > + seq_escape(m, sb->s_type->name, " \t\n\\"); > + if (sb->s_subtype && sb->s_subtype[0]) { > + seq_putc(m, '.'); > + seq_escape(m, sb->s_subtype, " \t\n\\"); > + } > + > + seq_putc(m, ' '); > + if (sb->s_op->show_devname) { > + sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root); > + } else { > + seq_escape(m, r_mnt->mnt_devname ?
r_mnt->mnt_devname : "none", > + " \t\n\\"); > + } > + seq_puts(m, " ("); > + > + nmask = en->notify; > + for (match = fs_etypes; match->pattern; ++match) { > + if (match->token & nmask) { > + seq_puts(m, match->pattern); > + nmask &= ~match->token; > + if (nmask) > + seq_putc(m, ','); > + } > + } > + seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); > + seq_puts(m, ")\n"); > + return 0; > +} > + > +static const struct seq_operations fs_trace_seq_ops = { > + .start = fs_trace_seq_start, > + .next = fs_trace_seq_next, > + .stop = fs_trace_seq_stop, > + .show = fs_trace_seq_show, > +}; > + > +static int fs_trace_open(struct inode *inode, struct file *file) > +{ > + return seq_open(file, &fs_trace_seq_ops); > +} > + > +static const struct file_operations fs_trace_fops = { > + .owner = THIS_MODULE, > + .open = fs_trace_open, > + .write = fs_trace_write, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = seq_release, > +}; > + > +static int fs_trace_init(void) > +{ > + fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0); > + if (!fs_trace_cachep) > + return -EINVAL; > + init_waitqueue_head(&trace_wq); > + return 0; > +} > + > +/* VFS support */ > +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen) > +{ > + int ret; > + static struct tree_descr desc[] = { > + [2] = { > + .name = "config", > + .ops = &fs_trace_fops, > + .mode = S_IWUSR | S_IRUGO, > + }, > + {""}, > + }; > + > + ret = simple_fill_super(sb, 0x7246332, desc); > + return !ret ?
fs_trace_init() : ret; > +} > + > +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, > + int ntype, const char *dev_name, void *data) > +{ > + return mount_single(fs_type, ntype, data, fs_trace_fill_super); > +} > + > +static void fs_trace_kill_super(struct super_block *sb) > +{ > + /* > + * The rcu_barrier here will/should make sure all call_rcu > + * callbacks are completed - still there might be some active > + * trace objects in use which can make calling the > + * kmem_cache_destroy unsafe. So we wait until all traces > + * are finally released. > + */ > + fs_remove_all_traces(); > + rcu_barrier(); > + wait_event(trace_wq, !atomic_read(&stray_traces)); > + > + kmem_cache_destroy(fs_trace_cachep); > + kill_litter_super(sb); > +} > + > +static struct kset *fs_trace_kset; > + > +static struct file_system_type fs_trace_fstype = { > + .name = "fstrace", > + .mount = fs_trace_do_mount, > + .kill_sb = fs_trace_kill_super, > +}; > + > +static void __init fs_trace_vfs_init(void) > +{ > + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); > + > + if (!fs_trace_kset) > + return; > + > + if (!register_filesystem(&fs_trace_fstype)) { > + if (!fs_event_netlink_register()) > + return; > + unregister_filesystem(&fs_trace_fstype); > + } > + kset_unregister(fs_trace_kset); > +} > + > +static int __init fs_trace_evens_init(void) > +{ > + fs_trace_vfs_init(); > + return 0; > +}; > +module_init(fs_trace_evens_init); > + > diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h > new file mode 100644 > index 0000000..23f24c8 > --- /dev/null > +++ b/fs/events/fs_event.h > @@ -0,0 +1,22 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING.
> + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > + > +#ifndef __GENERIC_FS_EVENTS_H > +#define __GENERIC_FS_EVENTS_H > + > +int fs_event_netlink_register(void); > +void fs_event_netlink_unregister(void); > + > +#endif /* __GENERIC_FS_EVENTS_H */ > diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c > new file mode 100644 > index 0000000..0c97eb7 > --- /dev/null > +++ b/fs/events/fs_event_netlink.c > @@ -0,0 +1,104 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details.
> + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "fs_event.h" > + > +static const struct genl_multicast_group fs_event_mcgroups[] = { > + { .name = FS_EVENTS_MCAST_GRP_NAME, }, > +}; > + > +static struct genl_family fs_event_family = { > + .id = GENL_ID_GENERATE, > + .name = FS_EVENTS_FAMILY_NAME, > + .version = 1, > + .maxattr = FS_NL_A_MAX, > + .mcgrps = fs_event_mcgroups, > + .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups), > +}; > + > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), > + void *cbdata) > +{ > + static atomic_t seq; > + struct sk_buff *skb; > + void *msg_head; > + int ret = 0; > + > + if (!size || !compose_msg) > + return -EINVAL; > + > + /* Skip if there are no listeners */ > + if (!genl_has_listeners(&fs_event_family, &init_net, 0)) > + return 0; > + > + if (event_id != FS_EVENT_NONE) > + size += nla_total_size(sizeof(u32)); > + size += nla_total_size(sizeof(u64)); > + skb = genlmsg_new(size, GFP_NOWAIT); > + > + if (!skb) { > + pr_debug("Failed to allocate new FS generic netlink message\n"); > + return -ENOMEM; > + } > + > + msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq), > + &fs_event_family, 0, FS_NL_C_EVENT); > + if (!msg_head) > + goto cleanup; > + > + if (event_id != FS_EVENT_NONE) > + if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id)) > + goto cancel; > + > + ret = compose_msg(skb, cbdata); > + if (ret) > + goto cancel; > + > + genlmsg_end(skb, msg_head); > + ret = genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT); > + if (ret && ret != -ENOBUFS && ret != -ESRCH) > + goto cleanup; > + > + return ret; > + > +cancel: > + genlmsg_cancel(skb, msg_head); > +cleanup: > + nlmsg_free(skb); > + return ret; > +} > +EXPORT_SYMBOL(fs_netlink_send_event); > + > +int fs_event_netlink_register(void) > +{ > + int ret; > + > + ret = genl_register_family(&fs_event_family); > + if
(ret) > + pr_err("Failed to register FS netlink interface\n"); > + return ret; > +} > + > +void fs_event_netlink_unregister(void) > +{ > + genl_unregister_family(&fs_event_family); > +} > diff --git a/fs/namespace.c b/fs/namespace.c > index 82ef140..ec6e2ef 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt) > if (unlikely(mnt->mnt_pins.first)) > mnt_pin_kill(mnt); > fsnotify_vfsmount_delete(&mnt->mnt); > + fs_event_mount_dropped(&mnt->mnt); > dput(mnt->mnt.mnt_root); > deactivate_super(mnt->mnt.mnt_sb); > mnt_free_id(mnt); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index b4d71b5..b7dadd9 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -263,6 +263,10 @@ struct iattr { > * Includes for diskquotas. > */ > #include > +/* > + * Include for Generic File System Events Interface > + */ > +#include > > /* > * Maximum number of layers of fs stack. Needs to be limited to > @@ -1253,7 +1257,7 @@ struct super_block { > struct hlist_node s_instances; > unsigned int s_quota_types; /* Bitmask of supported quota types */ > struct quota_info s_dquot; /* Diskquota specific options */ > - > + struct fs_trace_info s_etrace; > struct sb_writers s_writers; > > char s_id[32]; /* Informational name */ > diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h > new file mode 100644 > index 0000000..83e22dd > --- /dev/null > +++ b/include/linux/fs_event.h > @@ -0,0 +1,72 @@ > +/* > + * Generic File System Events Interface > + * > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING.
> + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > +#ifndef _LINUX_GENERIC_FS_EVETS_ > +#define _LINUX_GENERIC_FS_EVETS_ > +#include > +#include > + > +/* > + * Currently supported event types > + */ > +#define FS_EVENT_GENERIC 0x001 > +#define FS_EVENT_THRESH 0x002 > + > +#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH) > + > +struct fs_trace_operations { > + void (*query)(struct super_block *, u64 *); > +}; > + > +struct fs_trace_info { > + void __rcu *e_priv; /* READ ONLY */ > + unsigned int events_cap_mask; /* Supported notifications */ > + const struct fs_trace_operations *ops; > +}; > + > +#ifdef CONFIG_FS_EVENTS > + > +void fs_event_notify(struct super_block *sb, unsigned int event_id); > +void fs_event_alloc_space(struct super_block *sb, u64 ncount); > +void fs_event_free_space(struct super_block *sb, u64 ncount); > +void fs_event_mount_dropped(struct vfsmount *mnt); > + > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), > + void *cbdata); > + > +#else /* CONFIG_FS_EVENTS */ > + > +static inline > +void fs_event_notify(struct super_block *sb, unsigned int event_id) {}; > +static inline > +void fs_event_alloc_space(struct super_block *sb, u64 ncount) {}; > +static inline > +void fs_event_free_space(struct super_block *sb, u64 ncount) {}; > +static inline > +void fs_event_mount_dropped(struct vfsmount *mnt) {}; > + > +static inline > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msig)(struct sk_buff *skb, void *data), > + void *cbdata) > +{ > + return -ENOSYS; > +} > +#endif /* CONFIG_FS_EVENTS */ > + > +#endif /* _LINUX_GENERIC_FS_EVENTS_ */ > + > diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild > index
68ceb97..dae0fab 100644 > --- a/include/uapi/linux/Kbuild > +++ b/include/uapi/linux/Kbuild > @@ -129,6 +129,7 @@ header-y += firewire-constants.h > header-y += flat.h > header-y += fou.h > header-y += fs.h > +header-y += fs_event.h > header-y += fsl_hypervisor.h > header-y += fuse.h > header-y += futex.h > diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h > new file mode 100644 > index 0000000..d8b07da > --- /dev/null > +++ b/include/uapi/linux/fs_event.h > @@ -0,0 +1,58 @@ > +/* > + * Generic netlink support for Generic File System Events Interface > + * > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details.
> + */ > +#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_ > +#define _UAPI_LINUX_GENERIC_FS_EVENTS_ > + > +#define FS_EVENTS_FAMILY_NAME "fs_event" > +#define FS_EVENTS_MCAST_GRP_NAME "fs_event_mc_grp" > + > +/* > + * Generic netlink attribute types > + */ > +enum { > + FS_NL_A_NONE, > + FS_NL_A_EVENT_ID, > + FS_NL_A_DEV_MAJOR, > + FS_NL_A_DEV_MINOR, > + FS_NL_A_CAUSED_ID, > + FS_NL_A_DATA, > + __FS_NL_A_MAX, > +}; > +#define FS_NL_A_MAX (__FS_NL_A_MAX - 1) > +/* > + * Generic netlink commands > + */ > +#define FS_NL_C_EVENT 1 > + > +/* > + * Supported set of FS events > + */ > +enum { > + FS_EVENT_NONE, > + FS_WARN_ENOSPC, /* No space left to reserve data blks */ > + FS_WARN_ENOSPC_META, /* No space left for metadata */ > + FS_THR_LRBELOW, /* The threshold lower range has been reached */ > + FS_THR_LRABOVE, /* The threshold lower range re-activcated*/ > + FS_THR_URBELOW, > + FS_THR_URABOVE, > + FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */ > + FS_ERR_CORRUPTED /* Critical error - fs corrupted */ > + > +}; > + > +#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */ > + > -- > 1.7.9.5
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <558ACD3A.2020508@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <87oak5ebmx.fsf@openvz.org> <558ACD3A.2020508@samsung.com> From: Steve French Date: Wed, 24 Jun 2015 11:26:27 -0500 Message-ID: Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org List-ID: To: Beata Michalska Cc: Dmitry Monakhov , LKML , linux-fsdevel , "linux-api@vger.kernel.org" , Greg Kroah-Hartman , Jan Kara , Theodore Ts'o , adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, Christoph Hellwig , "linux-ext4@vger.kernel.org" , linux-mm , kyungmin.park@samsung.com, kmpark@infradead.org On Wed, Jun 24, 2015 at 10:31 AM, Beata Michalska wrote: > On 06/24/2015 10:47 AM, Dmitry Monakhov wrote: >> Beata Michalska writes: >> >>> Introduce configurable generic interface for file >>> system-wide event notifications, to provide file >>> systems with a common way of reporting any potential >>> issues as they emerge.
>>> >>> The notifications are to be issued through generic >>> netlink interface by newly introduced multicast group. >>> >>> Threshold notifications have been included, allowing >>> triggering an event whenever the amount of free space drops >>> below a certain level - or levels to be more precise as two >>> of them are being supported: the lower and the upper range. >>> The notifications work both ways: once the threshold level >>> has been reached, an event shall be generated whenever >>> the number of available blocks goes up again re-activating >>> the threshold. >>> >>> The interface has been exposed through a vfs. Once mounted, >>> it serves as an entry point for the set-up where one can >>> register for particular file system events. >>> >>> Signed-off-by: Beata Michalska >>> --- >>> Documentation/filesystems/events.txt | 232 ++++++++++ >>> fs/Kconfig | 2 + >>> fs/Makefile | 1 + >>> fs/events/Kconfig | 7 + >>> fs/events/Makefile | 5 + >>> fs/events/fs_event.c | 809 ++++++++++++++++++++++++++++++++++ >>> fs/events/fs_event.h | 22 + >>> fs/events/fs_event_netlink.c | 104 +++++ >>> fs/namespace.c | 1 + >>> include/linux/fs.h | 6 +- >>> include/linux/fs_event.h | 72 +++ >>> include/uapi/linux/Kbuild | 1 + >>> include/uapi/linux/fs_event.h | 58 +++ >>> 13 files changed, 1319 insertions(+), 1 deletion(-) >>> create mode 100644 Documentation/filesystems/events.txt >>> create mode 100644 fs/events/Kconfig >>> create mode 100644 fs/events/Makefile >>> create mode 100644 fs/events/fs_event.c >>> create mode 100644 fs/events/fs_event.h >>> create mode 100644 fs/events/fs_event_netlink.c >>> create mode 100644 include/linux/fs_event.h >>> create mode 100644 include/uapi/linux/fs_event.h >>> >>> diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt >>> new file mode 100644 >>> index 0000000..c2e6227 >>> --- /dev/null >>> +++ b/Documentation/filesystems/events.txt >>> @@ -0,0 +1,232 @@ >>> + >>> + Generic file system event notification 
interface >>> + >>> +Document created 23 April 2015 by Beata Michalska >>> + >>> +1. The reason behind: >>> +===================== >>> + >>> +There are many corner cases when things might get messy with the filesystems. >>> +And it is not always obvious what went wrong and when. Sometimes you might >>> +get some subtle hints that there is something going on - but by the time >>> +you realise it, it might be too late as you are already out-of-space >>> +or the filesystem has been remounted as read-only (for example). The generic >>> +interface for the filesystem events fills the gap by providing a rather >>> +easy way of real-time notifications triggered whenever something interesting >>> +happens, allowing filesystems to report events in a common way, as they occur. >>> + >>> +2. How does it work: >>> +==================== >>> + >>> +The interface itself has been exposed as an fstrace-type Virtual File System, >>> +primarily to ease the process of setting up the configuration for the >>> +notifications. So for starters, it needs to get mounted (obviously): >>> + >>> + mount -t fstrace none /sys/fs/events >>> + >>> +This will unveil the single fstrace filesystem entry - the 'config' file, >>> +through which the notifications are set up. >>> + >>> +Activating notifications for a particular filesystem is as straightforward >>> +as writing into the 'config' file. Note that by default all events, regardless >>> +of the actual filesystem type, are being disregarded.
>>> + >>> +Synopsis of config: >>> +------------------ >>> + >>> + MOUNT EVENT_TYPE [L1] [L2] >>> + >>> + MOUNT : the filesystem's mount point >>> + EVENT_TYPE : event types - two of them are currently supported: >>> + >>> + * generic events ("G") covering most common warnings >>> + and errors that might be reported by any filesystem; >>> + this option does not take any arguments; >>> + >>> + * threshold notifications ("T") - events sent whenever >>> + the amount of available space drops below a certain level; >>> + it is possible to specify two threshold levels though >>> + only one is required to properly set up the notifications; >>> + as those refer to the number of available blocks, the lower >>> + level [L1] needs to be higher than the upper one [L2] >>> + >>> +Sample request could look like the following: >>> + >>> + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config >>> + >>> +Multiple requests might be specified provided they are separated with semicolons. >>> + >>> +The configuration itself might be modified at any time. One can add/remove >>> +particular event types for a given filesystem, modify the threshold levels, >>> +and remove single or all entries from the 'config' file. >>> + >>> + - Adding new event type: >>> + >>> + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config >>> + >>> +(Note that it is enough to provide the event type to be enabled without >>> +the already set ones.) >>> + >>> + - Removing event type: >>> + >>> + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config >>> + >>> + - Updating threshold limits: >>> + >>> + $ echo MOUNT T L1 L2 > /sys/fs/events/config >>> + >>> + - Removing single entry: >>> + >>> + $ echo '!MOUNT' > /sys/fs/events/config >>> + >>> + - Removing all entries: >>> + >>> + $ echo > /sys/fs/events/config >>> + >>> +Reading the file will list all registered entries with their current set-up >>> +along with some additional info like the filesystem type and the backing device >>> +name if available.
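For readers following the threshold rules quoted above (L1 is the larger block count, L2 the smaller one, and each level notifies once on the way down and once on the way back up), here is a minimal userspace sketch of that behaviour. It is only a model under stated assumptions, not the patch's code: the names thr_model, thr_init, thr_alloc, thr_free and the THR_* event codes are hypothetical, and the kernel side keeps its state in THRESH_* flags under a spinlock instead.

```c
#include <stdint.h>

/* Hypothetical event codes for this sketch; the patch uses FS_THR_* values. */
enum thr_event {
	THR_NONE,
	THR_LOWER_BELOW,	/* available space fell to L1 or below */
	THR_LOWER_ABOVE,	/* available space rose back above L1 */
	THR_UPPER_BELOW,	/* available space fell to L2 or below */
	THR_UPPER_ABOVE,	/* available space rose back above L2 */
};

struct thr_model {
	uint64_t avail;		/* currently available blocks */
	uint64_t l1;		/* lower range [L1]: the larger block count */
	uint64_t l2;		/* upper range [L2]: the smaller block count */
	int l1_fired;		/* L1 notification sent, not yet re-armed */
	int l2_fired;		/* L2 notification sent, not yet re-armed */
};

/* Reject configurations where L1 < L2, as the documentation requires. */
int thr_init(struct thr_model *m, uint64_t avail, uint64_t l1, uint64_t l2)
{
	if (l1 < l2)
		return -1;
	m->avail = avail;
	m->l1 = l1;
	m->l2 = l2;
	m->l1_fired = 0;
	m->l2_fired = 0;
	return 0;
}

/* Account for 'count' blocks being allocated; report at most one event. */
enum thr_event thr_alloc(struct thr_model *m, uint64_t count)
{
	m->avail = (count > m->avail) ? 0 : m->avail - count;

	if (m->avail > m->l1)
		return THR_NONE;
	if (m->avail > m->l2) {
		if (m->l1_fired)
			return THR_NONE;
		m->l1_fired = 1;
		return THR_LOWER_BELOW;
	}
	if (m->l2_fired)
		return THR_NONE;
	m->l2_fired = 1;
	return THR_UPPER_BELOW;
}

/* Account for 'count' blocks being freed; report at most one event. */
enum thr_event thr_free(struct thr_model *m, uint64_t count)
{
	m->avail += count;

	if (m->avail > m->l1 && m->l1_fired) {
		m->l1_fired = 0;
		m->l2_fired = 0;
		return THR_LOWER_ABOVE;
	}
	if (m->avail > m->l2 && m->l2_fired) {
		m->l2_fired = 0;
		return THR_UPPER_ABOVE;
	}
	return THR_NONE;
}
```

With the sample request from the documentation (T 710000 500000), a filesystem starting at 800000 available blocks would report the lower range once it drops to 710000 or less, the upper range at 500000 or less, and the matching re-activation events as space is freed again.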
>>> + >>> +Final, though a very important note on the configuration: when and if the >>> +actual events are being triggered falls way beyond the scope of the generic >>> +filesystem events interface. It is up to a particular filesystem >>> +implementation which events are to be supported - if any at all. So if >>> +a given filesystem does not support the event notifications, an attempt to >>> +enable those through 'config' file will fail. >>> + >>> + >>> +3. The generic netlink interface support: >>> +========================================= >>> + >>> +Whenever an event notification is triggered (by given filesystem) the current >>> +configuration is being validated to decide whether a userspace notification >>> +should be launched. If there has been no request (by means of a 'config' file >>> +entry) for a given event, it will be silently disregarded. If, on the other >>> +hand, someone is 'watching' given filesystem for specific events, a generic >>> +netlink message will be sent. A dedicated multicast group has been provided >>> +solely for this purpose so in order to receive such notifications, one should >>> +subscribe to this new multicast group. For now only the init network >>> +namespace is being supported. >>> + >>> +3.1 Message format >>> + >>> +The FS_NL_C_EVENT shall be stored within the generic netlink message header >>> +as the command field. The message payload will provide more detailed info: >>> +the backing device major and minor numbers, the event code and the id of >>> +the process whose action led to the event occurrence. In case of threshold >>> +notifications, the current number of available blocks will be included >>> +in the payload as well.
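To make the payload description above concrete, a listener could walk the attributes roughly as in the sketch below. This is a hedged illustration, not part of the patch: attr_hdr, fs_event_info and fs_event_parse are hypothetical names, attr_hdr simply mirrors the usual netlink attribute header layout (16-bit length including the header, 16-bit type, payload padded to 4 bytes), the buffer is assumed to start right after the netlink and generic netlink headers, and FS_NL_A_DATA is assumed to carry a u64 block count. Values are taken in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attribute ids, in the order of the enum in the quoted uapi header. */
enum {
	FS_NL_A_NONE,
	FS_NL_A_EVENT_ID,
	FS_NL_A_DEV_MAJOR,
	FS_NL_A_DEV_MINOR,
	FS_NL_A_CAUSED_ID,
	FS_NL_A_DATA,
};

/* Hypothetical mirror of the netlink attribute header (struct nlattr). */
struct attr_hdr {
	uint16_t len;	/* header plus payload, excluding padding */
	uint16_t type;
};

#define ATTR_ALIGN(n)	(((n) + 3u) & ~3u)	/* attributes are 4-byte aligned */
#define ATTR_TYPE_MASK	0x3fffu			/* top two bits are flags */

struct fs_event_info {
	uint32_t event_id;
	uint32_t dev_major;
	uint32_t dev_minor;
	uint64_t data;		/* assumption: u64 payload of FS_NL_A_DATA */
	unsigned int seen;	/* bit N set when attribute id N was found */
};

/* Copy a u32 payload, rejecting attributes of any other size. */
static int get_u32(const unsigned char *p, size_t plen, uint32_t *out)
{
	if (plen != sizeof(*out))
		return -1;
	memcpy(out, p, sizeof(*out));
	return 0;
}

/* Walk the attributes in buf[0..len); return 0 on success, -1 if malformed. */
int fs_event_parse(const unsigned char *buf, size_t len,
		   struct fs_event_info *info)
{
	size_t off = 0;

	memset(info, 0, sizeof(*info));
	while (off < len) {
		struct attr_hdr h;
		const unsigned char *p;
		size_t plen;
		unsigned int type;
		int err = 0;

		if (len - off < sizeof(h))
			return -1;
		memcpy(&h, buf + off, sizeof(h));
		if (h.len < sizeof(h) || h.len > len - off)
			return -1;

		p = buf + off + sizeof(h);
		plen = h.len - sizeof(h);
		type = h.type & ATTR_TYPE_MASK;

		switch (type) {
		case FS_NL_A_EVENT_ID:
			err = get_u32(p, plen, &info->event_id);
			break;
		case FS_NL_A_DEV_MAJOR:
			err = get_u32(p, plen, &info->dev_major);
			break;
		case FS_NL_A_DEV_MINOR:
			err = get_u32(p, plen, &info->dev_minor);
			break;
		case FS_NL_A_DATA:
			if (plen != sizeof(info->data))
				return -1;
			memcpy(&info->data, p, sizeof(info->data));
			break;
		default:
			break;	/* skip attributes this sketch does not decode */
		}
		if (err)
			return -1;
		if (type < 32)
			info->seen |= 1u << type;
		off += ATTR_ALIGN((size_t)h.len);
	}
	return 0;
}
```

A real listener would more likely rely on a netlink library for the socket setup, multicast group subscription and attribute validation; the hand-rolled walk is only meant to show how the fields named in the message format map onto the wire.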
>>> + >>> + >>> + 0 1 2 3 >>> + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 >>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >>> + | NETLINK MESSAGE HEADER | >>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >>> + | GENERIC NETLINK MESSAGE HEADER | >>> + | (with FS_NL_C_EVENT as genlmsghdr cmd field) | >>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >>> + | Optional user specific message header | >>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >>> + | GENERIC MESSAGE PAYLOAD: | >>> + +---------------------------------------------------------------+ >>> + | FS_NL_A_EVENT_ID (NLA_U32) | >>> + +---------------------------------------------------------------+ >>> + | FS_NL_A_DEV_MAJOR (NLA_U32) | >>> + +---------------------------------------------------------------+ >>> + | FS_NL_A_DEV_MINOR (NLA_U32) | >> > ... > >>> + >>> +static int create_common_msg(struct sk_buff *skb, void *data) >>> +{ >>> + struct fs_trace_entry *en = (struct fs_trace_entry *)data; >>> + struct super_block *sb = en->sb; >>> + >>> + if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev)) >>> + || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev))) >>> + return -EINVAL; >> What about diskless(nfs,cifs,etc) filesystem? btrfs also has no >> valid sb->s_dev Note that filesystem notifications, and also file/directory change notifications, are particularly useful in the case of a network file system (and heavily used by Windows desktop, Mac etc.): when a file is shared, a user may not necessarily know that a file (or the file system as a whole) changed via another client (or on the server, or on the server via a different protocol, e.g. SMB3 vs NFSv4), but is more likely to know about local changes to the same file. In some sense, users of mounts on network file systems get more benefit from notifications than users of a mount on a local file system would.
--
Thanks,

Steve

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Beata Michalska
To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org
Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org
Subject: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Tue, 16 Jun 2015 15:09:30 +0200
Message-id: <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
In-reply-to: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>

Introduce configurable generic interface for file system-wide event
notifications, to provide file systems with a common way of reporting
any potential issues as they emerge.

The notifications are to be issued through generic netlink interface
by newly introduced multicast group.

Threshold notifications have been included, allowing triggering an event
whenever the amount of free space drops below a certain level - or levels
to be more precise as two of them are being supported: the lower and
the upper range. The notifications work both ways: once the threshold
level has been reached, an event shall be generated whenever the number
of available blocks goes up again re-activating the threshold.

The interface has been exposed through a vfs. Once mounted, it serves
as an entry point for the set-up where one can register for particular
file system events.
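
The two-way threshold behaviour described above can be sketched in user
space. This is only a minimal illustration, not the kernel code from
this series: the struct, the helper names (thresh_init, thresh_alloc,
thresh_free) and the event codes are hypothetical, and the per-level
"already notified" flags are a simplification of the state bits used in
fs_event.c. As in the patch, the first level (L1) is numerically the
larger of the two, because both refer to the number of available blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event codes, for this sketch only. */
enum thr_event {
	THR_NONE,
	THR_LR_BELOW,	/* dropped to or below the first level (L1) */
	THR_UR_BELOW,	/* dropped to or below the second level (L2) */
	THR_LR_ABOVE,	/* climbed back above L1 */
	THR_UR_ABOVE,	/* climbed back above L2 */
};

struct thresh {
	uint64_t avail;		/* currently available blocks */
	uint64_t lrange;	/* L1: must be >= L2 */
	uint64_t urange;	/* L2 */
	int lr_hit;		/* L1 notification sent, not yet re-armed */
	int ur_hit;		/* L2 notification sent, not yet re-armed */
};

static void thresh_init(struct thresh *t, uint64_t avail,
			uint64_t lrange, uint64_t urange)
{
	/* Same constraint the config parser enforces: L1 >= L2. */
	assert(lrange >= urange);
	t->avail = avail;
	t->lrange = lrange;
	t->urange = urange;
	t->lr_hit = 0;
	t->ur_hit = 0;
}

/* Blocks were allocated: available space shrinks. */
static enum thr_event thresh_alloc(struct thresh *t, uint64_t n)
{
	t->avail = n > t->avail ? 0 : t->avail - n;

	if (t->avail > t->lrange)
		return THR_NONE;
	if (t->avail > t->urange) {
		if (t->lr_hit)
			return THR_NONE;	/* already reported */
		t->lr_hit = 1;
		return THR_LR_BELOW;
	}
	if (t->ur_hit)
		return THR_NONE;		/* already reported */
	t->ur_hit = 1;
	return THR_UR_BELOW;
}

/* Blocks were freed: available space grows and may re-arm a level. */
static enum thr_event thresh_free(struct thresh *t, uint64_t n)
{
	t->avail += n;

	if (t->avail > t->lrange && t->lr_hit) {
		t->lr_hit = 0;
		t->ur_hit = 0;
		return THR_LR_ABOVE;
	}
	if (t->avail > t->urange && t->ur_hit) {
		t->ur_hit = 0;
		return THR_UR_ABOVE;
	}
	return THR_NONE;
}
```

Each level reports once on the way down and once on the way back up, so
a workload hovering around a level produces one notification per
crossing rather than one per allocation.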
Signed-off-by: Beata Michalska
---
 Documentation/filesystems/events.txt |  232 ++++++++++
 fs/Kconfig                           |    2 +
 fs/Makefile                          |    1 +
 fs/events/Kconfig                    |    7 +
 fs/events/Makefile                   |    5 +
 fs/events/fs_event.c                 |  809 ++++++++++++++++++++++++++++++++++
 fs/events/fs_event.h                 |   22 +
 fs/events/fs_event_netlink.c         |  104 +++++
 fs/namespace.c                       |    1 +
 include/linux/fs.h                   |    6 +-
 include/linux/fs_event.h             |   72 +++
 include/uapi/linux/Kbuild            |    1 +
 include/uapi/linux/fs_event.h        |   58 +++
 13 files changed, 1319 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/filesystems/events.txt
 create mode 100644 fs/events/Kconfig
 create mode 100644 fs/events/Makefile
 create mode 100644 fs/events/fs_event.c
 create mode 100644 fs/events/fs_event.h
 create mode 100644 fs/events/fs_event_netlink.c
 create mode 100644 include/linux/fs_event.h
 create mode 100644 include/uapi/linux/fs_event.h

diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt
new file mode 100644
index 0000000..c2e6227
--- /dev/null
+++ b/Documentation/filesystems/events.txt
@@ -0,0 +1,232 @@
+
+	Generic file system event notification interface
+
+Document created 23 April 2015 by Beata Michalska
+
+1. The reason behind:
+=====================
+
+There are many corner cases when things might get messy with the filesystems.
+And it is not always obvious what went wrong and when. Sometimes you might
+get some subtle hints that there is something going on - but by the time
+you realise it, it might be too late as you are already out-of-space
+or the filesystem has been remounted as read-only (for example). The generic
+interface for the filesystem events fills the gap by providing a rather
+easy way of real-time notifications triggered whenever something interesting
+happens, allowing filesystems to report events in a common way, as they occur.
+
+2. How does it work:
+====================
+
+The interface itself has been exposed as fstrace-type Virtual File System,
+primarily to ease the process of setting up the configuration for the
+notifications. So for starters, it needs to get mounted (obviously):
+
+	mount -t fstrace none /sys/fs/events
+
+This will unveil the single fstrace filesystem entry - the 'config' file,
+through which the notifications are being set up.
+
+Activating notifications for a particular filesystem is as straightforward
+as writing into the 'config' file. Note that by default all events, regardless
+of the actual filesystem type, are being disregarded.
+
+Synopsis of config:
+-------------------
+
+	MOUNT EVENT_TYPE [L1] [L2]
+
+ MOUNT      : the filesystem's mount point
+ EVENT_TYPE : event types - currently two of them are being supported:
+
+	      * generic events ("G") covering most common warnings
+	      and errors that might be reported by any filesystem;
+	      this option does not take any arguments;
+
+	      * threshold notifications ("T") - events sent whenever
+	      the amount of available space drops below certain level;
+	      it is possible to specify two threshold levels though
+	      only one is required to properly set up the notifications;
+	      as those refer to the number of available blocks, the lower
+	      level [L1] needs to be higher than the upper one [L2]
+
+Sample request could look like the following:
+
+ echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
+
+Multiple requests might be specified provided they are separated with semicolon.
+
+The configuration itself might be modified at any time. One can add/remove
+particular event types for given filesystem, modify the threshold levels,
+and remove single or all entries from the 'config' file.
+
+ - Adding new event type:
+
+ $ echo MOUNT EVENT_TYPE > /sys/fs/events/config
+
+(Note that it is enough to provide the event type to be enabled without
+the already set ones.)
+
+ - Removing event type:
+
+ $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config
+
+ - Updating threshold limits:
+
+ $ echo MOUNT T L1 L2 > /sys/fs/events/config
+
+ - Removing single entry:
+
+ $ echo '!MOUNT' > /sys/fs/events/config
+
+ - Removing all entries:
+
+ $ echo > /sys/fs/events/config
+
+Reading the file will list all registered entries with their current set-up
+along with some additional info like the filesystem type and the backing device
+name if available.
+
+Final, though a very important note on the configuration: when and if the
+actual events are being triggered falls way beyond the scope of the generic
+filesystem events interface. It is up to a particular filesystem
+implementation which events are to be supported - if any at all. So if
+given filesystem does not support the event notifications, an attempt to
+enable those through 'config' file will fail.
+
+
+3. The generic netlink interface support:
+=========================================
+
+Whenever an event notification is triggered (by given filesystem) the current
+configuration is being validated to decide whether a userspace notification
+should be launched. If there has been no request (by means of a 'config' file
+entry) for given event, it will be silently disregarded. If, on the other
+hand, someone is 'watching' given filesystem for specific events, a generic
+netlink message will be sent. A dedicated multicast group has been provided
+solely for this purpose so in order to receive such notifications, one should
+subscribe to this new multicast group. As for now only the init network
+namespace is being supported.
+
+3.1 Message format
+
+The FS_NL_C_EVENT shall be stored within the generic netlink message header
+as the command field. The message payload will provide more detailed info:
+the backing device major and minor numbers, the event code and the id of
+the process whose action led to the event occurrence. In case of threshold
+notifications, the current number of available blocks will be included
+in the payload as well.
+
+
+  0                   1                   2                   3
+  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |                    NETLINK MESSAGE HEADER                     |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |                GENERIC NETLINK MESSAGE HEADER                 |
+ |         (with FS_NL_C_EVENT as genlmsghdr cmd field)          |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |             Optional user specific message header             |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ |                   GENERIC MESSAGE PAYLOAD:                    |
+ +---------------------------------------------------------------+
+ |                  FS_NL_A_EVENT_ID (NLA_U32)                   |
+ +---------------------------------------------------------------+
+ |                  FS_NL_A_DEV_MAJOR (NLA_U32)                  |
+ +---------------------------------------------------------------+
+ |                  FS_NL_A_DEV_MINOR (NLA_U32)                  |
+ +---------------------------------------------------------------+
+ |                  FS_NL_A_CAUSED_ID (NLA_U32)                  |
+ +---------------------------------------------------------------+
+ |                    FS_NL_A_DATA (NLA_U64)                     |
+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+
+The above figure is based on:
+ http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format
+
+
+4. API Reference:
+=================
+
+ 4.1 Generic file system event interface data & operations
+
+ #include
+
+ struct fs_trace_info {
+	void __rcu *e_priv;			/* READ ONLY */
+	unsigned int events_cap_mask;		/* Supported notifications */
+	const struct fs_trace_operations *ops;
+ };
+
+ struct fs_trace_operations {
+	void (*query)(struct super_block *, u64 *);
+ };
+
+ In order to get the fireworks and stuff, each filesystem needs to set up
+ the events_cap_mask field of the fs_trace_info structure, which has been
+ embedded within the super_block structure. This should reflect the type of
+ events the filesystem wants to support. In case of threshold notifications,
+ apart from setting the FS_EVENT_THRESH flag, the 'query' callback should
+ be provided as this enables the events interface to get the up-to-date
+ state of the number of available blocks whenever those notifications are
+ being requested.
+
+ The 'e_priv' field of the fs_trace_info structure should be completely ignored
+ as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you
+ do not want to get yourself into some real trouble. If still, you are tempted
+ to do so - feel free, it's gonna be pure fun. Consider yourself warned.
+
+
+ 4.2 Event notification:
+
+ #include
+ void fs_event_notify(struct super_block *sb, unsigned int event_id);
+
+ Notify the generic FS event interface of an occurring event.
+ This shall be used by any file system that wishes to inform any potential
+ listeners/watchers of a particular event.
+ - sb: the filesystem's super block
+ - event_id: an event identifier
+
+ 4.3 Threshold notifications:
+
+ #include
+ void fs_event_alloc_space(struct super_block *sb, u64 ncount);
+ void fs_event_free_space(struct super_block *sb, u64 ncount);
+
+ Each filesystem supporting the threshold notifications should call
+ fs_event_alloc_space/fs_event_free_space respectively whenever the
+ amount of available blocks changes.
+ - sb: the filesystem's super block
+ - ncount: number of blocks being acquired/released
+
+ Note that to properly handle the threshold notifications the fs events
+ interface needs to be kept up to date by the filesystems. Each should
+ register fs_trace_operations to enable querying the current number of
+ available blocks.
+
+ 4.4 Sending message through generic netlink interface
+
+ #include
+
+ int fs_netlink_send_event(size_t size, unsigned int event_id,
+	int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata);
+
+ Although the fs event interface is fully responsible for sending the messages
+ over the netlink, filesystems might use the FS_EVENT multicast group to send
+ their own custom messages.
+ - size: the size of the message payload
+ - event_id: the event identifier
+ - compose_msg: a callback responsible for filling-in the message payload
+ - cbdata: message custom data
+
+ Calling fs_netlink_send_event will result in a message being sent by
+ the FS_EVENT multicast group. Note that the body of the message should be
+ prepared (set up) by the caller - through compose_msg callback. The message's
+ sk_buff will be allocated on behalf of the caller (thus the size parameter).
+ The compose_msg should only fill the payload with proper data. Unless
+ the event id is specified as FS_EVENT_NONE, its value shall be added
+ to the payload prior to calling the compose_msg.
+
+
diff --git a/fs/Kconfig b/fs/Kconfig
index ec35851..a89e678 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -69,6 +69,8 @@ config FILE_LOCKING
 	  for filesystems like NFS and for the flock() system call.
 	  Disabling this option saves about 11k.
+source "fs/events/Kconfig"
+
 source "fs/notify/Kconfig"

 source "fs/quota/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index a88ac48..bcb3048 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules
 obj-$(CONFIG_CEPH_FS) += ceph/
 obj-$(CONFIG_PSTORE) += pstore/
 obj-$(CONFIG_EFIVAR_FS) += efivarfs/
+obj-$(CONFIG_FS_EVENTS) += events/
diff --git a/fs/events/Kconfig b/fs/events/Kconfig
new file mode 100644
index 0000000..1c60195
--- /dev/null
+++ b/fs/events/Kconfig
@@ -0,0 +1,7 @@
+# Generic File System events interface
+config FS_EVENTS
+	bool "Generic filesystem events"
+	select NET
+	default y
+	help
+	  Enable generic filesystem events interface
diff --git a/fs/events/Makefile b/fs/events/Makefile
new file mode 100644
index 0000000..9c98337
--- /dev/null
+++ b/fs/events/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the Linux Generic File System Event Interface
+#
+
+obj-y := fs_event.o fs_event_netlink.o
diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c
new file mode 100644
index 0000000..1037311
--- /dev/null
+++ b/fs/events/fs_event.c
@@ -0,0 +1,809 @@
+/*
+ * Generic File System Events Interface
+ *
+ * Copyright(c) 2015 Samsung Electronics. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "../pnode.h"
+#include "fs_event.h"
+
+static LIST_HEAD(fs_trace_list);
+static DEFINE_MUTEX(fs_trace_lock);
+
+static struct kmem_cache *fs_trace_cachep __read_mostly;
+
+static atomic_t stray_traces = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(trace_wq);
+/*
+ * Threshold notification state bits.
+ * Note the reverse as this refers to the number
+ * of available blocks.
+ */
+#define THRESH_LR_BELOW		0x0001 /* Falling below the lower range */
+#define THRESH_LR_BEYOND	0x0002
+#define THRESH_UR_BELOW		0x0004
+#define THRESH_UR_BEYOND	0x0008 /* Going beyond the upper range */
+
+#define THRESH_LR_ON	(THRESH_LR_BELOW | THRESH_LR_BEYOND)
+#define THRESH_UR_ON	(THRESH_UR_BELOW | THRESH_UR_BEYOND)
+
+#define FS_TRACE_ADD	0x100000
+
+struct fs_trace_entry {
+	struct kref count;
+	atomic_t active;
+	struct super_block *sb;
+	unsigned int notify;
+	struct path mnt_path;
+	struct list_head node;
+
+	struct fs_event_thresh {
+		u64 avail_space;
+		u64 lrange;
+		u64 urange;
+		unsigned int state;
+	} th;
+	struct rcu_head rcu_head;
+	spinlock_t lock;
+};
+
+static const match_table_t fs_etypes = {
+	{ FS_EVENT_GENERIC, "G" },
+	{ FS_EVENT_THRESH, "T" },
+	{ 0, NULL },
+};
+
+static inline int fs_trace_query_data(struct super_block *sb,
+				      struct fs_trace_entry *en)
+{
+	if (sb->s_etrace.ops && sb->s_etrace.ops->query) {
+		sb->s_etrace.ops->query(sb, &en->th.avail_space);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static inline void fs_trace_entry_free(struct fs_trace_entry *en)
+{
+	kmem_cache_free(fs_trace_cachep, en);
+}
+
+static void fs_destroy_trace_entry(struct kref *en_ref)
+{
+	struct fs_trace_entry *en = container_of(en_ref,
+					struct fs_trace_entry, count);
+
+	/* Last reference has been dropped */
+	fs_trace_entry_free(en);
+	atomic_dec(&stray_traces);
+}
+
+static void fs_trace_entry_put(struct fs_trace_entry *en)
+{
+	kref_put(&en->count, fs_destroy_trace_entry);
+}
+
+static void fs_release_trace_entry(struct rcu_head *rcu_head)
+{
+	struct fs_trace_entry *en = container_of(rcu_head,
+						 struct fs_trace_entry,
+						 rcu_head);
+	/*
+	 * As opposed to typical reference drop, this one is being
+	 * called from the rcu callback. This is to make sure all
+	 * readers have managed to safely grab the reference before
+	 * the change to rcu pointer is visible to all and before
+	 * the reference is dropped here.
+	 */
+	fs_trace_entry_put(en);
+}
+
+static void fs_drop_trace_entry(struct fs_trace_entry *en)
+{
+	struct super_block *sb;
+
+	lockdep_assert_held(&fs_trace_lock);
+	/*
+	 * The trace entry might have already been removed
+	 * from the list of active traces with the proper
+	 * ref drop, though it was still in use handling
+	 * one of the fs events. This means that the object
+	 * has been already scheduled for being released.
+	 * So leave...
+	 */
+
+	if (!atomic_add_unless(&en->active, -1, 0))
+		return;
+	/*
+	 * At this point the trace entry is being marked as inactive
+	 * so no new references will be allowed.
+	 * Still it might be floating around somewhere
+	 * so drop the reference when the rcu readers are done.
+	 */
+	spin_lock(&en->lock);
+	list_del(&en->node);
+	sb = en->sb;
+	en->sb = NULL;
+	spin_unlock(&en->lock);
+
+	rcu_assign_pointer(sb->s_etrace.e_priv, NULL);
+	call_rcu(&en->rcu_head, fs_release_trace_entry);
+	/* It's safe now to drop the reference to the super */
+	deactivate_super(sb);
+	atomic_inc(&stray_traces);
+}
+
+static inline
+struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en)
+{
+	if (en) {
+		if (!kref_get_unless_zero(&en->count))
+			return NULL;
+		/* Don't allow referencing inactive object */
+		if (!atomic_read(&en->active)) {
+			fs_trace_entry_put(en);
+			return NULL;
+		}
+	}
+	return en;
+}
+
+static struct fs_trace_entry *fs_trace_entry_get_rcu(struct super_block *sb)
+{
+	struct fs_trace_entry *en;
+
+	if (!sb)
+		return NULL;
+
+	rcu_read_lock();
+	en = rcu_dereference(sb->s_etrace.e_priv);
+	en = fs_trace_entry_get(en);
+	rcu_read_unlock();
+
+	return en;
+}
+
+static int fs_remove_trace_entry(struct super_block *sb)
+{
+	struct fs_trace_entry *en;
+
+	en = fs_trace_entry_get_rcu(sb);
+	if (!en)
+		return -EINVAL;
+
+	mutex_lock(&fs_trace_lock);
+	fs_drop_trace_entry(en);
+	mutex_unlock(&fs_trace_lock);
+	fs_trace_entry_put(en);
+	return 0;
+}
+
+static void fs_remove_all_traces(void)
+{
+	struct fs_trace_entry *en, *guard;
+
+	mutex_lock(&fs_trace_lock);
+	list_for_each_entry_safe(en, guard, &fs_trace_list, node)
+		fs_drop_trace_entry(en);
+	mutex_unlock(&fs_trace_lock);
+}
+
+static int create_common_msg(struct sk_buff *skb, void *data)
+{
+	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
+	struct super_block *sb = en->sb;
+
+	if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
+	    || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
+		return -EINVAL;
+
+	if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current))))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int create_thresh_msg(struct sk_buff *skb, void *data)
+{
+	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
+	int ret;
+
+	ret = create_common_msg(skb, data);
+	if (!ret)
+		ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space);
+	return ret;
+}
+
+static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id)
+{
+	size_t size = nla_total_size(sizeof(u32)) * 2 +
+		      nla_total_size(sizeof(u64));
+
+	fs_netlink_send_event(size, event_id, create_common_msg, en);
+}
+
+static void fs_event_send_thresh(struct fs_trace_entry *en,
+				 unsigned int event_id)
+{
+	size_t size = nla_total_size(sizeof(u32)) * 2 +
+		      nla_total_size(sizeof(u64)) * 2;
+
+	fs_netlink_send_event(size, event_id, create_thresh_msg, en);
+}
+
+void fs_event_notify(struct super_block *sb, unsigned int event_id)
+{
+	struct fs_trace_entry *en;
+
+	en = fs_trace_entry_get_rcu(sb);
+	if (!en)
+		return;
+
+	spin_lock(&en->lock);
+	if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC))
+		fs_event_send(en, event_id);
+	spin_unlock(&en->lock);
+	fs_trace_entry_put(en);
+}
+EXPORT_SYMBOL(fs_event_notify);
+
+void fs_event_alloc_space(struct super_block *sb, u64 ncount)
+{
+	struct fs_trace_entry *en;
+	s64 count;
+
+	en = fs_trace_entry_get_rcu(sb);
+	if (!en)
+		return;
+
+	spin_lock(&en->lock);
+
+	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
+		goto leave;
+	/*
+	 * we shouldn't drop below 0 here,
+	 * unless there is a sync issue somewhere (?)
+	 */
+	count = en->th.avail_space - ncount;
+	en->th.avail_space = count < 0 ? 0 : count;
+
+	if (en->th.avail_space > en->th.lrange)
+		/* Not 'even' close - leave */
+		goto leave;
+
+	if (en->th.avail_space > en->th.urange) {
+		/* Close enough - the lower range has been reached */
+		if (!(en->th.state & THRESH_LR_BEYOND)) {
+			/* Send notification */
+			fs_event_send_thresh(en, FS_THR_LRBELOW);
+			en->th.state &= ~THRESH_LR_BELOW;
+			en->th.state |= THRESH_LR_BEYOND;
+		}
+		goto leave;
+	}
+	if (!(en->th.state & THRESH_UR_BEYOND)) {
+		fs_event_send_thresh(en, FS_THR_URBELOW);
+		en->th.state &= ~THRESH_UR_BELOW;
+		en->th.state |= THRESH_UR_BEYOND;
+	}
+
+leave:
+	spin_unlock(&en->lock);
+	fs_trace_entry_put(en);
+}
+EXPORT_SYMBOL(fs_event_alloc_space);
+
+void fs_event_free_space(struct super_block *sb, u64 ncount)
+{
+	struct fs_trace_entry *en;
+
+	en = fs_trace_entry_get_rcu(sb);
+	if (!en)
+		return;
+
+	spin_lock(&en->lock);
+
+	if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH))
+		goto leave;
+
+	en->th.avail_space += ncount;
+
+	if (en->th.avail_space > en->th.lrange) {
+		if (!(en->th.state & THRESH_LR_BELOW)
+		    && en->th.state & THRESH_LR_BEYOND) {
+			/* Send notification */
+			fs_event_send_thresh(en, FS_THR_LRABOVE);
+			en->th.state &= ~(THRESH_LR_BEYOND|THRESH_UR_BEYOND);
+			en->th.state |= THRESH_LR_BELOW;
+			goto leave;
+		}
+	}
+	if (en->th.avail_space > en->th.urange) {
+		if (!(en->th.state & THRESH_UR_BELOW)
+		    && en->th.state & THRESH_UR_BEYOND) {
+			/* Notify */
+			fs_event_send_thresh(en, FS_THR_URABOVE);
+			en->th.state &= ~THRESH_UR_BEYOND;
+			en->th.state |= THRESH_UR_BELOW;
+		}
+	}
+leave:
+	spin_unlock(&en->lock);
+	fs_trace_entry_put(en);
+}
+EXPORT_SYMBOL(fs_event_free_space);
+
+void fs_event_mount_dropped(struct vfsmount *mnt)
+{
+	/*
+	 * The mount is dropped but the super might not get released
+	 * at once so there is very small chance some notifications
+	 * will come through.
+	 * Note that the mount being dropped here might belong to a different
+	 * namespace - if this is the case, just ignore it.
+	 */
+	struct fs_trace_entry *en = fs_trace_entry_get_rcu(mnt->mnt_sb);
+	struct vfsmount *en_mnt;
+
+	if (!en || !atomic_read(&en->active))
+		return;
+	/*
+	 * The entry once set, does not change the mountpoint it's being
+	 * pinned to, so no need to take the lock here.
+	 */
+	en_mnt = en->mnt_path.mnt;
+	if (real_mount(mnt)->mnt_ns == real_mount(en_mnt)->mnt_ns)
+		fs_remove_trace_entry(mnt->mnt_sb);
+	fs_trace_entry_put(en);
+}
+
+static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh,
+			      unsigned int nmask)
+{
+	struct fs_trace_entry *en;
+	struct super_block *sb;
+	struct mount *r_mnt;
+
+	en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL);
+	if (unlikely(!en))
+		return -ENOMEM;
+	/*
+	 * Note that no reference is being taken here for the path as it would
+	 * make the unmount unnecessarily puzzling (due to an extra 'valid'
+	 * reference for the mnt).
+	 * This is *rather* safe as the notification on mount being dropped
+	 * will get called prior to releasing the super block - so right
+	 * in time to perform appropriate clean-up
+	 */
+	r_mnt = real_mount(path->mnt);
+
+	en->mnt_path.dentry = r_mnt->mnt.mnt_root;
+	en->mnt_path.mnt = &r_mnt->mnt;
+
+	sb = path->mnt->mnt_sb;
+	en->sb = sb;
+	/*
+	 * Increase the refcount for sb to mark it's being relied on.
+	 * Note that the reference to path is taken by the caller, so it
+	 * is safe to assume there is at least single active reference
+	 * to super as well.
+	 */
+	atomic_inc(&sb->s_active);
+
+	nmask &= sb->s_etrace.events_cap_mask;
+	if (!nmask)
+		goto leave;
+
+	spin_lock_init(&en->lock);
+	INIT_LIST_HEAD(&en->node);
+
+	en->notify = nmask;
+	memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state));
+	if (nmask & FS_EVENT_THRESH)
+		fs_trace_query_data(sb, en);
+
+	kref_init(&en->count);
+
+	if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) {
+		struct fs_trace_entry *prev_en;
+
+		prev_en = fs_trace_entry_get_rcu(sb);
+		if (prev_en) {
+			WARN_ON(prev_en);
+			fs_trace_entry_put(prev_en);
+			goto leave;
+		}
+	}
+	atomic_set(&en->active, 1);
+
+	mutex_lock(&fs_trace_lock);
+	list_add(&en->node, &fs_trace_list);
+	mutex_unlock(&fs_trace_lock);
+
+	rcu_assign_pointer(sb->s_etrace.e_priv, en);
+	synchronize_rcu();
+
+	return 0;
+leave:
+	deactivate_super(sb);
+	kmem_cache_free(fs_trace_cachep, en);
+	return -EINVAL;
+}
+
+static int fs_update_trace_entry(struct path *path,
+				 struct fs_event_thresh *thresh,
+				 unsigned int nmask)
+{
+	struct fs_trace_entry *en;
+	struct super_block *sb;
+	int extend = nmask & FS_TRACE_ADD;
+	int ret = -EINVAL;
+
+	en = fs_trace_entry_get_rcu(path->mnt->mnt_sb);
+	if (!en)
+		return (extend) ? fs_new_trace_entry(path, thresh, nmask)
+				: -EINVAL;
+
+	if (!atomic_read(&en->active))
+		return -EINVAL;
+
+	nmask &= ~FS_TRACE_ADD;
+
+	spin_lock(&en->lock);
+	sb = en->sb;
+	if (!sb || !(nmask & sb->s_etrace.events_cap_mask))
+		goto leave;
+
+	if (nmask & FS_EVENT_THRESH) {
+		if (extend) {
+			/* Get the current state */
+			if (!(en->notify & FS_EVENT_THRESH))
+				if (fs_trace_query_data(sb, en))
+					goto leave;
+
+			if (thresh->state & THRESH_LR_ON) {
+				en->th.lrange = thresh->lrange;
+				en->th.state &= ~THRESH_LR_ON;
+			}
+
+			if (thresh->state & THRESH_UR_ON) {
+				en->th.urange = thresh->urange;
+				en->th.state &= ~THRESH_UR_ON;
+			}
+		} else {
+			memset(&en->th, 0, sizeof(en->th));
+		}
+	}
+
+	if (extend)
+		en->notify |= nmask;
+	else
+		en->notify &= ~nmask;
+	ret = 0;
+leave:
+	spin_unlock(&en->lock);
+	fs_trace_entry_put(en);
+	return ret;
+}
+
+static int fs_parse_trace_request(int argc, char **argv)
+{
+	struct fs_event_thresh thresh = {0};
+	struct path path;
+	substring_t args[MAX_OPT_ARGS];
+	unsigned int nmask = FS_TRACE_ADD;
+	int token;
+	char *s;
+	int ret = -EINVAL;
+
+	if (!argc) {
+		fs_remove_all_traces();
+		return 0;
+	}
+
+	s = *(argv);
+	if (*s == '!') {
+		/* Clear the trace entry */
+		nmask &= ~FS_TRACE_ADD;
+		++s;
+	}
+
+	if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW))
+		return -EINVAL;
+
+	if (!(--argc)) {
+		if (!(nmask & FS_TRACE_ADD))
+			ret = fs_remove_trace_entry(path.mnt->mnt_sb);
+		goto leave;
+	}
+
+repeat:
+	args[0].to = args[0].from = NULL;
+	token = match_token(*(++argv), fs_etypes, args);
+	if (!token && !nmask)
+		goto leave;
+
+	nmask |= token & FS_EVENTS_ALL;
+	--argc;
+	if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) {
+		/*
+		 * Get the threshold config data:
+		 * lower range
+		 * upper range
+		 */
+		if (!argc)
+			goto leave;
+
+		ret = kstrtoull(*(++argv), 10, &thresh.lrange);
+		if (ret)
+			goto leave;
+		thresh.state |= THRESH_LR_ON;
+		if ((--argc)) {
+			ret = kstrtoull(*(++argv), 10, &thresh.urange);
+			if (ret)
+				goto leave;
+			thresh.state |= THRESH_UR_ON;
+			--argc;
+		}
+		/* The thresholds are based on number of available blocks */
+		if (thresh.lrange < thresh.urange) {
+			ret = -EINVAL;
+			goto leave;
+		}
+	}
+	if (argc)
+		goto repeat;
+
+	ret = fs_update_trace_entry(&path, &thresh, nmask);
+leave:
+	path_put(&path);
+	return ret;
+}
+
+#define DEFAULT_BUF_SIZE PAGE_SIZE
+
+static ssize_t fs_trace_write(struct file *file, const char __user *buffer,
+			      size_t count, loff_t *ppos)
+{
+	char **argv;
+	char *kern_buf, *next, *cfg;
+	size_t size, dcount = 0;
+	int argc;
+
+	if (!count)
+		return 0;
+
+	kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL);
+	if (!kern_buf)
+		return -ENOMEM;
+
+	while (dcount < count) {
+
+		size = count - dcount;
+		if (size >= DEFAULT_BUF_SIZE)
+			size = DEFAULT_BUF_SIZE - 1;
+		if (copy_from_user(kern_buf, buffer + dcount, size)) {
+			dcount = -EINVAL;
+			goto leave;
+		}
+
+		kern_buf[size] = '\0';
+
+		next = cfg = kern_buf;
+
+		do {
+			next = strchr(cfg, ';');
+			if (next)
+				*next = '\0';
+
+			argv = argv_split(GFP_KERNEL, cfg, &argc);
+			if (!argv) {
+				dcount = -ENOMEM;
+				goto leave;
+			}
+
+			if (fs_parse_trace_request(argc, argv)) {
+				dcount = -EINVAL;
+				argv_free(argv);
+				goto leave;
+			}
+
+			argv_free(argv);
+			if (next)
+				cfg = ++next;
+
+		} while (next);
+		dcount += size;
+	}
+leave:
+	kfree(kern_buf);
+	return dcount;
+}
+
+static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos)
+{
+	mutex_lock(&fs_trace_lock);
+	return seq_list_start(&fs_trace_list, *pos);
+}
+
+static void *fs_trace_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	return seq_list_next(v, &fs_trace_list, pos);
+}
+
+static void fs_trace_seq_stop(struct seq_file *m, void *v)
+{
+	mutex_unlock(&fs_trace_lock);
+}
+
+static int fs_trace_seq_show(struct seq_file *m, void *v)
+{
+	struct fs_trace_entry *en;
+	struct super_block *sb;
+	struct mount *r_mnt;
+	const struct match_token *match;
+	unsigned int nmask;
+
+	en = list_entry(v, struct fs_trace_entry, node);
+	/* Do not show the entries outside current mount namespace */
+	r_mnt = real_mount(en->mnt_path.mnt);
+	if (r_mnt->mnt_ns != current->nsproxy->mnt_ns) {
+		if (!__is_local_mountpoint(r_mnt->mnt_mountpoint))
+			return 0;
+	}
+
+	sb = en->sb;
+
+	seq_path(m, &en->mnt_path, "\t\n\\");
+	seq_putc(m, ' ');
+
+	seq_escape(m, sb->s_type->name, " \t\n\\");
+	if (sb->s_subtype && sb->s_subtype[0]) {
+		seq_putc(m, '.');
+		seq_escape(m, sb->s_subtype, " \t\n\\");
+	}
+
+	seq_putc(m, ' ');
+	if (sb->s_op->show_devname) {
+		sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root);
+	} else {
+		seq_escape(m, r_mnt->mnt_devname ? r_mnt->mnt_devname : "none",
+			   " \t\n\\");
+	}
+	seq_puts(m, " (");
+
+	nmask = en->notify;
+	for (match = fs_etypes; match->pattern; ++match) {
+		if (match->token & nmask) {
+			seq_puts(m, match->pattern);
+			nmask &= ~match->token;
+			if (nmask)
+				seq_putc(m, ',');
+		}
+	}
+	seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange);
+	seq_puts(m, ")\n");
+	return 0;
+}
+
+static const struct seq_operations fs_trace_seq_ops = {
+	.start	= fs_trace_seq_start,
+	.next	= fs_trace_seq_next,
+	.stop	= fs_trace_seq_stop,
+	.show	= fs_trace_seq_show,
+};
+
+static int fs_trace_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &fs_trace_seq_ops);
+}
+
+static const struct file_operations fs_trace_fops = {
+	.owner		= THIS_MODULE,
+	.open		= fs_trace_open,
+	.write		= fs_trace_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int fs_trace_init(void)
+{
+	fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0);
+	if (!fs_trace_cachep)
+		return -ENOMEM;
+	init_waitqueue_head(&trace_wq);
+	return 0;
+}
+
+/* VFS support */
+static int fs_trace_fill_super(struct super_block *sb, void *data, int silent)
+{
+	int ret;
+	static struct tree_descr desc[] = {
+		[2] = {
+			.name	= "config",
+			.ops	= &fs_trace_fops,
+			.mode	= S_IWUSR | S_IRUGO,
+		},
+		{""},
+	};
+
+	ret = simple_fill_super(sb, 0x7246332, desc);
+	return !ret ? fs_trace_init() : ret;
+}
+
+static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type,
+		int ntype, const char *dev_name, void *data)
+{
+	return mount_single(fs_type, ntype, data, fs_trace_fill_super);
+}
+
+static void fs_trace_kill_super(struct super_block *sb)
+{
+	/*
+	 * The rcu_barrier here will/should make sure all call_rcu
+	 * callbacks are completed - still there might be some active
+	 * trace objects in use which can make calling the
+	 * kmem_cache_destroy unsafe. So we wait until all traces
+	 * are finally released.
+	 */
+	fs_remove_all_traces();
+	rcu_barrier();
+	wait_event(trace_wq, !atomic_read(&stray_traces));
+
+	kmem_cache_destroy(fs_trace_cachep);
+	kill_litter_super(sb);
+}
+
+static struct kset *fs_trace_kset;
+
+static struct file_system_type fs_trace_fstype = {
+	.name		= "fstrace",
+	.mount		= fs_trace_do_mount,
+	.kill_sb	= fs_trace_kill_super,
+};
+
+static void __init fs_trace_vfs_init(void)
+{
+	fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj);
+
+	if (!fs_trace_kset)
+		return;
+
+	if (!register_filesystem(&fs_trace_fstype)) {
+		if (!fs_event_netlink_register())
+			return;
+		unregister_filesystem(&fs_trace_fstype);
+	}
+	kset_unregister(fs_trace_kset);
+}
+
+static int __init fs_trace_events_init(void)
+{
+	fs_trace_vfs_init();
+	return 0;
+}
+module_init(fs_trace_events_init);
+
diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h
new file mode 100644
index 0000000..23f24c8
--- /dev/null
+++ b/fs/events/fs_event.h
@@ -0,0 +1,22 @@
+/*
+ * Copyright(c) 2015 Samsung Electronics. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2.
+ *
+ * The full GNU General Public License is included in this distribution in the
+ * file called COPYING.
+ * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ + +#ifndef __GENERIC_FS_EVENTS_H +#define __GENERIC_FS_EVENTS_H + +int fs_event_netlink_register(void); +void fs_event_netlink_unregister(void); + +#endif /* __GENERIC_FS_EVENTS_H */ diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c new file mode 100644 index 0000000..0c97eb7 --- /dev/null +++ b/fs/events/fs_event_netlink.c @@ -0,0 +1,104 @@ +/* + * Copyright(c) 2015 Samsung Electronics. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include "fs_event.h" + +static const struct genl_multicast_group fs_event_mcgroups[] = { + { .name = FS_EVENTS_MCAST_GRP_NAME, }, +}; + +static struct genl_family fs_event_family = { + .id = GENL_ID_GENERATE, + .name = FS_EVENTS_FAMILY_NAME, + .version = 1, + .maxattr = FS_NL_A_MAX, + .mcgrps = fs_event_mcgroups, + .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups), +}; + +int fs_netlink_send_event(size_t size, unsigned int event_id, + int (*compose_msg)(struct sk_buff *skb, void *data), + void *cbdata) +{ + static atomic_t seq; + struct sk_buff *skb; + void *msg_head; + int ret = 0; + + if (!size || !compose_msg) + return -EINVAL; + + /* Skip if there are no listeners */ + if (!genl_has_listeners(&fs_event_family, &init_net, 0)) + return 0; + + if (event_id != FS_EVENT_NONE) + size += nla_total_size(sizeof(u32)); + size += nla_total_size(sizeof(u64)); + skb = genlmsg_new(size, GFP_NOWAIT); + + if (!skb) { + pr_debug("Failed to allocate new FS generic netlink message\n"); + return -ENOMEM; + } + + msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq), + &fs_event_family, 0, FS_NL_C_EVENT); + if (!msg_head) + goto cleanup; + + if (event_id != FS_EVENT_NONE) + if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id)) + goto cancel; + + ret = compose_msg(skb, cbdata); + if (ret) + goto cancel; + + genlmsg_end(skb, msg_head); + ret = genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT); + if (ret && ret != -ENOBUFS && ret != -ESRCH) + goto cleanup; + + return ret; + +cancel: + genlmsg_cancel(skb, msg_head); +cleanup: + nlmsg_free(skb); + return ret; +} +EXPORT_SYMBOL(fs_netlink_send_event); + +int fs_event_netlink_register(void) +{ + int ret; + + ret = genl_register_family(&fs_event_family); + if (ret) + pr_err("Failed to register FS netlink interface\n"); + return ret; +} + +void fs_event_netlink_unregister(void) +{ + genl_unregister_family(&fs_event_family); +} diff --git a/fs/namespace.c 
b/fs/namespace.c index 82ef140..ec6e2ef 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt) if (unlikely(mnt->mnt_pins.first)) mnt_pin_kill(mnt); fsnotify_vfsmount_delete(&mnt->mnt); + fs_event_mount_dropped(&mnt->mnt); dput(mnt->mnt.mnt_root); deactivate_super(mnt->mnt.mnt_sb); mnt_free_id(mnt); diff --git a/include/linux/fs.h b/include/linux/fs.h index b4d71b5..b7dadd9 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -263,6 +263,10 @@ struct iattr { * Includes for diskquotas. */ #include <linux/quota.h> +/* + * Include for Generic File System Events Interface + */ +#include <linux/fs_event.h> /* * Maximum number of layers of fs stack. Needs to be limited to @@ -1253,7 +1257,7 @@ struct super_block { struct hlist_node s_instances; unsigned int s_quota_types; /* Bitmask of supported quota types */ struct quota_info s_dquot; /* Diskquota specific options */ - + struct fs_trace_info s_etrace; struct sb_writers s_writers; char s_id[32]; /* Informational name */ diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h new file mode 100644 index 0000000..83e22dd --- /dev/null +++ b/include/linux/fs_event.h @@ -0,0 +1,72 @@ +/* + * Generic File System Events Interface + * + * Copyright(c) 2015 Samsung Electronics. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details.
+ */ +#ifndef _LINUX_GENERIC_FS_EVENTS_ +#define _LINUX_GENERIC_FS_EVENTS_ +#include +#include + +/* + * Currently supported event types + */ +#define FS_EVENT_GENERIC 0x001 +#define FS_EVENT_THRESH 0x002 + +#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH) + +struct fs_trace_operations { + void (*query)(struct super_block *, u64 *); +}; + +struct fs_trace_info { + void __rcu *e_priv; /* READ ONLY */ + unsigned int events_cap_mask; /* Supported notifications */ + const struct fs_trace_operations *ops; +}; + +#ifdef CONFIG_FS_EVENTS + +void fs_event_notify(struct super_block *sb, unsigned int event_id); +void fs_event_alloc_space(struct super_block *sb, u64 ncount); +void fs_event_free_space(struct super_block *sb, u64 ncount); +void fs_event_mount_dropped(struct vfsmount *mnt); + +int fs_netlink_send_event(size_t size, unsigned int event_id, + int (*compose_msg)(struct sk_buff *skb, void *data), + void *cbdata); + +#else /* CONFIG_FS_EVENTS */ + +static inline +void fs_event_notify(struct super_block *sb, unsigned int event_id) {}; +static inline +void fs_event_alloc_space(struct super_block *sb, u64 ncount) {}; +static inline +void fs_event_free_space(struct super_block *sb, u64 ncount) {}; +static inline +void fs_event_mount_dropped(struct vfsmount *mnt) {}; + +static inline +int fs_netlink_send_event(size_t size, unsigned int event_id, + int (*compose_msg)(struct sk_buff *skb, void *data), + void *cbdata) +{ + return -ENOSYS; +} +#endif /* CONFIG_FS_EVENTS */ + +#endif /* _LINUX_GENERIC_FS_EVENTS_ */ + diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild index 68ceb97..dae0fab 100644 --- a/include/uapi/linux/Kbuild +++ b/include/uapi/linux/Kbuild @@ -129,6 +129,7 @@ header-y += firewire-constants.h header-y += flat.h header-y += fou.h header-y += fs.h +header-y += fs_event.h header-y += fsl_hypervisor.h header-y += fuse.h header-y += futex.h diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h new file mode 100644
index 0000000..d8b07da --- /dev/null +++ b/include/uapi/linux/fs_event.h @@ -0,0 +1,58 @@ +/* + * Generic netlink support for Generic File System Events Interface + * + * Copyright(c) 2015 Samsung Electronics. All rights reserved. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2. + * + * The full GNU General Public License is included in this distribution in the + * file called COPYING. + * + * This program is distributed in the hope that it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + */ +#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_ +#define _UAPI_LINUX_GENERIC_FS_EVENTS_ + +#define FS_EVENTS_FAMILY_NAME "fs_event" +#define FS_EVENTS_MCAST_GRP_NAME "fs_event_mc_grp" + +/* + * Generic netlink attribute types + */ +enum { + FS_NL_A_NONE, + FS_NL_A_EVENT_ID, + FS_NL_A_DEV_MAJOR, + FS_NL_A_DEV_MINOR, + FS_NL_A_CAUSED_ID, + FS_NL_A_DATA, + __FS_NL_A_MAX, +}; +#define FS_NL_A_MAX (__FS_NL_A_MAX - 1) +/* + * Generic netlink commands + */ +#define FS_NL_C_EVENT 1 + +/* + * Supported set of FS events + */ +enum { + FS_EVENT_NONE, + FS_WARN_ENOSPC, /* No space left to reserve data blks */ + FS_WARN_ENOSPC_META, /* No space left for metadata */ + FS_THR_LRBELOW, /* The threshold lower range has been reached */ + FS_THR_LRABOVE, /* The threshold lower range re-activated */ + FS_THR_URBELOW, + FS_THR_URABOVE, + FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */ + FS_ERR_CORRUPTED /* Critical error - fs corrupted */ + +}; + +#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */ + -- 1.7.9.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756466AbbFPNKM (ORCPT ); Tue, 16 Jun 2015 09:10:12 -0400 Received: from mailout4.w1.samsung.com ([210.118.77.14]:13588
"EHLO mailout4.w1.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753902AbbFPNJp (ORCPT ); Tue, 16 Jun 2015 09:09:45 -0400 X-AuditID: cbfec7f4-f79c56d0000012ee-3e-55802016216c From: Beata Michalska To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org Subject: [RFC v3 2/4] ext4: Add helper function to mark group as corrupted Date: Tue, 16 Jun 2015 15:09:31 +0200 Message-id: <1434460173-18427-3-git-send-email-b.michalska@samsung.com> X-Mailer: git-send-email 1.7.9.5 In-reply-to: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFjrFLMWRmVeSWpSXmKPExsVy+t/xq7piCg2hBpf+S1t8/dLBYnFuwQxG i9MTFjFZPP3Ux2Ixe3ozk8Wty6tYLM42vWG3WPZgM4vF5u8dbBYz591hs9iz9ySLxeVdc9gs 7q35z2rR2vOT3YHPo2VzuceCTaUem1doebx9GOCx6dMkdo+mM0eZPd7vu8rm0bdlFaPHmQVH 2D0+b5IL4IrisklJzcksSy3St0vgyti39CFLwT/liq3NF5gaGJvkuhg5OSQETCR+nGxihrDF JC7cW8/WxcjFISSwlFHi4L7nrBBOI5PEu85edpAqNgF9iVczVjKB2CICMRIHd/WwgBQxC7xi lPjzcDlYQljAS2LpsllACQ4OFgFViZdnzUDCvALuEu39d1hBwhICChJzJtmAhDkFPCS6H09j BwkLAZUc3OA5gZF3ASPDKkbR1NLkguKk9FxDveLE3OLSvHS95PzcTYyQEP6yg3HxMatDjAIc jEo8vBGfakOFWBPLiitzDzFKcDArifDOE2kIFeJNSaysSi3Kjy8qzUktPsQozcGiJM47d9f7 ECGB9MSS1OzU1ILUIpgsEwenVANjd9SEvBMKh84n5ss4/36qw7rOfbL7k1W9rlztr4qFOT13 dQuevvKhpiLn9bm2Bv6Nftl505Sullf+LHe803ht3mbJ67MK18U/bK26KfVyosRz67R7DM5i 5efvaveqbFHie1tZYvWp6ui3N928+lPNdiVdXCO1e/o0y82+Qlxc2tnhhm7OBn+VWIozEg21 mIuKEwGhN0VyXQIAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add ext4_mark_group_corrupted helper function to simplify the code and to keep the logic in one place. 
Signed-off-by: Beata Michalska --- fs/ext4/balloc.c | 15 +++------------ fs/ext4/ext4.h | 9 +++++++++ fs/ext4/ialloc.c | 5 +---- fs/ext4/mballoc.c | 11 ++--------- 4 files changed, 15 insertions(+), 25 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index 83a6f49..e95b27a 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -193,10 +193,7 @@ static int ext4_init_block_bitmap(struct super_block *sb, * essentially implementing a per-group read-only flag. */ if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) { grp = ext4_get_group_info(sb, block_group); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) { int count; count = ext4_free_inodes_count(sb, gdp); @@ -379,20 +376,14 @@ static void ext4_validate_block_bitmap(struct super_block *sb, ext4_unlock_group(sb, block_group); ext4_error(sb, "bg %u: block %llu: invalid block bitmap", block_group, blk); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); return; } if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group, desc, bh))) { ext4_unlock_group(sb, block_group); ext4_error(sb, "bg %u: bad block bitmap checksum", block_group); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); return; } set_buffer_verified(bh); diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index f63c3d5..163afe2 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2535,6 +2535,15 @@ static inline spinlock_t *ext4_group_lock_ptr(struct super_block *sb, return bgl_lock_ptr(EXT4_SB(sb)->s_blockgroup_lock, group); } 
+static inline +void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, + struct ext4_group_info *grp) +{ + if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) + percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); + set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); +} + /* * Returns true if the filesystem is busy enough that attempts to * access the block group locks has run into contention. diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index ac644c3..ebe0499 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -79,10 +79,7 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb, if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) { ext4_error(sb, "Checksum bad for group %u", block_group); grp = ext4_get_group_info(sb, block_group); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) { int count; count = ext4_free_inodes_count(sb, gdp); diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 8d1e602..24a4b6d 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -760,10 +760,7 @@ void ext4_mb_generate_buddy(struct super_block *sb, * corrupt and update bb_free using bitmap value */ grp->bb_free = free; - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - grp->bb_free); - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + ext4_mark_group_corrupted(sbi, grp); } mb_set_largest_free_order(sb, grp); @@ -1448,12 +1445,8 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b, "freeing already freed block " "(bit %u); block bitmap corrupt.", block); - if (!EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info)) - percpu_counter_sub(&sbi->s_freeclusters_counter, - e4b->bd_info->bb_free); /* Mark the block group as corrupt. 
*/ - set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, - &e4b->bd_info->bb_state); + ext4_mark_group_corrupted(sbi, e4b->bd_info); mb_regenerate_buddy(e4b); goto done; } -- 1.7.9.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756374AbbFPNKF (ORCPT ); Tue, 16 Jun 2015 09:10:05 -0400 Received: from mailout3.w1.samsung.com ([210.118.77.13]:15142 "EHLO mailout3.w1.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754703AbbFPNJq (ORCPT ); Tue, 16 Jun 2015 09:09:46 -0400 X-AuditID: cbfec7f4-f79c56d0000012ee-44-55802017cdd8 From: Beata Michalska To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org Subject: [RFC v3 4/4] shmem: Add support for generic FS events Date: Tue, 16 Jun 2015 15:09:33 +0200 Message-id: <1434460173-18427-5-git-send-email-b.michalska@samsung.com> X-Mailer: git-send-email 1.7.9.5 In-reply-to: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFjrNLMWRmVeSWpSXmKPExsVy+t/xq7riCg2hBof36Fh8/dLBYnFuwQxG i9MTFjFZPP3Ux2Ixe3ozk8Wty6tYLM42vWG3WPZgM4vF5u8dbBYz591hs9iz9ySLxeVdc9gs 7q35z2rR2vOT3YHPo2VzuceCTaUem1doebx9GOCx6dMkdo+mM0eZPd7vu8rm0bdlFaPHmQVH 2D0+b5IL4IrisklJzcksSy3St0vgyvj6dxpbwSbJikWfbzA2MB4Q6WLk5JAQMJFo33mVGcIW k7hwbz1bFyMXh5DAUkaJvxOmsEM4jUwSpye0MYJUsQnoS7yasZIJxBYRiJE4uKuHBaSIWeAV o8Sfh8vBEsICdhJPn25iB7FZBFQlOv4fAGvmFXCXWH9pKdAKDqB1ChJzJtmAhDkFPCS6H09j BwkLAZUc3OA5gZF3ASPDKkbR1NLkguKk9FxDveLE3OLSvHS95PzcTYyQIP6yg3HxMatDjAIc jEo8vBGfakOFWBPLiitzDzFKcDArifDOE2kIFeJNSaysSi3Kjy8qzUktPsQozcGiJM47d9f7 ECGB9MSS1OzU1ILUIpgsEwenVAPj6iWFYSfPeWjkGxvPMb0/4eF3fxF3P6Hdby/dU9Tf7b7k 
qON5y7KOov33DV4FnXIP/7Hw5ZXtcbrpPx+Hbr5ukGj/c+8e8Z9Cm3Z+9+7gaHgueULj9Mln YeVrqmdJJc7fGnzsUPnSQuvF1a7B31kfRGQHBE2qKkzlLDOJNlYunskWXOV/2eGEEktxRqKh FnNRcSIA42T1BF4CAAA= Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add support for the generic FS events interface covering threshold notifications and the ENOSPC warning. Signed-off-by: Beata Michalska --- mm/shmem.c | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index cf2d0ca..a044d12 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -201,6 +201,7 @@ static int shmem_reserve_inode(struct super_block *sb) spin_lock(&sbinfo->stat_lock); if (!sbinfo->free_inodes) { spin_unlock(&sbinfo->stat_lock); + fs_event_notify(sb, FS_WARN_ENOSPC); return -ENOSPC; } sbinfo->free_inodes--; @@ -239,8 +240,10 @@ static void shmem_recalc_inode(struct inode *inode) freed = info->alloced - info->swapped - inode->i_mapping->nrpages; if (freed > 0) { struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb); - if (sbinfo->max_blocks) + if (sbinfo->max_blocks) { percpu_counter_add(&sbinfo->used_blocks, -freed); + fs_event_free_space(inode->i_sb, freed); + } info->alloced -= freed; inode->i_blocks -= freed * BLOCKS_PER_PAGE; shmem_unacct_blocks(info->flags, freed); @@ -1164,6 +1167,7 @@ repeat: goto unacct; } percpu_counter_inc(&sbinfo->used_blocks); + fs_event_alloc_space(inode->i_sb, 1); } page = shmem_alloc_page(gfp, info, index); @@ -1245,8 +1249,10 @@ trunc: spin_unlock(&info->lock); decused: sbinfo = SHMEM_SB(inode->i_sb); - if (sbinfo->max_blocks) + if (sbinfo->max_blocks) { percpu_counter_add(&sbinfo->used_blocks, -1); + fs_event_free_space(inode->i_sb, 1); + } unacct: shmem_unacct_blocks(info->flags, 1); failed: @@ -1258,12 +1264,16 @@ unlock: unlock_page(page); page_cache_release(page); } - if (error == -ENOSPC && !once++) { + if (error == -ENOSPC) { + if (!once++) { info = SHMEM_I(inode); spin_lock(&info->lock);
shmem_recalc_inode(inode); spin_unlock(&info->lock); goto repeat; + } else { + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC); + } } if (error == -EEXIST) /* from above or from radix_tree_insert */ goto repeat; @@ -2729,12 +2739,26 @@ static int shmem_encode_fh(struct inode *inode, __u32 *fh, int *len, return 1; } +static void shmem_trace_query(struct super_block *sb, u64 *ncount) +{ + struct shmem_sb_info *sbinfo = SHMEM_SB(sb); + + if (sbinfo->max_blocks) + *ncount = sbinfo->max_blocks - + percpu_counter_sum(&sbinfo->used_blocks); + +} + static const struct export_operations shmem_export_ops = { .get_parent = shmem_get_parent, .encode_fh = shmem_encode_fh, .fh_to_dentry = shmem_fh_to_dentry, }; +static const struct fs_trace_operations shmem_trace_ops = { + .query = shmem_trace_query, +}; + static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo, bool remount) { @@ -3020,6 +3044,9 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent) sb->s_flags |= MS_NOUSER; } sb->s_export_op = &shmem_export_ops; + sb->s_etrace.ops = &shmem_trace_ops; + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; + sb->s_flags |= MS_NOSEC; #else sb->s_flags |= MS_NOUSER; -- 1.7.9.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756505AbbFPNKR (ORCPT ); Tue, 16 Jun 2015 09:10:17 -0400 Received: from mailout2.w1.samsung.com ([210.118.77.12]:13250 "EHLO mailout2.w1.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754684AbbFPNJq (ORCPT ); Tue, 16 Jun 2015 09:09:46 -0400 X-AuditID: cbfec7f5-f794b6d000001495-d4-55802017597c From: Beata Michalska To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org Cc: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org 
Subject: [RFC v3 3/4] ext4: Add support for generic FS events Date: Tue, 16 Jun 2015 15:09:32 +0200 Message-id: <1434460173-18427-4-git-send-email-b.michalska@samsung.com> X-Mailer: git-send-email 1.7.9.5 In-reply-to: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFjrBLMWRmVeSWpSXmKPExsVy+t/xq7riCg2hBjunslh8/dLBYnFuwQxG i9MTFjFZPP3Ux2Ixe3ozk8Wty6tYLM42vWG3WPZgM4vF5u8dbBYz591hs9iz9ySLxeVdc9gs 7q35z2rR2vOT3YHPo2VzuceCTaUem1doebx9GOCx6dMkdo+mM0eZPd7vu8rm0bdlFaPHmQVH 2D0+b5IL4IrisklJzcksSy3St0vgytgz4wp7wTTziiOTPjA1MM7W62Lk5JAQMJH40feCBcIW k7hwbz0biC0ksJRR4vdJhS5GLiC7kUni9IQ2RpAEm4C+xKsZK5lAbBGBGImDu3pYQIqYBV4x Svx5uBwsISxgK3Gq9SZzFyMHB4uAqkR3TyyIySvgLjH/lA6IKSGgIDFnkg1IMaeAh0T342ns IGEhoIqDGzwnMPIuYGRYxSiaWppcUJyUnmukV5yYW1yal66XnJ+7iRESvl93MC49ZnWIUYCD UYmHN+JTbagQa2JZcWXuIUYJDmYlEd55Ig2hQrwpiZVVqUX58UWlOanFhxilOViUxHln7nof IiSQnliSmp2aWpBaBJNl4uCUamD04kqfOtvXcLfK/sBuTbG2jrcr/Bu/7WSa7r715aVD857n b742b4KS9rHpbzZ6nqyR7bw5gaX0zc3e+1Pud+vozlJynq6rJmA1o883aI75ijSnWm0Oq4L2 Ww3e86pjaw/aN3xs9zg9MfE6wzGR3lUNkeEf0+z5Mrw339HmkC+YMzX/vIy0cKoSS3FGoqEW c1FxIgCR0GSSWwIAAA== Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Add support for generic FS events including threshold notifications, ENOSPC and remount as read-only warnings, along with generic internal warnings/errors. 
Signed-off-by: Beata Michalska --- fs/ext4/balloc.c | 10 ++++++++-- fs/ext4/ext4.h | 1 + fs/ext4/inode.c | 2 +- fs/ext4/mballoc.c | 6 +++++- fs/ext4/resize.c | 1 + fs/ext4/super.c | 39 +++++++++++++++++++++++++++++++++++++++ 6 files changed, 55 insertions(+), 4 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index e95b27a..a48450f 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -569,6 +569,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi, { if (ext4_has_free_clusters(sbi, nclusters, flags)) { percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters); + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, nclusters)); return 0; } else return -ENOSPC; @@ -590,9 +591,10 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries) { if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) || (*retries)++ > 3 || - !EXT4_SB(sb)->s_journal) + !EXT4_SB(sb)->s_journal) { + fs_event_notify(sb, FS_WARN_ENOSPC); return 0; - + } jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id); return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal); @@ -637,6 +639,10 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode, dquot_alloc_block_nofail(inode, EXT4_C2B(EXT4_SB(inode->i_sb), ar.len)); } + + if (*errp == -ENOSPC) + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC_META); + return ret; } diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 163afe2..7d75ff9 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2542,6 +2542,7 @@ void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, grp->bb_free)); } /* diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5cb9a21..2a7af0f 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1238,7 +1238,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free) 
percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free); spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); - + fs_event_free_space(sbi->s_sb, to_free); dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free)); } diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 24a4b6d..c2df6f0 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -4511,6 +4511,9 @@ out: kmem_cache_free(ext4_ac_cachep, ac); if (inquota && ar->len < inquota) dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len)); + if (reserv_clstrs && ar->len < reserv_clstrs) + fs_event_free_space(sbi->s_sb, + EXT4_C2B(sbi, reserv_clstrs - ar->len)); if (!ar->len) { if ((ar->flags & EXT4_MB_DELALLOC_RESERVED) == 0) /* release all the reserved blocks if non delalloc */ @@ -4848,7 +4851,7 @@ do_more: if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters); - + fs_event_free_space(sb, EXT4_C2B(sbi, count_clusters)); ext4_mb_unload_buddy(&e4b); /* We dirtied the bitmap block */ @@ -4982,6 +4985,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, ext4_unlock_group(sb, block_group); percpu_counter_add(&sbi->s_freeclusters_counter, EXT4_NUM_B2C(sbi, blocks_freed)); + fs_event_free_space(sb, blocks_freed); if (sbi->s_log_groups_per_flex) { ext4_group_t flex_group = ext4_flex_group(sbi, block_group); diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c index 8a8ec62..dbf08d6 100644 --- a/fs/ext4/resize.c +++ b/fs/ext4/resize.c @@ -1378,6 +1378,7 @@ static void ext4_update_super(struct super_block *sb, EXT4_NUM_B2C(sbi, free_blocks)); percpu_counter_add(&sbi->s_freeinodes_counter, EXT4_INODES_PER_GROUP(sb) * flex_gd->count); + fs_event_free_space(sb, free_blocks - reserved_blocks); ext4_debug("free blocks count %llu", percpu_counter_read(&sbi->s_freeclusters_counter)); diff --git a/fs/ext4/super.c b/fs/ext4/super.c index e061e66..108b667 100644 --- a/fs/ext4/super.c 
+++ b/fs/ext4/super.c @@ -585,6 +585,8 @@ void __ext4_abort(struct super_block *sb, const char *function, if (EXT4_SB(sb)->s_journal) jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); save_error_info(sb, function, line); + fs_event_notify(sb, FS_ERR_REMOUNT_RO); + } if (test_opt(sb, ERRORS_PANIC)) panic("EXT4-fs panic from previous error\n"); @@ -1083,6 +1085,12 @@ static const struct quotactl_ops ext4_qctl_operations = { }; #endif +static void ext4_trace_query(struct super_block *sb, u64 *ncount); + +static const struct fs_trace_operations ext4_trace_ops = { + .query = ext4_trace_query, +}; + static const struct super_operations ext4_sops = { .alloc_inode = ext4_alloc_inode, .destroy_inode = ext4_destroy_inode, @@ -3398,11 +3406,20 @@ static int ext4_reserve_clusters(struct ext4_sb_info *sbi, ext4_fsblk_t count) { ext4_fsblk_t clusters = ext4_blocks_count(sbi->s_es) >> sbi->s_cluster_bits; + ext4_fsblk_t current_resv; if (count >= clusters) return -EINVAL; + current_resv = atomic64_read(&sbi->s_resv_clusters); atomic64_set(&sbi->s_resv_clusters, count); + + if (count > current_resv) + fs_event_alloc_space(sbi->s_sb, + EXT4_C2B(sbi, count - current_resv)); + else + fs_event_free_space(sbi->s_sb, + EXT4_C2B(sbi, current_resv - count)); return 0; } @@ -3966,6 +3983,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) sb->s_qcop = &ext4_qctl_operations; sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP; #endif + sb->s_etrace.ops = &ext4_trace_ops; + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; + memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid)); INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */ @@ -5438,6 +5458,25 @@ out: #endif +static void ext4_trace_query(struct super_block *sb, u64 *ncount) +{ + struct ext4_sb_info *sbi = EXT4_SB(sb); + struct ext4_super_block *es = sbi->s_es; + ext4_fsblk_t rsv_blocks; + ext4_fsblk_t nblocks; + + nblocks = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) - + 
percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter); + nblocks = EXT4_C2B(sbi, nblocks); + rsv_blocks = ext4_r_blocks_count(es) + + EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters)); + if (nblocks < rsv_blocks) + nblocks = 0; + else + nblocks -= rsv_blocks; + *ncount = nblocks; +} + static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { -- 1.7.9.5 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756306AbbFPQVz (ORCPT ); Tue, 16 Jun 2015 12:21:55 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:36332 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751163AbbFPQVx (ORCPT ); Tue, 16 Jun 2015 12:21:53 -0400 Date: Tue, 16 Jun 2015 17:21:47 +0100 From: Al Viro To: Beata Michalska Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Message-ID: <20150616162147.GA17109@ZenIV.linux.org.uk> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska@samsung.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: > Introduce configurable generic interface for file > system-wide event notifications, to provide file > systems with a common way of reporting any potential > issues as they emerge. 
>
> The notifications are to be issued through generic
> netlink interface by newly introduced multicast group.
>
> Threshold notifications have been included, allowing
> triggering an event whenever the amount of free space drops
> below a certain level - or levels to be more precise as two
> of them are being supported: the lower and the upper range.
> The notifications work both ways: once the threshold level
> has been reached, an event shall be generated whenever
> the number of available blocks goes up again re-activating
> the threshold.
>
> The interface has been exposed through a vfs. Once mounted,
> it serves as an entry point for the set-up where one can
> register for particular file system events.

Hmm...

1) what happens if two processes write to that file at the same time,
trying to create an entry for the same fs? WARN_ON() and fail for one
of them if they race?

2) what happens if fs is mounted more than once (e.g. in different
namespaces, or bound at different mountpoints, or just plain mounted
several times in different places) and we add an event for each?
More specifically, what should happen when one of those gets unmounted?

3) what's the meaning of ->active? Is that "fs_drop_trace_entry() hadn't
been called yet" flag? Unless I'm misreading it, we can very well get
explicit removal race with umount, resulting in cleanup_mnt() returning
from fs_event_mount_dropped() before the first process (i.e. write
asking to remove that entry) gets around to its deactivate_super(),
ending up with umount(2) on a filesystem that isn't mounted anywhere
else reporting success to userland before the actual fs shutdown, which
is not a nice thing to do...

4) test in fs_event_mount_dropped() looks very odd - by that point we
are absolutely guaranteed to have ->mnt_ns == NULL. What's that supposed
to do?

Al, trying to figure out the lifetime rules in all of that...
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757617AbbFQGPz (ORCPT ); Wed, 17 Jun 2015 02:15:55 -0400 Received: from mail-wi0-f178.google.com ([209.85.212.178]:38331 "EHLO mail-wi0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751313AbbFQGPq (ORCPT ); Wed, 17 Jun 2015 02:15:46 -0400 MIME-Version: 1.0 X-Originating-IP: [213.57.247.249] In-Reply-To: <1434460173-18427-4-git-send-email-b.michalska@samsung.com> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-4-git-send-email-b.michalska@samsung.com> From: Leon Romanovsky Date: Wed, 17 Jun 2015 09:15:24 +0300 Message-ID: Subject: Re: [RFC v3 3/4] ext4: Add support for generic FS events To: Beata Michalska Cc: "linux-kernel@vger.kernel.org" , Linux-FSDevel , linux-api , Greg Kroah , jack , tytso , "adilger.kernel" , Hugh Dickins , lczerner , hch , linux-ext4 , Linux-MM , "kyungmin.park" , kmpark Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 16, 2015 at 4:09 PM, Beata Michalska wrote: > Add support for generic FS events including threshold > notifications, ENOSPC and remount as read-only warnings, > along with generic internal warnings/errors. 
> > Signed-off-by: Beata Michalska > --- > fs/ext4/balloc.c | 10 ++++++++-- > fs/ext4/ext4.h | 1 + > fs/ext4/inode.c | 2 +- > fs/ext4/mballoc.c | 6 +++++- > fs/ext4/resize.c | 1 + > fs/ext4/super.c | 39 +++++++++++++++++++++++++++++++++++++++ > 6 files changed, 55 insertions(+), 4 deletions(-) > > diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c > index e95b27a..a48450f 100644 > --- a/fs/ext4/balloc.c > +++ b/fs/ext4/balloc.c > @@ -569,6 +569,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi, > { > if (ext4_has_free_clusters(sbi, nclusters, flags)) { > percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters); > + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, nclusters)); > return 0; > } else > return -ENOSPC; Do you need to add "fs_event_notify(sb, FS_WARN_ENOSPC);" here too? > @@ -590,9 +591,10 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries) > { > if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) || > (*retries)++ > 3 || > - !EXT4_SB(sb)->s_journal) > + !EXT4_SB(sb)->s_journal) { > + fs_event_notify(sb, FS_WARN_ENOSPC); > return 0; > - > + } > jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id); > > return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal); > @@ -637,6 +639,10 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode, > dquot_alloc_block_nofail(inode, > EXT4_C2B(EXT4_SB(inode->i_sb), ar.len)); > } > + > + if (*errp == -ENOSPC) > + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC_META); > + > return ret; > } > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 163afe2..7d75ff9 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -2542,6 +2542,7 @@ void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, > if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) > percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); > set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); > + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, grp->bb_free)); > } > > /* > diff --git a/fs/ext4/inode.c 
b/fs/ext4/inode.c > index 5cb9a21..2a7af0f 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -1238,7 +1238,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free) > percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free); > > spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); > - > + fs_event_free_space(sbi->s_sb, to_free); > dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free)); > } > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c > index 24a4b6d..c2df6f0 100644 > --- a/fs/ext4/mballoc.c > +++ b/fs/ext4/mballoc.c > @@ -4511,6 +4511,9 @@ out: > kmem_cache_free(ext4_ac_cachep, ac); > if (inquota && ar->len < inquota) > dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - ar->len)); > + if (reserv_clstrs && ar->len < reserv_clstrs) > + fs_event_free_space(sbi->s_sb, > + EXT4_C2B(sbi, reserv_clstrs - ar->len)); > if (!ar->len) { > if ((ar->flags & EXT4_MB_DELALLOC_RESERVED) == 0) > /* release all the reserved blocks if non delalloc */ > @@ -4848,7 +4851,7 @@ do_more: > if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) > dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); > percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters); > - > + fs_event_free_space(sb, EXT4_C2B(sbi, count_clusters)); > ext4_mb_unload_buddy(&e4b); > > /* We dirtied the bitmap block */ > @@ -4982,6 +4985,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, > ext4_unlock_group(sb, block_group); > percpu_counter_add(&sbi->s_freeclusters_counter, > EXT4_NUM_B2C(sbi, blocks_freed)); > + fs_event_free_space(sb, blocks_freed); > > if (sbi->s_log_groups_per_flex) { > ext4_group_t flex_group = ext4_flex_group(sbi, block_group); > diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c > index 8a8ec62..dbf08d6 100644 > --- a/fs/ext4/resize.c > +++ b/fs/ext4/resize.c > @@ -1378,6 +1378,7 @@ static void ext4_update_super(struct super_block *sb, > EXT4_NUM_B2C(sbi, free_blocks)); > percpu_counter_add(&sbi->s_freeinodes_counter, > 
EXT4_INODES_PER_GROUP(sb) * flex_gd->count); > + fs_event_free_space(sb, free_blocks - reserved_blocks); > > ext4_debug("free blocks count %llu", > percpu_counter_read(&sbi->s_freeclusters_counter)); > diff --git a/fs/ext4/super.c b/fs/ext4/super.c > index e061e66..108b667 100644 > --- a/fs/ext4/super.c > +++ b/fs/ext4/super.c > @@ -585,6 +585,8 @@ void __ext4_abort(struct super_block *sb, const char *function, > if (EXT4_SB(sb)->s_journal) > jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); > save_error_info(sb, function, line); > + fs_event_notify(sb, FS_ERR_REMOUNT_RO); > + > } > if (test_opt(sb, ERRORS_PANIC)) > panic("EXT4-fs panic from previous error\n"); > @@ -1083,6 +1085,12 @@ static const struct quotactl_ops ext4_qctl_operations = { > }; > #endif > > +static void ext4_trace_query(struct super_block *sb, u64 *ncount); > + > +static const struct fs_trace_operations ext4_trace_ops = { > + .query = ext4_trace_query, > +}; > + > static const struct super_operations ext4_sops = { > .alloc_inode = ext4_alloc_inode, > .destroy_inode = ext4_destroy_inode, > @@ -3398,11 +3406,20 @@ static int ext4_reserve_clusters(struct ext4_sb_info *sbi, ext4_fsblk_t count) > { > ext4_fsblk_t clusters = ext4_blocks_count(sbi->s_es) >> > sbi->s_cluster_bits; > + ext4_fsblk_t current_resv; > > if (count >= clusters) > return -EINVAL; > > + current_resv = atomic64_read(&sbi->s_resv_clusters); > atomic64_set(&sbi->s_resv_clusters, count); > + > + if (count > current_resv) > + fs_event_alloc_space(sbi->s_sb, > + EXT4_C2B(sbi, count - current_resv)); > + else > + fs_event_free_space(sbi->s_sb, > + EXT4_C2B(sbi, current_resv - count)); > return 0; > } > > @@ -3966,6 +3983,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) > sb->s_qcop = &ext4_qctl_operations; > sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP; > #endif > + sb->s_etrace.ops = &ext4_trace_ops; > + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; > + > memcpy(sb->s_uuid, es->s_uuid, 
sizeof(es->s_uuid)); > > INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */ > @@ -5438,6 +5458,25 @@ out: > > #endif > > +static void ext4_trace_query(struct super_block *sb, u64 *ncount) > +{ > + struct ext4_sb_info *sbi = EXT4_SB(sb); > + struct ext4_super_block *es = sbi->s_es; > + ext4_fsblk_t rsv_blocks; > + ext4_fsblk_t nblocks; > + > + nblocks = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) - > + percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter); > + nblocks = EXT4_C2B(sbi, nblocks); > + rsv_blocks = ext4_r_blocks_count(es) + > + EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters)); > + if (nblocks < rsv_blocks) > + nblocks = 0; > + else > + nblocks -= rsv_blocks; > + *ncount = nblocks; > +} > + > static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags, > const char *dev_name, void *data) > { > -- > 1.7.9.5 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . 
> Don't email: email@kvack.org

--
Leon Romanovsky | Independent Linux Consultant
www.leon.nu | leon@leon.nu

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752955AbbFQJXP (ORCPT ); Wed, 17 Jun 2015 05:23:15 -0400
Received: from mailout1.w1.samsung.com ([210.118.77.11]:44461 "EHLO
	mailout1.w1.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751166AbbFQJXH (ORCPT ); Wed, 17 Jun 2015 05:23:07 -0400
X-AuditID: cbfec7f5-f794b6d000001495-e8-55813c771631
Message-id: <55813C69.2040401@samsung.com>
Date: Wed, 17 Jun 2015 11:22:49 +0200
From: Beata Michalska
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130804 Thunderbird/17.0.8
MIME-version: 1.0
To: Al Viro
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu,
	adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com,
	hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org,
	kyungmin.park@samsung.com, kmpark@infradead.org
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
	<1434460173-18427-2-git-send-email-b.michalska@samsung.com>
	<20150616162147.GA17109@ZenIV.linux.org.uk>
In-reply-to: <20150616162147.GA17109@ZenIV.linux.org.uk>
Content-type: text/plain; charset=UTF-8
Content-transfer-encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

On 06/16/2015 06:21 PM, Al Viro wrote:
> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote:
>> Introduce configurable generic interface for file
>> system-wide event notifications, to provide file
>> systems with a common way of reporting any potential
>> issues as they emerge.
>>
>> The notifications are to be issued through generic
>> netlink interface by newly introduced multicast group.
>>
>> Threshold notifications have been included, allowing
>> triggering an event whenever the amount of free space drops
>> below a certain level - or levels to be more precise as two
>> of them are being supported: the lower and the upper range.
>> The notifications work both ways: once the threshold level
>> has been reached, an event shall be generated whenever
>> the number of available blocks goes up again re-activating
>> the threshold.
>>
>> The interface has been exposed through a vfs. Once mounted,
>> it serves as an entry point for the set-up where one can
>> register for particular file system events.
>
> Hmm...
>
> 1) what happens if two processes write to that file at the same time,
> trying to create an entry for the same fs? WARN_ON() and fail for one
> of them if they race?
>

There are some limits here - I admit. The entries in the config file might
be overwritten at any time - there is no support for multiple config
entries for the same mounted fs.
This is mainly due to the threshold notifications: handling potentially
numerous threshold limits each time the number of available blocks changes
didn't seem like a good idea. So this is more like a global config,
resembling the fs-related tuning options in sysfs.

> 2) what happens if fs is mounted more than once (e.g. in different
> namespaces, or bound at different mountpoints, or just plain mounted
> several times in different places) and we add an event for each?
> More specifically, what should happen when one of those gets unmounted?
>

Each write to that file is handled within the current namespace.
Setting up an entry for a mount point from a different mnt namespace
requires switching to that ns. As for bound mounts: the entry exists until
the mount point it has been registered with is detached. The events can
only be registered for one of the mount points, as they are tied to the
super block - so one cannot have a separate config entry for each bound
mount.

> 3) what's the meaning of ->active? Is that "fs_drop_trace_entry() hadn't
> been called yet" flag? Unless I'm misreading it, we can very well get
> explicit removal race with umount, resulting in cleanup_mnt() returning
> from fs_event_mount_dropped() before the first process (i.e. write
> asking to remove that entry) gets around to its deactivate_super(),
> ending up with umount(2) on a filesystem that isn't mounted anywhere
> else reporting success to userland before the actual fs shutdown, which
> is not a nice thing to do...
>

'active' simply means that the entry for a given mounted fs is still
valid, in the sense that the events are still required: the entry in the
config file has not been removed. When the trace is being removed, its
'active' field gets invalidated to mark that the events for the related fs
are no longer needed. deactivate_super() should get called only once,
dropping the reference acquired while creating the entry
(fs_new_trace_entry).
While in fs_drop_trace_entry, the lock is held (in both cases: unmount and
explicit entry removal). fs_drop_trace_entry will silently skip all the
clean-up if the entry is inactive.
I might be missing something here, though. If so, I would really
appreciate some more of your comments.

> 4) test in fs_event_mount_dropped() looks very odd - by that point we
> are absolutely guaranteed to have ->mnt_ns == NULL. What's that supposed
> to do?
>

I have totally missed the fact that the mnt namespace pointer is
invalidated during umount_tree() - I cannot really explain how that
happened. So thank you for pointing that out. This should simply be
checking whether it's still valid. The verification is needed in case the
mount that is being detached is not the one the events have been
registered with, as they refer to the fs and not to a particular mount
point. This is the case with the mnt namespaces: let's assume one
registers for events for a particular mounted fs in the init mnt
namespace, then a new mnt namespace is created with the shared mount
points being cloned, so the same mount point exists in both namespaces.
Now if this mount point gets detached, either through umount or while the
mnt namespace is being swept out, the entry in the init mnt namespace
should remain untouched. The same applies the other way round.

>
> Al, trying to figure out the lifetime rules in all of that...
>

Best Regards
Beata

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754367AbbFQJZM (ORCPT ); Wed, 17 Jun 2015 05:25:12 -0400
Received: from mailout4.w1.samsung.com ([210.118.77.14]:45540 "EHLO
	mailout4.w1.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752178AbbFQJZJ (ORCPT ); Wed, 17 Jun 2015 05:25:09 -0400
X-AuditID: cbfec7f5-f794b6d000001495-41-55813cf2bfa5
Message-id: <55813CF0.6010602@samsung.com>
Date: Wed, 17 Jun 2015 11:25:04 +0200
From: Beata Michalska
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130804 Thunderbird/17.0.8
MIME-version: 1.0
To: Leon Romanovsky
Cc: "linux-kernel@vger.kernel.org" , Linux-FSDevel , linux-api ,
	Greg Kroah , jack , tytso , "adilger.kernel" , Hugh Dickins ,
	lczerner , hch , linux-ext4 , Linux-MM , "kyungmin.park" , kmpark
Subject: Re: [RFC v3 3/4] ext4: Add support for generic FS events
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
	<1434460173-18427-4-git-send-email-b.michalska@samsung.com>
In-reply-to:
Content-type: text/plain; charset=UTF-8
Content-transfer-encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/17/2015 08:15 AM, Leon Romanovsky wrote:
> On Tue, Jun 16, 2015 at 4:09 PM, Beata Michalska
> wrote:
>> Add support for generic FS events including threshold
>> notifications, ENOSPC and remount as read-only warnings,
>> along with generic internal warnings/errors.
>>
>> Signed-off-by: Beata Michalska
>> ---
>>  fs/ext4/balloc.c  |   10 ++++++++--
>>  fs/ext4/ext4.h    |    1 +
>>  fs/ext4/inode.c   |    2 +-
>>  fs/ext4/mballoc.c |    6 +++++-
>>  fs/ext4/resize.c  |    1 +
>>  fs/ext4/super.c   |   39 +++++++++++++++++++++++++++++++++++++++
>>  6 files changed, 55 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
>> index e95b27a..a48450f 100644
>> --- a/fs/ext4/balloc.c
>> +++ b/fs/ext4/balloc.c
>> @@ -569,6 +569,7 @@ int ext4_claim_free_clusters(struct ext4_sb_info *sbi,
>>  {
>>  	if (ext4_has_free_clusters(sbi, nclusters, flags)) {
>>  		percpu_counter_add(&sbi->s_dirtyclusters_counter, nclusters);
>> +		fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, nclusters));
>>  		return 0;
>>  	} else
>>  		return -ENOSPC;
> Do you need to add "fs_event_notify(sb, FS_WARN_ENOSPC);" here too?

Yeap, I've missed that one.
Thank You.
BR Beata > >> @@ -590,9 +591,10 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries) >> { >> if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) || >> (*retries)++ > 3 || >> - !EXT4_SB(sb)->s_journal) >> + !EXT4_SB(sb)->s_journal) { >> + fs_event_notify(sb, FS_WARN_ENOSPC); >> return 0; >> - >> + } >> jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id); >> >> return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal); >> @@ -637,6 +639,10 @@ ext4_fsblk_t ext4_new_meta_blocks(handle_t *handle, struct inode *inode, >> dquot_alloc_block_nofail(inode, >> EXT4_C2B(EXT4_SB(inode->i_sb), ar.len)); >> } >> + >> + if (*errp == -ENOSPC) >> + fs_event_notify(inode->i_sb, FS_WARN_ENOSPC_META); >> + >> return ret; >> } >> >> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h >> index 163afe2..7d75ff9 100644 >> --- a/fs/ext4/ext4.h >> +++ b/fs/ext4/ext4.h >> @@ -2542,6 +2542,7 @@ void ext4_mark_group_corrupted(struct ext4_sb_info *sbi, >> if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp)) >> percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free); >> set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state); >> + fs_event_alloc_space(sbi->s_sb, EXT4_C2B(sbi, grp->bb_free)); >> } >> >> /* >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index 5cb9a21..2a7af0f 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -1238,7 +1238,7 @@ static void ext4_da_release_space(struct inode *inode, int to_free) >> percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free); >> >> spin_unlock(&EXT4_I(inode)->i_block_reservation_lock); >> - >> + fs_event_free_space(sbi->s_sb, to_free); >> dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free)); >> } >> >> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c >> index 24a4b6d..c2df6f0 100644 >> --- a/fs/ext4/mballoc.c >> +++ b/fs/ext4/mballoc.c >> @@ -4511,6 +4511,9 @@ out: >> kmem_cache_free(ext4_ac_cachep, ac); >> if (inquota && ar->len < inquota) >> dquot_free_block(ar->inode, EXT4_C2B(sbi, inquota - 
ar->len)); >> + if (reserv_clstrs && ar->len < reserv_clstrs) >> + fs_event_free_space(sbi->s_sb, >> + EXT4_C2B(sbi, reserv_clstrs - ar->len)); >> if (!ar->len) { >> if ((ar->flags & EXT4_MB_DELALLOC_RESERVED) == 0) >> /* release all the reserved blocks if non delalloc */ >> @@ -4848,7 +4851,7 @@ do_more: >> if (!(flags & EXT4_FREE_BLOCKS_NO_QUOT_UPDATE)) >> dquot_free_block(inode, EXT4_C2B(sbi, count_clusters)); >> percpu_counter_add(&sbi->s_freeclusters_counter, count_clusters); >> - >> + fs_event_free_space(sb, EXT4_C2B(sbi, count_clusters)); >> ext4_mb_unload_buddy(&e4b); >> >> /* We dirtied the bitmap block */ >> @@ -4982,6 +4985,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb, >> ext4_unlock_group(sb, block_group); >> percpu_counter_add(&sbi->s_freeclusters_counter, >> EXT4_NUM_B2C(sbi, blocks_freed)); >> + fs_event_free_space(sb, blocks_freed); >> >> if (sbi->s_log_groups_per_flex) { >> ext4_group_t flex_group = ext4_flex_group(sbi, block_group); >> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c >> index 8a8ec62..dbf08d6 100644 >> --- a/fs/ext4/resize.c >> +++ b/fs/ext4/resize.c >> @@ -1378,6 +1378,7 @@ static void ext4_update_super(struct super_block *sb, >> EXT4_NUM_B2C(sbi, free_blocks)); >> percpu_counter_add(&sbi->s_freeinodes_counter, >> EXT4_INODES_PER_GROUP(sb) * flex_gd->count); >> + fs_event_free_space(sb, free_blocks - reserved_blocks); >> >> ext4_debug("free blocks count %llu", >> percpu_counter_read(&sbi->s_freeclusters_counter)); >> diff --git a/fs/ext4/super.c b/fs/ext4/super.c >> index e061e66..108b667 100644 >> --- a/fs/ext4/super.c >> +++ b/fs/ext4/super.c >> @@ -585,6 +585,8 @@ void __ext4_abort(struct super_block *sb, const char *function, >> if (EXT4_SB(sb)->s_journal) >> jbd2_journal_abort(EXT4_SB(sb)->s_journal, -EIO); >> save_error_info(sb, function, line); >> + fs_event_notify(sb, FS_ERR_REMOUNT_RO); >> + >> } >> if (test_opt(sb, ERRORS_PANIC)) >> panic("EXT4-fs panic from previous error\n"); >> @@ -1083,6 
+1085,12 @@ static const struct quotactl_ops ext4_qctl_operations = { >> }; >> #endif >> >> +static void ext4_trace_query(struct super_block *sb, u64 *ncount); >> + >> +static const struct fs_trace_operations ext4_trace_ops = { >> + .query = ext4_trace_query, >> +}; >> + >> static const struct super_operations ext4_sops = { >> .alloc_inode = ext4_alloc_inode, >> .destroy_inode = ext4_destroy_inode, >> @@ -3398,11 +3406,20 @@ static int ext4_reserve_clusters(struct ext4_sb_info *sbi, ext4_fsblk_t count) >> { >> ext4_fsblk_t clusters = ext4_blocks_count(sbi->s_es) >> >> sbi->s_cluster_bits; >> + ext4_fsblk_t current_resv; >> >> if (count >= clusters) >> return -EINVAL; >> >> + current_resv = atomic64_read(&sbi->s_resv_clusters); >> atomic64_set(&sbi->s_resv_clusters, count); >> + >> + if (count > current_resv) >> + fs_event_alloc_space(sbi->s_sb, >> + EXT4_C2B(sbi, count - current_resv)); >> + else >> + fs_event_free_space(sbi->s_sb, >> + EXT4_C2B(sbi, current_resv - count)); >> return 0; >> } >> >> @@ -3966,6 +3983,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) >> sb->s_qcop = &ext4_qctl_operations; >> sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP; >> #endif >> + sb->s_etrace.ops = &ext4_trace_ops; >> + sb->s_etrace.events_cap_mask = FS_EVENTS_ALL; >> + >> memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid)); >> >> INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */ >> @@ -5438,6 +5458,25 @@ out: >> >> #endif >> >> +static void ext4_trace_query(struct super_block *sb, u64 *ncount) >> +{ >> + struct ext4_sb_info *sbi = EXT4_SB(sb); >> + struct ext4_super_block *es = sbi->s_es; >> + ext4_fsblk_t rsv_blocks; >> + ext4_fsblk_t nblocks; >> + >> + nblocks = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) - >> + percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter); >> + nblocks = EXT4_C2B(sbi, nblocks); >> + rsv_blocks = ext4_r_blocks_count(es) + >> + EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters)); >> + if 
(nblocks < rsv_blocks)
>> +		nblocks = 0;
>> +	else
>> +		nblocks -= rsv_blocks;
>> +	*ncount = nblocks;
>> +}
>> +
>>  static struct dentry *ext4_mount(struct file_system_type *fs_type, int flags,
>>  		       const char *dev_name, void *data)
>>  {
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: email@kvack.org
>
>
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755077AbbFQXIX (ORCPT ); Wed, 17 Jun 2015 19:08:23 -0400
Received: from ipmail05.adl6.internode.on.net ([150.101.137.143]:51557 "EHLO
	ipmail05.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754036AbbFQXGw (ORCPT ); Wed, 17 Jun 2015 19:06:52 -0400
X-IronPort-Anti-Spam-Filtered: true
Date: Thu, 18 Jun 2015 09:06:05 +1000
From: Dave Chinner
To: Beata Michalska
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu,
	adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com,
	hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org,
	kyungmin.park@samsung.com, kmpark@infradead.org
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
Message-ID: <20150617230605.GK10224@dastard>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com>
	<1434460173-18427-2-git-send-email-b.michalska@samsung.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: > Introduce configurable generic interface for file > system-wide event notifications, to provide file > systems with a common way of reporting any potential > issues as they emerge. > > The notifications are to be issued through generic > netlink interface by newly introduced multicast group. > > Threshold notifications have been included, allowing > triggering an event whenever the amount of free space drops > below a certain level - or levels to be more precise as two > of them are being supported: the lower and the upper range. > The notifications work both ways: once the threshold level > has been reached, an event shall be generated whenever > the number of available blocks goes up again re-activating > the threshold. > > The interface has been exposed through a vfs. Once mounted, > it serves as an entry point for the set-up where one can > register for particular file system events. > > Signed-off-by: Beata Michalska This has massive scalability problems: > + 4.3 Threshold notifications: > + > + #include > + void fs_event_alloc_space(struct super_block *sb, u64 ncount); > + void fs_event_free_space(struct super_block *sb, u64 ncount); > + > + Each filesystme supporting the threshold notifications should call > + fs_event_alloc_space/fs_event_free_space respectively whenever the > + amount of available blocks changes. > + - sb: the filesystem's super block > + - ncount: number of blocks being acquired/released ... here. > + Note that to properly handle the threshold notifications the fs events > + interface needs to be kept up to date by the filesystems. Each should > + register fs_trace_operations to enable querying the current number of > + available blocks. Have you noticed that the filesystems have percpu counters for tracking global space usage? 
There's good reason for that - taking a spinlock in such a hot accounting path causes severe contention. > +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id) > +{ > + size_t size = nla_total_size(sizeof(u32)) * 2 + > + nla_total_size(sizeof(u64)); > + > + fs_netlink_send_event(size, event_id, create_common_msg, en); > +} > + > +static void fs_event_send_thresh(struct fs_trace_entry *en, > + unsigned int event_id) > +{ > + size_t size = nla_total_size(sizeof(u32)) * 2 + > + nla_total_size(sizeof(u64)) * 2; > + > + fs_netlink_send_event(size, event_id, create_thresh_msg, en); > +} > + > +void fs_event_notify(struct super_block *sb, unsigned int event_id) > +{ > + struct fs_trace_entry *en; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; > + > + spin_lock(&en->lock); > + if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC)) > + fs_event_send(en, event_id); > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > +} > +EXPORT_SYMBOL(fs_event_notify); > + > +void fs_event_alloc_space(struct super_block *sb, u64 ncount) > +{ > + struct fs_trace_entry *en; > + s64 count; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; Adds an atomic write to get the trace entry, > + spin_lock(&en->lock); a spin lock to lock the entry, > + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) > + goto leave; > + /* > + * we shouldn't drop below 0 here, > + * unless there is a sync issue somewhere (?) > + */ > + count = en->th.avail_space - ncount; > + en->th.avail_space = count < 0 ? 
0 : count; > + > + if (en->th.avail_space > en->th.lrange) > + /* Not 'even' close - leave */ > + goto leave; > + > + if (en->th.avail_space > en->th.urange) { > + /* Close enough - the lower range has been reached */ > + if (!(en->th.state & THRESH_LR_BEYOND)) { > + /* Send notification */ > + fs_event_send_thresh(en, FS_THR_LRBELOW); > + en->th.state &= ~THRESH_LR_BELOW; > + en->th.state |= THRESH_LR_BEYOND; > + } > + goto leave; Then puts the entire netlink send path inside this spinlock, which includes memory allocation and all sorts of non-filesystem code paths. And it may be inside critical filesystem locks as well.... Apart from the serialisation problem of the locking, adding memory allocation and the network send path to filesystem code that is effectively considered "innermost" filesystem code is going to have all sorts of problems for various filesystems. In the XFS case, we simply cannot execute this sort of function in the places where we update global space accounting. As it is, I think the basic concept of separate tracking of free space if fundamentally flawed. What I think needs to be done is that filesystems need access to the thresholds for events, and then the filesystems call fs_event_send_thresh() themselves from appropriate contexts (ie. without compromising locking, scalability, memory allocation recursion constraints, etc). e.g. instead of tracking every change in free space, a filesystem might execute this once every few seconds from a workqueue: event = fs_event_need_space_warning(sb, ) if (event) fs_event_send_thresh(sb, event); User still gets warnings about space usage, but there's no runtime overhead or problems with lock/memory allocation contexts, etc. Cheers, Dave. 
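The periodic check suggested above can be sketched in plain C. This is only an illustration under assumptions: it follows the patch's two-level layout, where lrange is the larger, first-warning level and urange the smaller, critical one, and every name in it (thresh_state, thresh_check, the THR_* and ZONE_* values) is hypothetical rather than taken from the posted code. The function only decides which event, if any, is due, so the caller stays free to send it from whatever context suits the filesystem.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event codes; the patch defines its own FS_THR_* values. */
enum thr_event {
	THR_NONE = 0,
	THR_LOWER_BELOW,	/* dropped to or below the lower (first) level */
	THR_UPPER_BELOW,	/* dropped to or below the upper (critical) level */
	THR_UPPER_ABOVE,	/* recovered above the upper level */
	THR_LOWER_ABOVE,	/* recovered above the lower level */
};

/* Ordered from least to most severe so zones can be compared. */
enum thr_zone { ZONE_OK = 0, ZONE_LOWER, ZONE_UPPER };

struct thresh_state {
	uint64_t lrange;	/* first warning level, in blocks */
	uint64_t urange;	/* critical level, in blocks (<= lrange) */
	enum thr_zone zone;	/* zone seen by the previous check */
};

/* Map an available-block count to a zone, using the patch's comparisons. */
static enum thr_zone classify(const struct thresh_state *th, uint64_t avail)
{
	if (avail > th->lrange)
		return ZONE_OK;
	if (avail > th->urange)
		return ZONE_LOWER;
	return ZONE_UPPER;
}

/*
 * Compare the current zone with the one recorded last time and report
 * a single event for the transition. When two levels are crossed
 * between checks, only the zone finally reached is reported.
 */
enum thr_event thresh_check(struct thresh_state *th, uint64_t avail)
{
	enum thr_zone now;
	enum thr_event ev = THR_NONE;

	assert(th->lrange >= th->urange);
	now = classify(th, avail);
	if (now > th->zone)
		ev = (now == ZONE_UPPER) ? THR_UPPER_BELOW : THR_LOWER_BELOW;
	else if (now < th->zone)
		ev = (now == ZONE_OK) ? THR_LOWER_ABOVE : THR_UPPER_ABOVE;
	th->zone = now;
	return ev;
}
```

Because the state is only touched by the periodic caller, no lock is needed in the allocation path; the trade-off is that a short dip between two checks goes unreported.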
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754646AbbFRLRk (ORCPT ); Thu, 18 Jun 2015 07:17:40 -0400 Received: from mail-pa0-f52.google.com ([209.85.220.52]:34209 "EHLO mail-pa0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752998AbbFRLRg (ORCPT ); Thu, 18 Jun 2015 07:17:36 -0400 Message-ID: <5582A8C1.3000002@gmail.com> Date: Thu, 18 Jun 2015 19:17:21 +0800 From: Kinglong Mee User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Beata Michalska , linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org CC: greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org, kinglongmee@gmail.com Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> In-Reply-To: <1434460173-18427-2-git-send-email-b.michalska@samsung.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 6/16/2015 9:09 PM, Beata Michalska wrote: > Introduce configurable generic interface for file > system-wide event notifications, to provide file > systems with a common way of reporting any potential > issues as they emerge. ... snip ... > + > +Sample request could look like the following: > + > + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config > + > +Multiple request might be specified provided they are separated with semicolon. > + > +The configuration itself might be modified at any time. 
One can add/remove > +particular event types for given fielsystem, modify the threshold levels, > +and remove single or all entries from the 'config' file. > + > + - Adding new event type: > + > + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config > + > +(Note that is is enough to provide the event type to be enabled without Should be "Note that it is ... " here ? > +the already set ones.) > + > + - Removing event type: > + > + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config > + > + - Updating threshold limits: > + > + $ echo MOUNT T L1 L2 > /sys/fs/events/config > + > + - Removing single entry: > + > + $ echo '!MOUNT' > /sys/fs/events/config > + > + - Removing all entries: > + > + $ echo > /sys/fs/events/config > + > +Reading the file will list all registered entries with their current set-up > +along with some additional info like the filesystem type and the backing device > +name if available. > + > +Final, though a very important note on the configuration: when and if the > +actual events are being triggered falls way beyond the scope of the generic > +filesystem events interface. It is up to a particular filesystem > +implementation which events are to be supported - if any at all. So if > +given filesystem does not support the event notifications, an attempt to > +enable those through 'config' file will fail. > + > + > +3. The generic netlink interface support: > +========================================= > + > +Whenever an event notification is triggered (by given filesystem) the current > +configuration is being validated to decide whether a userpsace notification > +should be launched. If there has been no request (in a mean of 'config' file > +entry) for given event, one will be silently disregarded. If, on the other > +hand, someone is 'watching' given filesystem for specific events, a generic > +netlink message will be sent. 
A dedicated multicast group has been provided > +solely for this purpose so in order to receive such notifications, one should > +subscribe to this new multicast group. As for now only the init network > +namespace is being supported. > + > +3.1 Message format > + > +The FS_NL_C_EVENT shall be stored within the generic netlink message header > +as the command field. The message payload will provide more detailed info: > +the backing device major and minor numbers, the event code and the id of > +the process which action led to the event occurrence. In case of threshold > +notifications, the current number of available blocks will be included > +in the payload as well. > + > + > + 0 1 2 3 > + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | NETLINK MESSAGE HEADER | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | GENERIC NETLINK MESSAGE HEADER | > + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | cmd, not cdm. > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | Optional user specific message header | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | GENERIC MESSAGE PAYLOAD: | > + +---------------------------------------------------------------+ > + | FS_NL_A_EVENT_ID (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DEV_MAJOR (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DEV_MINOR (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_CAUSED_ID (NLA_U32) | Should be NLA_U64 ? The following uses as, + if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current)))) + return -EINVAL; Also, I'd like FS_NL_A_CAUSED_PID than FS_NL_A_CAUSED_ID. 
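The U32 versus U64 question above also changes how much buffer space a message needs. The sketch below reproduces the attribute size arithmetic in userspace, assuming the usual netlink attribute layout (a 4-byte header and 4-byte alignment); the SKETCH_* macros and both helper names are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the kernel's netlink attribute layout constants. */
#define SKETCH_NLA_ALIGNTO	4u
#define SKETCH_NLA_HDRLEN	4u	/* u16 length + u16 type */

/* Header plus payload, rounded up to the 4-byte attribute alignment. */
static size_t sketch_nla_total_size(size_t payload)
{
	/* Guard against wrap-around in the rounding below. */
	assert(payload <= SIZE_MAX - SKETCH_NLA_HDRLEN - SKETCH_NLA_ALIGNTO);
	return (SKETCH_NLA_HDRLEN + payload + SKETCH_NLA_ALIGNTO - 1) &
	       ~(size_t)(SKETCH_NLA_ALIGNTO - 1);
}

/*
 * Space for the four common attributes from the figure above.
 * The process id attribute is sized as u32 or u64 on request.
 */
size_t common_msg_size(int caused_id_is_u64)
{
	size_t size = 0;

	size += sketch_nla_total_size(sizeof(uint32_t));	/* event id */
	size += sketch_nla_total_size(sizeof(uint32_t));	/* dev major */
	size += sketch_nla_total_size(sizeof(uint32_t));	/* dev minor */
	size += sketch_nla_total_size(caused_id_is_u64 ?
				      sizeof(uint64_t) : sizeof(uint32_t));
	return size;
}
```

Under these assumptions a u64 attribute costs four bytes more than a u32 one, so the documented type and the nla_put_*() call need to agree with the size passed to the allocator.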
> + +---------------------------------------------------------------+ > + | FS_NL_A_DATA (NLA_U64) | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + > + > +The above figure is based on: > + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format > + > + ... snip... > + seq_putc(m, ' '); > + if (sb->s_op->show_devname) { > + sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root); > + } else { > + seq_escape(m, r_mnt->mnt_devname ? r_mnt->mnt_devname : "none", > + " \t\n\\"); > + } > + seq_puts(m, " ("); > + > + nmask = en->notify; > + for (match = fs_etypes; match->pattern; ++match) { > + if (match->token & nmask) { > + seq_puts(m, match->pattern); Print here is better. if (match->pattern & FS_EVENT_THRESH) seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); > + nmask &= ~match->token; > + if (nmask) > + seq_putc(m, ','); > + } > + } > + seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); Don't print the lrange/urange (always be zero) when without FS_EVENT_THRESH. 
> + seq_puts(m, ")\n"); > + return 0; > +} > + > +static const struct seq_operations fs_trace_seq_ops = { > + .start = fs_trace_seq_start, > + .next = fs_trace_seq_next, > + .stop = fs_trace_seq_stop, > + .show = fs_trace_seq_show, > +}; > + > +static int fs_trace_open(struct inode *inode, struct file *file) > +{ > + return seq_open(file, &fs_trace_seq_ops); > +} > + > +static const struct file_operations fs_trace_fops = { > + .owner = THIS_MODULE, > + .open = fs_trace_open, > + .write = fs_trace_write, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = seq_release, > +}; > + > +static int fs_trace_init(void) > +{ > + fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0); > + if (!fs_trace_cachep) > + return -EINVAL; > + init_waitqueue_head(&trace_wq); > + return 0; > +} > + > +/* VFS support */ > +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen) > +{ > + int ret; > + static struct tree_descr desc[] = { > + [2] = { > + .name = "config", > + .ops = &fs_trace_fops, > + .mode = S_IWUSR | S_IRUGO, > + }, > + {""}, > + }; > + > + ret = simple_fill_super(sb, 0x7246332, desc); > + return !ret ? fs_trace_init() : ret; > +} > + > +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, > + int ntype, const char *dev_name, void *data) > +{ > + return mount_single(fs_type, ntype, data, fs_trace_fill_super); > +} > + > +static void fs_trace_kill_super(struct super_block *sb) > +{ > + /* > + * The rcu_barrier here will/should make sure all call_rcu > + * callbacks are completed - still there might be some active > + * trace objects in use which can make calling the > + * kmem_cache_destroy unsafe. So we wait until all traces > + * are finally released. 
> + */ > + fs_remove_all_traces(); > + rcu_barrier(); > + wait_event(trace_wq, !atomic_read(&stray_traces)); > + > + kmem_cache_destroy(fs_trace_cachep); > + kill_litter_super(sb); > +} > + > +static struct kset *fs_trace_kset; > + > +static struct file_system_type fs_trace_fstype = { > + .name = "fstrace", > + .mount = fs_trace_do_mount, > + .kill_sb = fs_trace_kill_super, > +}; > + > +static void __init fs_trace_vfs_init(void) > +{ > + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); > + > + if (!fs_trace_kset) > + return; > + > + if (!register_filesystem(&fs_trace_fstype)) { > + if (!fs_event_netlink_register()) > + return; > + unregister_filesystem(&fs_trace_fstype); > + } > + kset_unregister(fs_trace_kset); > +} > + > +static int __init fs_trace_evens_init(void) > +{ > + fs_trace_vfs_init(); > + return 0; > +}; > +module_init(fs_trace_evens_init); > + > diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h > new file mode 100644 > index 0000000..23f24c8 > --- /dev/null > +++ b/fs/events/fs_event.h > @@ -0,0 +1,22 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. 
> + */ > + > +#ifndef __GENERIC_FS_EVENTS_H > +#define __GENERIC_FS_EVENTS_H > + > +int fs_event_netlink_register(void); > +void fs_event_netlink_unregister(void); > + > +#endif /* __GENERIC_FS_EVENTS_H */ > diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c > new file mode 100644 > index 0000000..0c97eb7 > --- /dev/null > +++ b/fs/events/fs_event_netlink.c > @@ -0,0 +1,104 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. 
> + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "fs_event.h" > + > +static const struct genl_multicast_group fs_event_mcgroups[] = { > + { .name = FS_EVENTS_MCAST_GRP_NAME, }, > +}; > + > +static struct genl_family fs_event_family = { > + .id = GENL_ID_GENERATE, > + .name = FS_EVENTS_FAMILY_NAME, > + .version = 1, > + .maxattr = FS_NL_A_MAX, > + .mcgrps = fs_event_mcgroups, > + .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups), > +}; > + > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), > + void *cbdata) > +{ > + static atomic_t seq; > + struct sk_buff *skb; > + void *msg_head; > + int ret = 0; > + > + if (!size || !compose_msg) > + return -EINVAL; > + > + /* Skip if there are no listeners */ > + if (!genl_has_listeners(&fs_event_family, &init_net, 0)) > + return 0; > + > + if (event_id != FS_EVENT_NONE) > + size += nla_total_size(sizeof(u32)); > + size += nla_total_size(sizeof(u64)); What is this for ? 
thanks Kinglong Mee From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932577AbbFROuV (ORCPT ); Thu, 18 Jun 2015 10:50:21 -0400 Received: from mailout4.w1.samsung.com ([210.118.77.14]:52621 "EHLO mailout4.w1.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932508AbbFROuP (ORCPT ); Thu, 18 Jun 2015 10:50:15 -0400 X-AuditID: cbfec7f5-f794b6d000001495-41-5582daa578d9 Message-id: <5582DAA3.8080204@samsung.com> Date: Thu, 18 Jun 2015 16:50:11 +0200 From: Beata Michalska User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130804 Thunderbird/17.0.8 MIME-version: 1.0 To: Kinglong Mee Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <5582A8C1.3000002@gmail.com> In-reply-to: <5582A8C1.3000002@gmail.com> Content-type: text/plain; charset=windows-1252 Content-transfer-encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On 06/18/2015 01:17 PM, Kinglong Mee wrote: > On 6/16/2015 9:09 PM, Beata Michalska wrote: >> Introduce configurable generic interface for file >> system-wide event notifications, to provide file >> systems with a common way of reporting any potential >> issues as they emerge. > ... snip ... >> + >> +Sample request could look like the following: >> + >> + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config >> + >> +Multiple request might be specified provided they are separated with semicolon. >> + >> +The configuration itself might be modified at any time. One can add/remove >> +particular event types for given fielsystem, modify the threshold levels, >> +and remove single or all entries from the 'config' file. >> + >> + - Adding new event type: >> + >> + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config >> + >> +(Note that is is enough to provide the event type to be enabled without > > Should be "Note that it is ... " here ? Right > >> +the already set ones.) >> + >> + - Removing event type: >> + >> + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config >> + >> + - Updating threshold limits: >> + >> + $ echo MOUNT T L1 L2 > /sys/fs/events/config >> + >> + - Removing single entry: >> + >> + $ echo '!MOUNT' > /sys/fs/events/config >> + >> + - Removing all entries: >> + >> + $ echo > /sys/fs/events/config >> + >> +Reading the file will list all registered entries with their current set-up >> +along with some additional info like the filesystem type and the backing device >> +name if available.
>> + >> +Final, though a very important note on the configuration: when and if the >> +actual events are being triggered falls way beyond the scope of the generic >> +filesystem events interface. It is up to a particular filesystem >> +implementation which events are to be supported - if any at all. So if >> +given filesystem does not support the event notifications, an attempt to >> +enable those through 'config' file will fail. >> + >> + >> +3. The generic netlink interface support: >> +========================================= >> + >> +Whenever an event notification is triggered (by given filesystem) the current >> +configuration is being validated to decide whether a userpsace notification >> +should be launched. If there has been no request (in a mean of 'config' file >> +entry) for given event, one will be silently disregarded. If, on the other >> +hand, someone is 'watching' given filesystem for specific events, a generic >> +netlink message will be sent. A dedicated multicast group has been provided >> +solely for this purpose so in order to receive such notifications, one should >> +subscribe to this new multicast group. As for now only the init network >> +namespace is being supported. >> + >> +3.1 Message format >> + >> +The FS_NL_C_EVENT shall be stored within the generic netlink message header >> +as the command field. The message payload will provide more detailed info: >> +the backing device major and minor numbers, the event code and the id of >> +the process which action led to the event occurrence. In case of threshold >> +notifications, the current number of available blocks will be included >> +in the payload as well. 
>> + >> + >> + 0 1 2 3 >> + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | NETLINK MESSAGE HEADER | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | GENERIC NETLINK MESSAGE HEADER | >> + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | > > cmd, not cdm. ditto > >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | Optional user specific message header | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | GENERIC MESSAGE PAYLOAD: | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_EVENT_ID (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DEV_MAJOR (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DEV_MINOR (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_CAUSED_ID (NLA_U32) | > > Should be NLA_U64 ? The following uses as, > > + if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current)))) > + return -EINVAL; > Yes, or nla_put_u32 - either way my bad > Also, I'd like FS_NL_A_CAUSED_PID than FS_NL_A_CAUSED_ID. Alright > >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DATA (NLA_U64) | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + >> + >> +The above figure is based on: >> + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format >> + >> + > ... snip... >> + seq_putc(m, ' '); >> + if (sb->s_op->show_devname) { >> + sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root); >> + } else { >> + seq_escape(m, r_mnt->mnt_devname ? 
r_mnt->mnt_devname : "none", >> + " \t\n\\"); >> + } >> + seq_puts(m, " ("); >> + >> + nmask = en->notify; >> + for (match = fs_etypes; match->pattern; ++match) { >> + if (match->token & nmask) { >> + seq_puts(m, match->pattern); > > Print here is better. > > if (match->pattern & FS_EVENT_THRESH) > seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); > >> + nmask &= ~match->token; >> + if (nmask) >> + seq_putc(m, ','); >> + } >> + } >> + seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); > > Don't print the lrange/urange (always be zero) when without FS_EVENT_THRESH. > ditto >> + seq_puts(m, ")\n"); >> + return 0; >> +} >> + >> +static const struct seq_operations fs_trace_seq_ops = { >> + .start = fs_trace_seq_start, >> + .next = fs_trace_seq_next, >> + .stop = fs_trace_seq_stop, >> + .show = fs_trace_seq_show, >> +}; >> + >> +static int fs_trace_open(struct inode *inode, struct file *file) >> +{ >> + return seq_open(file, &fs_trace_seq_ops); >> +} >> + >> +static const struct file_operations fs_trace_fops = { >> + .owner = THIS_MODULE, >> + .open = fs_trace_open, >> + .write = fs_trace_write, >> + .read = seq_read, >> + .llseek = seq_lseek, >> + .release = seq_release, >> +}; >> + >> +static int fs_trace_init(void) >> +{ >> + fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0); >> + if (!fs_trace_cachep) >> + return -EINVAL; >> + init_waitqueue_head(&trace_wq); >> + return 0; >> +} >> + >> +/* VFS support */ >> +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen) >> +{ >> + int ret; >> + static struct tree_descr desc[] = { >> + [2] = { >> + .name = "config", >> + .ops = &fs_trace_fops, >> + .mode = S_IWUSR | S_IRUGO, >> + }, >> + {""}, >> + }; >> + >> + ret = simple_fill_super(sb, 0x7246332, desc); >> + return !ret ? 
fs_trace_init() : ret; >> +} >> + >> +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, >> + int ntype, const char *dev_name, void *data) >> +{ >> + return mount_single(fs_type, ntype, data, fs_trace_fill_super); >> +} >> + >> +static void fs_trace_kill_super(struct super_block *sb) >> +{ >> + /* >> + * The rcu_barrier here will/should make sure all call_rcu >> + * callbacks are completed - still there might be some active >> + * trace objects in use which can make calling the >> + * kmem_cache_destroy unsafe. So we wait until all traces >> + * are finally released. >> + */ >> + fs_remove_all_traces(); >> + rcu_barrier(); >> + wait_event(trace_wq, !atomic_read(&stray_traces)); >> + >> + kmem_cache_destroy(fs_trace_cachep); >> + kill_litter_super(sb); >> +} >> + >> +static struct kset *fs_trace_kset; >> + >> +static struct file_system_type fs_trace_fstype = { >> + .name = "fstrace", >> + .mount = fs_trace_do_mount, >> + .kill_sb = fs_trace_kill_super, >> +}; >> + >> +static void __init fs_trace_vfs_init(void) >> +{ >> + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); >> + >> + if (!fs_trace_kset) >> + return; >> + >> + if (!register_filesystem(&fs_trace_fstype)) { >> + if (!fs_event_netlink_register()) >> + return; >> + unregister_filesystem(&fs_trace_fstype); >> + } >> + kset_unregister(fs_trace_kset); >> +} >> + >> +static int __init fs_trace_evens_init(void) >> +{ >> + fs_trace_vfs_init(); >> + return 0; >> +}; >> +module_init(fs_trace_evens_init); >> + >> diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h >> new file mode 100644 >> index 0000000..23f24c8 >> --- /dev/null >> +++ b/fs/events/fs_event.h >> @@ -0,0 +1,22 @@ >> +/* >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. 
>> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. >> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + */ >> + >> +#ifndef __GENERIC_FS_EVENTS_H >> +#define __GENERIC_FS_EVENTS_H >> + >> +int fs_event_netlink_register(void); >> +void fs_event_netlink_unregister(void); >> + >> +#endif /* __GENERIC_FS_EVENTS_H */ >> diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c >> new file mode 100644 >> index 0000000..0c97eb7 >> --- /dev/null >> +++ b/fs/events/fs_event_netlink.c >> @@ -0,0 +1,104 @@ >> +/* >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. >> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. >> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. 
>> + */ >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include "fs_event.h" >> + >> +static const struct genl_multicast_group fs_event_mcgroups[] = { >> + { .name = FS_EVENTS_MCAST_GRP_NAME, }, >> +}; >> + >> +static struct genl_family fs_event_family = { >> + .id = GENL_ID_GENERATE, >> + .name = FS_EVENTS_FAMILY_NAME, >> + .version = 1, >> + .maxattr = FS_NL_A_MAX, >> + .mcgrps = fs_event_mcgroups, >> + .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups), >> +}; >> + >> +int fs_netlink_send_event(size_t size, unsigned int event_id, >> + int (*compose_msg)(struct sk_buff *skb, void *data), >> + void *cbdata) >> +{ >> + static atomic_t seq; >> + struct sk_buff *skb; >> + void *msg_head; >> + int ret = 0; >> + >> + if (!size || !compose_msg) >> + return -EINVAL; >> + >> + /* Skip if there are no listeners */ >> + if (!genl_has_listeners(&fs_event_family, &init_net, 0)) >> + return 0; >> + >> + if (event_id != FS_EVENT_NONE) >> + size += nla_total_size(sizeof(u32)); >> + size += nla_total_size(sizeof(u64)); > > What is this for ? 
> This should actually get removed :) > thanks > Kinglong Mee > Thank You, Best Regards Beata From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752319AbbFSAFd (ORCPT ); Thu, 18 Jun 2015 20:05:33 -0400 Received: from ipmail07.adl2.internode.on.net ([150.101.137.131]:27837 "EHLO ipmail07.adl2.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751958AbbFSAF0 (ORCPT ); Thu, 18 Jun 2015 20:05:26 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: A2DWBwDGW4NV//dLLHlcgxCBM4JQqQYBAQEBAQEGmXcCAgEBAoE4TQEBAQEBAYELhCMBAQQnExwhAhAIAw4KCSUPBSUDIRMbiBPGLgEBAQcCAR8YhgOFKoUGB4QrBYVYB44Ri0qBNo5xiAAmY4MoLDGCSAEBAQ Date: Fri, 19 Jun 2015 10:03:41 +1000 From: Dave Chinner To: Beata Michalska Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Message-ID: <20150619000341.GM10224@dastard> References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <20150617230605.GK10224@dastard> <55828064.5040301@samsung.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <55828064.5040301@samsung.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 18, 2015 at 10:25:08AM +0200, Beata Michalska wrote: > On 06/18/2015 01:06 AM, Dave Chinner wrote: > > On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: > >> Introduce configurable generic interface for file > >> system-wide event notifications, to provide file > >> 
systems with a common way of reporting any potential > >> issues as they emerge. > >> > >> The notifications are to be issued through generic > >> netlink interface by newly introduced multicast group. > >> > >> Threshold notifications have been included, allowing > >> triggering an event whenever the amount of free space drops > >> below a certain level - or levels to be more precise as two > >> of them are being supported: the lower and the upper range. > >> The notifications work both ways: once the threshold level > >> has been reached, an event shall be generated whenever > >> the number of available blocks goes up again re-activating > >> the threshold. > >> > >> The interface has been exposed through a vfs. Once mounted, > >> it serves as an entry point for the set-up where one can > >> register for particular file system events. > >> > >> Signed-off-by: Beata Michalska > > > > This has massive scalability problems: .... > > Have you noticed that the filesystems have percpu counters for > > tracking global space usage? There's good reason for that - taking a > > spinlock in such a hot accounting path causes severe contention. .... > > Then puts the entire netlink send path inside this spinlock, which > > includes memory allocation and all sorts of non-filesystem code > > paths. And it may be inside critical filesystem locks as well.... > > > > Apart from the serialisation problem of the locking, adding > > memory allocation and the network send path to filesystem code > > that is effectively considered "innermost" filesystem code is going > > to have all sorts of problems for various filesystems. In the XFS > > case, we simply cannot execute this sort of function in the places > > where we update global space accounting. > > > > As it is, I think the basic concept of separate tracking of free > > space if fundamentally flawed. 
What I think needs to be done is that > > filesystems need access to the thresholds for events, and then the > > filesystems call fs_event_send_thresh() themselves from appropriate > > contexts (ie. without compromising locking, scalability, memory > > allocation recursion constraints, etc). > > > > e.g. instead of tracking every change in free space, a filesystem > > might execute this once every few seconds from a workqueue: > > > > event = fs_event_need_space_warning(sb, ) > > if (event) > > fs_event_send_thresh(sb, event); > > > > User still gets warnings about space usage, but there's no runtime > > overhead or problems with lock/memory allocation contexts, etc. > > Having fs to keep a firm hand on thresholds limits would indeed be > far more sane approach though that would require each fs to > add support for that and handle most of it on their own. Avoiding >> this was the main rationale behind this rfc. > If fs people agree to that, I'll be more than willing to drop this > in favour of the per-fs tracking solution. > Personally, I hope they will. I was hoping that you'd think a little more about my suggestion and work out how to do background threshold event detection generically. I kind of left it as "an exercise for the reader" because it seems obvious to me. Hint: ->statfs allows you to get the total, free and used space from filesystems in a generic manner. Cheers, Dave. 
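The hint above has a direct userspace analogue: statvfs(3) returns the same total, free and available block counts that a filesystem's ->statfs method reports inside the kernel. A minimal sketch, assuming a POSIX system; the struct space_info type and the get_space_info() helper are made-up names for illustration and are not part of the patch set:

```c
#include <sys/statvfs.h>

/* Hypothetical helper type for this sketch; not part of the patch set. */
struct space_info {
	unsigned long long total;	/* total blocks, in f_frsize units */
	unsigned long long free;	/* free blocks, including reserved ones */
	unsigned long long avail;	/* blocks available to unprivileged users */
	unsigned long long used;	/* total - free */
};

/*
 * Fill *si for the filesystem that contains path.
 * Returns 0 on success, -1 on failure (errno is set by statvfs()).
 */
static int get_space_info(const char *path, struct space_info *si)
{
	struct statvfs st;

	if (statvfs(path, &st) != 0)
		return -1;

	si->total = st.f_blocks;
	si->free = st.f_bfree;
	si->avail = st.f_bavail;
	si->used = st.f_blocks - st.f_bfree;
	return 0;
}
```

A caller would pass a mount point, for example get_space_info("/", &si), and compare si.avail against the configured threshold levels.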
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Message-id: <5584512B.5020301@samsung.com> Date: Fri, 19 Jun 2015 19:28:11 +0200 From: Beata Michalska To: Dave Chinner Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications In-reply-to: <20150619000341.GM10224@dastard>
On 06/19/2015 02:03 AM, Dave Chinner wrote: > On Thu, Jun 18, 2015 at 10:25:08AM +0200, Beata Michalska wrote: >> On 06/18/2015 01:06 AM, Dave Chinner wrote: >>> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: >>>> Introduce configurable generic interface for file >>>> system-wide event notifications, to provide file >>>> systems with a common way of reporting any potential >>>> issues as they emerge. >>>> >>>> The notifications are to be issued through generic >>>> netlink interface by newly introduced multicast group. >>>> >>>> Threshold notifications have been included, allowing >>>> triggering an event whenever the amount of free space drops >>>> below a certain level - or levels to be more precise as two >>>> of them are being supported: the lower and the upper range. >>>> The notifications work both ways: once the threshold level >>>> has been reached, an event shall be generated whenever >>>> the number of available blocks goes up again re-activating >>>> the threshold. >>>> >>>> The interface has been exposed through a vfs. Once mounted, >>>> it serves as an entry point for the set-up where one can >>>> register for particular file system events. >>>> >>>> Signed-off-by: Beata Michalska >>> >>> This has massive scalability problems: > .... >>> Have you noticed that the filesystems have percpu counters for >>> tracking global space usage? There's good reason for that - taking a >>> spinlock in such a hot accounting path causes severe contention. > ....
>>> Then puts the entire netlink send path inside this spinlock, which >>> includes memory allocation and all sorts of non-filesystem code >>> paths. And it may be inside critical filesystem locks as well.... >>> >>> Apart from the serialisation problem of the locking, adding >>> memory allocation and the network send path to filesystem code >>> that is effectively considered "innermost" filesystem code is going >>> to have all sorts of problems for various filesystems. In the XFS >>> case, we simply cannot execute this sort of function in the places >>> where we update global space accounting. >>> >>> As it is, I think the basic concept of separate tracking of free >>> space if fundamentally flawed. What I think needs to be done is that >>> filesystems need access to the thresholds for events, and then the >>> filesystems call fs_event_send_thresh() themselves from appropriate >>> contexts (ie. without compromising locking, scalability, memory >>> allocation recursion constraints, etc). >>> >>> e.g. instead of tracking every change in free space, a filesystem >>> might execute this once every few seconds from a workqueue: >>> >>> event = fs_event_need_space_warning(sb, ) >>> if (event) >>> fs_event_send_thresh(sb, event); >>> >>> User still gets warnings about space usage, but there's no runtime >>> overhead or problems with lock/memory allocation contexts, etc. >> >> Having fs to keep a firm hand on thresholds limits would indeed be >> far more sane approach though that would require each fs to >> add support for that and handle most of it on their own. Avoiding >>> this was the main rationale behind this rfc. >> If fs people agree to that, I'll be more than willing to drop this >> in favour of the per-fs tracking solution. >> Personally, I hope they will. > > I was hoping that you'd think a little more about my suggestion and > work out how to do background threshold event detection generically. 
> I kind of left it as "an exercise for the reader" because it seems > obvious to me. > > Hint: ->statfs allows you to get the total, free and used space > from filesystems in a generic manner. > > Cheers, > > Dave. > I haven't given up on that, so yes, I'm still working on a more suitable generic solution. Background detection is one of the options, though it needs some more thought. Giving up the sync approach means less accuracy for the threshold notifications, but I guess this could be fine-tuned to get an acceptable level. Another bump: how is this tuning supposed to be done (additional config option maybe)? The interface would have to keep it somehow sane - but what would 'sane' mean in this case? Also, I'm not sure whether a single approach would serve well here for all the potentially supported file systems, so this would have to be properly adjusted (taking the threshold levels into consideration as well). And still, it would require some form of synchronization with the tracked fs so that this 'detection' is not being unnecessarily performed (e.g. while the fs remains frozen). There is also an idea of using an interface resembling the stackable fs: a transparent file system layered on top of the tracked one (solely for the tracking purposes). This would simplify handling the trace object's lifetime - no more list of registered traces. It would also give a way of tracking (to some extent) the changes in the amount of available space, which combined with a tweaked background check could give a solution with less performance overhead than the original one. I'll try this one and see how it goes. Thank You for your feedback so far - I really appreciate it.
Best Regards Beata From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Sat, 20 Jun 2015 09:21:17 +1000 From: Dave Chinner To: Beata Michalska Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications Message-ID: <20150619232117.GN10224@dastard> In-Reply-To: <5584512B.5020301@samsung.com> On Fri, Jun 19, 2015 at 07:28:11PM +0200, Beata Michalska wrote: > On 06/19/2015 02:03 AM, Dave Chinner wrote: > > On Thu, Jun 18, 2015 at 10:25:08AM +0200, 
Beata Michalska wrote: > >> On 06/18/2015 01:06 AM, Dave Chinner wrote: > >>> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: > >>>> Introduce configurable generic interface for file > >>>> system-wide event notifications, to provide file > >>>> systems with a common way of reporting any potential > >>>> issues as they emerge. > >>>> > >>>> The notifications are to be issued through generic > >>>> netlink interface by newly introduced multicast group. > >>>> > >>>> Threshold notifications have been included, allowing > >>>> triggering an event whenever the amount of free space drops > >>>> below a certain level - or levels to be more precise as two > >>>> of them are being supported: the lower and the upper range. > >>>> The notifications work both ways: once the threshold level > >>>> has been reached, an event shall be generated whenever > >>>> the number of available blocks goes up again re-activating > >>>> the threshold. > >>>> > >>>> The interface has been exposed through a vfs. Once mounted, > >>>> it serves as an entry point for the set-up where one can > >>>> register for particular file system events. > >>>> > >>>> Signed-off-by: Beata Michalska > >>> > >>> This has massive scalability problems: > > .... > >>> Have you noticed that the filesystems have percpu counters for > >>> tracking global space usage? There's good reason for that - taking a > >>> spinlock in such a hot accounting path causes severe contention. > > .... > >>> Then puts the entire netlink send path inside this spinlock, which > >>> includes memory allocation and all sorts of non-filesystem code > >>> paths. And it may be inside critical filesystem locks as well.... > >>> > >>> Apart from the serialisation problem of the locking, adding > >>> memory allocation and the network send path to filesystem code > >>> that is effectively considered "innermost" filesystem code is going > >>> to have all sorts of problems for various filesystems. 
In the XFS > >>> case, we simply cannot execute this sort of function in the places > >>> where we update global space accounting. > >>> > >>> As it is, I think the basic concept of separate tracking of free > >>> space if fundamentally flawed. What I think needs to be done is that > >>> filesystems need access to the thresholds for events, and then the > >>> filesystems call fs_event_send_thresh() themselves from appropriate > >>> contexts (ie. without compromising locking, scalability, memory > >>> allocation recursion constraints, etc). > >>> > >>> e.g. instead of tracking every change in free space, a filesystem > >>> might execute this once every few seconds from a workqueue: > >>> > >>> event = fs_event_need_space_warning(sb, ) > >>> if (event) > >>> fs_event_send_thresh(sb, event); > >>> > >>> User still gets warnings about space usage, but there's no runtime > >>> overhead or problems with lock/memory allocation contexts, etc. > >> > >> Having fs to keep a firm hand on thresholds limits would indeed be > >> far more sane approach though that would require each fs to > >> add support for that and handle most of it on their own. Avoiding > >>> this was the main rationale behind this rfc. > >> If fs people agree to that, I'll be more than willing to drop this > >> in favour of the per-fs tracking solution. > >> Personally, I hope they will. > > > > I was hoping that you'd think a little more about my suggestion and > > work out how to do background threshold event detection generically. > > I kind of left it as "an exercise for the reader" because it seems > > obvious to me. > > > > Hint: ->statfs allows you to get the total, free and used space > > from filesystems in a generic manner. > > > > Cheers, > > > > Dave. > > > > I haven't given up on that, so yes, I'm still working on a more suitable > generic solution. > Background detection is one of the options, though it needs some more thoughts. 
> Giving up the sync approach means less accuracy for the threshold notifications, > but I guess this could be fine-tuned to get an acceptable level. Accuracy really doesn't matter for threshold notifications - by the time the event is delivered to userspace it can already be wrong. > Another bump: > how this tuning is supposed to be done (additional config option maybe)? Why would you need to tune it at all? You can't *stop* the operation that is triggering the threshold, so a few seconds delay on delivery isn't going to make any difference to anyone.... You're overthinking this massively. All this needs is a work item per superblock, and when the thresholds are turned on it queues a self-repeating delayed work that calls ->statfs, checks against the configured threshold, issues an event if necessary, and then queues itself again to run next period. When the threshold is turned off, the work is cancelled. Another option: a kernel thread that runs periodically and just calls iterate_supers() with a function that checks the sb for threshold events, and if configured runs ->statfs and does the work, otherwise skips the sb. That avoids all the lifetime issues with using workqueues, you don't need a struct work, etc. > There is also an idea of using an interface resembling the stackable fs: No. Just .... No. Cheers, Dave. 
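The periodic check described above can be modelled in plain C as a small state machine evaluated once per polling period. This is only a sketch under stated assumptions: thresh_state, thresh_poll() and the EV_* bits are hypothetical names rather than identifiers from the patch, and the "cleared" events are one reading of the "notifications work both ways" wording in the patch description. In the kernel, the avail value would come from ->statfs and the function would be called from the delayed work item or the kernel thread.

```c
#include <stdbool.h>

/* Event bits; hypothetical names, not the event codes used by the patch. */
enum {
	EV_LOWER_REACHED = 1 << 0,	/* avail dropped below L1 */
	EV_UPPER_REACHED = 1 << 1,	/* avail dropped below L2 */
	EV_LOWER_CLEARED = 1 << 2,	/* avail is back at or above L1 */
	EV_UPPER_CLEARED = 1 << 3,	/* avail is back at or above L2 */
};

/*
 * Per-filesystem threshold state. Following the documentation in the
 * patch, both levels count available blocks, so the "lower" level L1
 * is the larger number and is crossed first as space runs out.
 */
struct thresh_state {
	unsigned long long l1;	/* lower level (larger block count) */
	unsigned long long l2;	/* upper level (smaller block count) */
	bool l1_hit;		/* L1 already reported, not yet re-armed */
	bool l2_hit;		/* L2 already reported, not yet re-armed */
};

/* Update one level and return the event bit to raise, or 0 for none. */
static unsigned int level_poll(bool *hit, unsigned long long level,
			       unsigned long long avail,
			       unsigned int reached, unsigned int cleared)
{
	if (!*hit && avail < level) {
		*hit = true;
		return reached;
	}
	if (*hit && avail >= level) {
		*hit = false;	/* re-arm the threshold */
		return cleared;
	}
	return 0;
}

/* Called once per polling period with the current available block count. */
static unsigned int thresh_poll(struct thresh_state *t,
				unsigned long long avail)
{
	return level_poll(&t->l1_hit, t->l1, avail,
			  EV_LOWER_REACHED, EV_LOWER_CLEARED) |
	       level_poll(&t->l2_hit, t->l2, avail,
			  EV_UPPER_REACHED, EV_UPPER_CLEARED);
}
```

With L1 = 710000 and L2 = 500000 (the sample values from the documentation), a drop from 800000 to 700000 available blocks reports EV_LOWER_REACHED once; further polls below L1 stay silent until the count climbs back to L1 or above, which is why a few seconds between polls costs nothing in correctness.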
-- Dave Chinner david@fromorbit.com From mboxrd@z Thu Jan 1 00:00:00 1970 Message-id: <55882DD3.5040002@samsung.com> Date: Mon, 22 Jun 2015 17:46:27 +0200 From: Beata Michalska To: Dave Chinner Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications In-reply-to: <20150619232117.GN10224@dastard>
On 06/20/2015 01:21 AM, Dave Chinner wrote: > On Fri, Jun 19, 2015 at 07:28:11PM +0200, Beata Michalska wrote: >> On 06/19/2015 02:03 AM, Dave Chinner wrote: >>> On Thu, Jun 18, 2015 at 10:25:08AM +0200, Beata Michalska wrote: >>>> On 06/18/2015 01:06 AM, Dave Chinner wrote: >>>>> On Tue, Jun 16, 2015 at 03:09:30PM +0200, Beata Michalska wrote: >>>>>> Introduce configurable generic interface for file >>>>>> system-wide event notifications, to provide file >>>>>> systems with a common way of reporting any potential >>>>>> issues as they emerge. >>>>>> >>>>>> The notifications are to be issued through generic >>>>>> netlink interface by newly introduced multicast group. >>>>>> >>>>>> Threshold notifications have been included, allowing >>>>>> triggering an event whenever the amount of free space drops >>>>>> below a certain level - or levels to be more precise as two >>>>>> of them are being supported: the lower and the upper range. >>>>>> The notifications work both ways: once the threshold level >>>>>> has been reached, an event shall be generated whenever >>>>>> the number of available blocks goes up again re-activating >>>>>> the threshold. >>>>>> >>>>>> The interface has been exposed through a vfs.
Once mounted, >>>>>> it serves as an entry point for the set-up where one can >>>>>> register for particular file system events. >>>>>> >>>>>> Signed-off-by: Beata Michalska >>>>> >>>>> This has massive scalability problems: >>> .... >>>>> Have you noticed that the filesystems have percpu counters for >>>>> tracking global space usage? There's good reason for that - taking a >>>>> spinlock in such a hot accounting path causes severe contention. >>> .... >>>>> Then puts the entire netlink send path inside this spinlock, which >>>>> includes memory allocation and all sorts of non-filesystem code >>>>> paths. And it may be inside critical filesystem locks as well.... >>>>> >>>>> Apart from the serialisation problem of the locking, adding >>>>> memory allocation and the network send path to filesystem code >>>>> that is effectively considered "innermost" filesystem code is going >>>>> to have all sorts of problems for various filesystems. In the XFS >>>>> case, we simply cannot execute this sort of function in the places >>>>> where we update global space accounting. >>>>> >>>>> As it is, I think the basic concept of separate tracking of free >>>>> space if fundamentally flawed. What I think needs to be done is that >>>>> filesystems need access to the thresholds for events, and then the >>>>> filesystems call fs_event_send_thresh() themselves from appropriate >>>>> contexts (ie. without compromising locking, scalability, memory >>>>> allocation recursion constraints, etc). >>>>> >>>>> e.g. instead of tracking every change in free space, a filesystem >>>>> might execute this once every few seconds from a workqueue: >>>>> >>>>> event = fs_event_need_space_warning(sb, ) >>>>> if (event) >>>>> fs_event_send_thresh(sb, event); >>>>> >>>>> User still gets warnings about space usage, but there's no runtime >>>>> overhead or problems with lock/memory allocation contexts, etc. 
>>>> >>>> Having fs to keep a firm hand on thresholds limits would indeed be >>>> far more sane approach though that would require each fs to >>>> add support for that and handle most of it on their own. Avoiding >>>>> this was the main rationale behind this rfc. >>>> If fs people agree to that, I'll be more than willing to drop this >>>> in favour of the per-fs tracking solution. >>>> Personally, I hope they will. >>> >>> I was hoping that you'd think a little more about my suggestion and >>> work out how to do background threshold event detection generically. >>> I kind of left it as "an exercise for the reader" because it seems >>> obvious to me. >>> >>> Hint: ->statfs allows you to get the total, free and used space >>> from filesystems in a generic manner. >>> >>> Cheers, >>> >>> Dave. >>> >> >> I haven't given up on that, so yes, I'm still working on a more suitable >> generic solution. >> Background detection is one of the options, though it needs some more thoughts. >> Giving up the sync approach means less accuracy for the threshold notifications, >> but I guess this could be fine-tuned to get an acceptable level. > > Accuracy really doesn't matter for threshold notifications - by the > time the event is delivered to userspace it can already be wrong. > >> Another bump: >> how this tuning is supposed to be done (additional config option maybe)? > > Why would you need to tune it at all? You can't *stop* the operation > that is triggering the threshold, so a few seconds delay on delivery > isn't going to make any difference to anyone.... > > You're overthinking this massively. All this needs is a work item > per superblock, and when the thresholds are turned on it queues a > self-repeating delayed work that calls ->statfs, checks against the > configured threshold, issues an event if necessary, and then queues > itself again to run next period. When the threshold is turned off, > the work is cancelled. 
> > Another option: a kernel thread that runs periodically and just > calls iterate_supers() with a function that checks the sb for > threshold events, and if configured runs ->statfs and does the work, > otherwise skips the sb. That avoids all the lifetime issues with > using workqueues, you don't need a struct work, etc. > >> There is also an idea of using an interface resembling the stackable fs: > > No. Just .... No. > > Cheers, > > Dave. > Alright, I'll make appropriate changes to move the threshold verification into the background and see how it works. Thanks, Best Regards Beata From mboxrd@z Thu Jan 1 00:00:00 1970 Message-id: <558ACD3A.2020508@samsung.com> Date: Wed, 24 Jun 2015 17:31:06 +0200 From: Beata Michalska To: Dmitry Monakhov Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications In-reply-to: <87oak5ebmx.fsf@openvz.org>
On 06/24/2015 10:47 AM, Dmitry Monakhov wrote: > Beata Michalska writes: > >> Introduce configurable generic interface for file >> system-wide event notifications, to provide file >> systems with a common way of reporting any potential >> issues as they emerge. >> >> The notifications are to be issued through generic >> netlink interface by newly introduced multicast group. >> >> Threshold notifications have been included, allowing >> triggering an event whenever the amount of free space drops >> below a certain level - or levels to be more precise as two >> of them are being supported: the lower and the upper range. >> The notifications work both ways: once the threshold level >> has been reached, an event shall be generated whenever >> the number of available blocks goes up again re-activating >> the threshold. >> >> The interface has been exposed through a vfs.
Once mounted, >> it serves as an entry point for the set-up where one can >> register for particular file system events. >> >> Signed-off-by: Beata Michalska >> --- >> Documentation/filesystems/events.txt | 232 ++++++++++ >> fs/Kconfig | 2 + >> fs/Makefile | 1 + >> fs/events/Kconfig | 7 + >> fs/events/Makefile | 5 + >> fs/events/fs_event.c | 809 ++++++++++++++++++++++++++++++++++ >> fs/events/fs_event.h | 22 + >> fs/events/fs_event_netlink.c | 104 +++++ >> fs/namespace.c | 1 + >> include/linux/fs.h | 6 +- >> include/linux/fs_event.h | 72 +++ >> include/uapi/linux/Kbuild | 1 + >> include/uapi/linux/fs_event.h | 58 +++ >> 13 files changed, 1319 insertions(+), 1 deletion(-) >> create mode 100644 Documentation/filesystems/events.txt >> create mode 100644 fs/events/Kconfig >> create mode 100644 fs/events/Makefile >> create mode 100644 fs/events/fs_event.c >> create mode 100644 fs/events/fs_event.h >> create mode 100644 fs/events/fs_event_netlink.c >> create mode 100644 include/linux/fs_event.h >> create mode 100644 include/uapi/linux/fs_event.h >> >> diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt >> new file mode 100644 >> index 0000000..c2e6227 >> --- /dev/null >> +++ b/Documentation/filesystems/events.txt >> @@ -0,0 +1,232 @@ >> + >> + Generic file system event notification interface >> + >> +Document created 23 April 2015 by Beata Michalska >> + >> +1. The reason behind: >> +===================== >> + >> +There are many corner cases when things might get messy with the filesystems. >> +And it is not always obvious what and when went wrong. Sometimes you might >> +get some subtle hints that there is something going on - but by the time >> +you realise it, it might be too late as you are already out-of-space >> +or the filesystem has been remounted as read-only (i.e.). 
The generic >> +interface for the filesystem events fills the gap by providing a rather >> +easy way of real-time notifications triggered whenever something interesting >> +happens, allowing filesystems to report events in a common way, as they occur. >> + >> +2. How does it work: >> +==================== >> + >> +The interface itself has been exposed as fstrace-type Virtual File System, >> +primarily to ease the process of setting up the configuration for the >> +notifications. So for starters, it needs to get mounted (obviously): >> + >> + mount -t fstrace none /sys/fs/events >> + >> +This will unveil the single fstrace filesystem entry - the 'config' file, >> +through which the notification are being set-up. >> + >> +Activating notifications for particular filesystem is as straightforward >> +as writing into the 'config' file. Note that by default all events, despite >> +the actual filesystem type, are being disregarded. >> + >> +Synopsis of config: >> +------------------ >> + >> + MOUNT EVENT_TYPE [L1] [L2] >> + >> + MOUNT : the filesystem's mount point >> + EVENT_TYPE : event types - currently two of them are being supported: >> + >> + * generic events ("G") covering most common warnings >> + and errors that might be reported by any filesystem; >> + this option does not take any arguments; >> + >> + * threshold notifications ("T") - events sent whenever >> + the amount of available space drops below certain level; >> + it is possible to specify two threshold levels though >> + only one is required to properly setup the notifications; >> + as those refer to the number of available blocks, the lower >> + level [L1] needs to be higher than the upper one [L2] >> + >> +Sample request could look like the following: >> + >> + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config >> + >> +Multiple request might be specified provided they are separated with semicolon. >> + >> +The configuration itself might be modified at any time. 
One can add/remove >> +particular event types for given filesystem, modify the threshold levels, >> +and remove single or all entries from the 'config' file. >> + >> + - Adding new event type: >> + >> + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config >> + >> +(Note that it is enough to provide the event type to be enabled without >> +the already set ones.) >> + >> + - Removing event type: >> + >> + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config >> + >> + - Updating threshold limits: >> + >> + $ echo MOUNT T L1 L2 > /sys/fs/events/config >> + >> + - Removing single entry: >> + >> + $ echo '!MOUNT' > /sys/fs/events/config >> + >> + - Removing all entries: >> + >> + $ echo > /sys/fs/events/config >> + >> +Reading the file will list all registered entries with their current set-up >> +along with some additional info like the filesystem type and the backing device >> +name if available. >> + >> +A final, though very important note on the configuration: when and if the >> +actual events are being triggered falls way beyond the scope of the generic >> +filesystem events interface. It is up to a particular filesystem >> +implementation which events are to be supported - if any at all. So if >> +given filesystem does not support the event notifications, an attempt to >> +enable those through 'config' file will fail. >> + >> + >> +3. The generic netlink interface support: >> +========================================= >> + >> +Whenever an event notification is triggered (by given filesystem) the current >> +configuration is being validated to decide whether a userspace notification >> +should be launched. If there has been no request (by means of a 'config' file >> +entry) for given event, one will be silently disregarded. If, on the other >> +hand, someone is 'watching' given filesystem for specific events, a generic >> +netlink message will be sent.
A dedicated multicast group has been provided >> +solely for this purpose so in order to receive such notifications, one should >> +subscribe to this new multicast group. As for now only the init network >> +namespace is being supported. >> + >> +3.1 Message format >> + >> +The FS_NL_C_EVENT shall be stored within the generic netlink message header >> +as the command field. The message payload will provide more detailed info: >> +the backing device major and minor numbers, the event code and the id of >> +the process which action led to the event occurrence. In case of threshold >> +notifications, the current number of available blocks will be included >> +in the payload as well. >> + >> + >> + 0 1 2 3 >> + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | NETLINK MESSAGE HEADER | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | GENERIC NETLINK MESSAGE HEADER | >> + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | Optional user specific message header | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | GENERIC MESSAGE PAYLOAD: | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_EVENT_ID (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DEV_MAJOR (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DEV_MINOR (NLA_U32) | > ... >> + >> +static int create_common_msg(struct sk_buff *skb, void *data) >> +{ >> + struct fs_trace_entry *en = (struct fs_trace_entry *)data; >> + struct super_block *sb = en->sb; >> + >> + if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev)) >> + || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev))) >> + return -EINVAL; > What about diskless(nfs,cifs,etc) filesystem? 
btrfs also has no
> valid sb->s_dev

Those are using the anon ids, generated by get_anon_bdev (through
set_anon_super). This id will be visible in /proc/self/mountinfo or through
stat, e.g.:

30 22 0:21 / /root/fake_fs/btrfs rw,realtime - btrfs /dev/loop4 rw,nospace_cache

Best Regards
Beata

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751930AbbFZHbH (ORCPT ); Fri, 26 Jun 2015 03:31:07 -0400
Received: from mailout1.w1.samsung.com ([210.118.77.11]:21710 "EHLO mailout1.w1.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751988AbbFZHax (ORCPT ); Fri, 26 Jun 2015 03:30:53 -0400
X-AuditID: cbfec7f4-f79c56d0000012ee-b0-558cffa8d4c3
Message-id: <558CFF9C.20700@samsung.com>
Date: Fri, 26 Jun 2015 09:30:36 +0200
From: Beata Michalska
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130804 Thunderbird/17.0.8
MIME-version: 1.0
To: Steve French
Cc: Dmitry Monakhov, LKML, linux-fsdevel, "linux-api@vger.kernel.org", Greg Kroah-Hartman, Jan Kara, "Theodore Ts'o", adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, Christoph Hellwig, "linux-ext4@vger.kernel.org", linux-mm, kyungmin.park@samsung.com, kmpark@infradead.org
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <87oak5ebmx.fsf@openvz.org> <558ACD3A.2020508@samsung.com>
In-reply-to:
Content-type: text/plain; charset=UTF-8
Content-transfer-encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/24/2015 06:26 PM, Steve French wrote:
> On Wed, Jun 24, 2015 at 10:31 AM, Beata Michalska
> wrote:
>> On 06/24/2015 10:47 AM, Dmitry Monakhov wrote:
>>> Beata Michalska writes:
>>>
>>>> Introduce configurable generic interface for file
>>>> system-wide event notifications, to provide file
>>>> systems with a common way of reporting any potential
>>>> issues as they emerge.
>>>>
>>>> The notifications are to be issued through generic
>>>> netlink interface by newly introduced multicast group.
>>>>
>>>> Threshold notifications have been included, allowing
>>>> triggering an event whenever the amount of free space drops
>>>> below a certain level - or levels to be more precise as two
>>>> of them are being supported: the lower and the upper range.
>>>> The notifications work both ways: once the threshold level
>>>> has been reached, an event shall be generated whenever
>>>> the number of available blocks goes up again re-activating
>>>> the threshold.
>>>>
>>>> The interface has been exposed through a vfs. Once mounted,
>>>> it serves as an entry point for the set-up where one can
>>>> register for particular file system events.
>>>> >>>> Signed-off-by: Beata Michalska >>>> --- >>>> Documentation/filesystems/events.txt | 232 ++++++++++ >>>> fs/Kconfig | 2 + >>>> fs/Makefile | 1 + >>>> fs/events/Kconfig | 7 + >>>> fs/events/Makefile | 5 + >>>> fs/events/fs_event.c | 809 ++++++++++++++++++++++++++++++++++ >>>> fs/events/fs_event.h | 22 + >>>> fs/events/fs_event_netlink.c | 104 +++++ >>>> fs/namespace.c | 1 + >>>> include/linux/fs.h | 6 +- >>>> include/linux/fs_event.h | 72 +++ >>>> include/uapi/linux/Kbuild | 1 + >>>> include/uapi/linux/fs_event.h | 58 +++ >>>> 13 files changed, 1319 insertions(+), 1 deletion(-) >>>> create mode 100644 Documentation/filesystems/events.txt >>>> create mode 100644 fs/events/Kconfig >>>> create mode 100644 fs/events/Makefile >>>> create mode 100644 fs/events/fs_event.c >>>> create mode 100644 fs/events/fs_event.h >>>> create mode 100644 fs/events/fs_event_netlink.c >>>> create mode 100644 include/linux/fs_event.h >>>> create mode 100644 include/uapi/linux/fs_event.h >>>> >>>> diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt >>>> new file mode 100644 >>>> index 0000000..c2e6227 >>>> --- /dev/null >>>> +++ b/Documentation/filesystems/events.txt >>>> @@ -0,0 +1,232 @@ >>>> + >>>> + Generic file system event notification interface >>>> + >>>> +Document created 23 April 2015 by Beata Michalska >>>> + >>>> +1. The reason behind: >>>> +===================== >>>> + >>>> +There are many corner cases when things might get messy with the filesystems. >>>> +And it is not always obvious what and when went wrong. Sometimes you might >>>> +get some subtle hints that there is something going on - but by the time >>>> +you realise it, it might be too late as you are already out-of-space >>>> +or the filesystem has been remounted as read-only (i.e.). 
The generic >>>> +interface for the filesystem events fills the gap by providing a rather >>>> +easy way of real-time notifications triggered whenever something interesting >>>> +happens, allowing filesystems to report events in a common way, as they occur. >>>> + >>>> +2. How does it work: >>>> +==================== >>>> + >>>> +The interface itself has been exposed as fstrace-type Virtual File System, >>>> +primarily to ease the process of setting up the configuration for the >>>> +notifications. So for starters, it needs to get mounted (obviously): >>>> + >>>> + mount -t fstrace none /sys/fs/events >>>> + >>>> +This will unveil the single fstrace filesystem entry - the 'config' file, >>>> +through which the notification are being set-up. >>>> + >>>> +Activating notifications for particular filesystem is as straightforward >>>> +as writing into the 'config' file. Note that by default all events, despite >>>> +the actual filesystem type, are being disregarded. >>>> + >>>> +Synopsis of config: >>>> +------------------ >>>> + >>>> + MOUNT EVENT_TYPE [L1] [L2] >>>> + >>>> + MOUNT : the filesystem's mount point >>>> + EVENT_TYPE : event types - currently two of them are being supported: >>>> + >>>> + * generic events ("G") covering most common warnings >>>> + and errors that might be reported by any filesystem; >>>> + this option does not take any arguments; >>>> + >>>> + * threshold notifications ("T") - events sent whenever >>>> + the amount of available space drops below certain level; >>>> + it is possible to specify two threshold levels though >>>> + only one is required to properly setup the notifications; >>>> + as those refer to the number of available blocks, the lower >>>> + level [L1] needs to be higher than the upper one [L2] >>>> + >>>> +Sample request could look like the following: >>>> + >>>> + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config >>>> + >>>> +Multiple request might be specified provided they are separated with semicolon. 
>>>> + >>>> +The configuration itself might be modified at any time. One can add/remove >>>> +particular event types for given fielsystem, modify the threshold levels, >>>> +and remove single or all entries from the 'config' file. >>>> + >>>> + - Adding new event type: >>>> + >>>> + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config >>>> + >>>> +(Note that is is enough to provide the event type to be enabled without >>>> +the already set ones.) >>>> + >>>> + - Removing event type: >>>> + >>>> + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config >>>> + >>>> + - Updating threshold limits: >>>> + >>>> + $ echo MOUNT T L1 L2 > /sys/fs/events/config >>>> + >>>> + - Removing single entry: >>>> + >>>> + $ echo '!MOUNT' > /sys/fs/events/config >>>> + >>>> + - Removing all entries: >>>> + >>>> + $ echo > /sys/fs/events/config >>>> + >>>> +Reading the file will list all registered entries with their current set-up >>>> +along with some additional info like the filesystem type and the backing device >>>> +name if available. >>>> + >>>> +Final, though a very important note on the configuration: when and if the >>>> +actual events are being triggered falls way beyond the scope of the generic >>>> +filesystem events interface. It is up to a particular filesystem >>>> +implementation which events are to be supported - if any at all. So if >>>> +given filesystem does not support the event notifications, an attempt to >>>> +enable those through 'config' file will fail. >>>> + >>>> + >>>> +3. The generic netlink interface support: >>>> +========================================= >>>> + >>>> +Whenever an event notification is triggered (by given filesystem) the current >>>> +configuration is being validated to decide whether a userpsace notification >>>> +should be launched. If there has been no request (in a mean of 'config' file >>>> +entry) for given event, one will be silently disregarded. 
If, on the other >>>> +hand, someone is 'watching' given filesystem for specific events, a generic >>>> +netlink message will be sent. A dedicated multicast group has been provided >>>> +solely for this purpose so in order to receive such notifications, one should >>>> +subscribe to this new multicast group. As for now only the init network >>>> +namespace is being supported. >>>> + >>>> +3.1 Message format >>>> + >>>> +The FS_NL_C_EVENT shall be stored within the generic netlink message header >>>> +as the command field. The message payload will provide more detailed info: >>>> +the backing device major and minor numbers, the event code and the id of >>>> +the process which action led to the event occurrence. In case of threshold >>>> +notifications, the current number of available blocks will be included >>>> +in the payload as well. >>>> + >>>> + >>>> + 0 1 2 3 >>>> + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 >>>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >>>> + | NETLINK MESSAGE HEADER | >>>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >>>> + | GENERIC NETLINK MESSAGE HEADER | >>>> + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | >>>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >>>> + | Optional user specific message header | >>>> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >>>> + | GENERIC MESSAGE PAYLOAD: | >>>> + +---------------------------------------------------------------+ >>>> + | FS_NL_A_EVENT_ID (NLA_U32) | >>>> + +---------------------------------------------------------------+ >>>> + | FS_NL_A_DEV_MAJOR (NLA_U32) | >>>> + +---------------------------------------------------------------+ >>>> + | FS_NL_A_DEV_MINOR (NLA_U32) | >>> >> ... 
>>
>>>> +
>>>> +static int create_common_msg(struct sk_buff *skb, void *data)
>>>> +{
>>>> +	struct fs_trace_entry *en = (struct fs_trace_entry *)data;
>>>> +	struct super_block *sb = en->sb;
>>>> +
>>>> +	if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev))
>>>> +	    || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev)))
>>>> +		return -EINVAL;
>>> What about diskless(nfs,cifs,etc) filesystem? btrfs also has no
>>> valid sb->s_dev
>
> And note that filesystem notifications and also file/directory change
> notification are particularly useful in the case of a network file
> system (and heavily used by Windows desktop, Mac etc.) since when a
> file is shared a user may not necessarily know that a file (or file
> system as a whole) changed via another client (or on the server, or on
> the server via a different protocol e.g. SMB3 vs NFSv4), but is more
> likely to know about local changes to the same file. In some sense
> the users of mounts on network file systems get more benefit from
> notifications than a mount on a local file system would.
>

As for the network file systems...
As has been pointed out, there are some serious scalability/performance
issues with the current version of the events interface. As has also been
suggested, I plan to change the way the threshold notifications are handled,
so that the amount of available space is tracked by querying the file systems
for an update. Thus I'm wondering whether this will result in yet another
issue for the network file systems, since for them handling such a query
means asking the server for an update (there is basically no caching on the
client side).
Best Regards
Beata

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934057AbbGVKk3 (ORCPT ); Wed, 22 Jul 2015 06:40:29 -0400
Received: from mailout4.samsung.com ([203.254.224.34]:39048 "EHLO mailout4.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933440AbbGVKk0 (ORCPT ); Wed, 22 Jul 2015 06:40:26 -0400
X-AuditID: cbfee61a-f79516d000006302-93-55af7318b354
From: Bartlomiej Zolnierkiewicz
To: Beata Michalska, tytso@mit.edu
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org
Subject: Re: [RFC v3 2/4] ext4: Add helper function to mark group as corrupted
Date: Wed, 22 Jul 2015 12:40:03 +0200
Message-id: <3417027.tdShitEpvE@amdc1976>
User-Agent: KMail/4.13.3 (Linux/3.13.0-57-generic; KDE/4.13.3; x86_64; ; )
In-reply-to: <1434460173-18427-3-git-send-email-b.michalska@samsung.com>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-3-git-send-email-b.michalska@samsung.com>
MIME-version: 1.0
Content-transfer-encoding: 7Bit
Content-type: text/plain; charset=us-ascii
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

On Tuesday, June 16, 2015 03:09:31 PM Beata Michalska wrote:
> Add ext4_mark_group_corrupted helper function to
> simplify the code and to keep the logic in one place.
>
> Signed-off-by: Beata Michalska

This small cleanup patch is not really required for your notifications
framework to work and it seems to be a good change on its own. Maybe it
can be merged independently of other patches?

Ted, what is your opinion on it?

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung R&D Institute Poland
Samsung Electronics

> ---
>  fs/ext4/balloc.c  | 15 +++------------
>  fs/ext4/ext4.h    |  9 +++++++++
>  fs/ext4/ialloc.c  |  5 +----
>  fs/ext4/mballoc.c | 11 ++---------
>  4 files changed, 15 insertions(+), 25 deletions(-)
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 83a6f49..e95b27a 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -193,10 +193,7 @@ static int ext4_init_block_bitmap(struct super_block *sb,
>  	 * essentially implementing a per-group read-only flag. */
>  	if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) {
>  		grp = ext4_get_group_info(sb, block_group);
> -		if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> -			percpu_counter_sub(&sbi->s_freeclusters_counter,
> -					   grp->bb_free);
> -		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> +		ext4_mark_group_corrupted(sbi, grp);
>  		if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) {
>  			int count;
>  			count = ext4_free_inodes_count(sb, gdp);
> @@ -379,20 +376,14 @@ static void ext4_validate_block_bitmap(struct super_block *sb,
>  		ext4_unlock_group(sb, block_group);
>  		ext4_error(sb, "bg %u: block %llu: invalid block bitmap",
>  			   block_group, blk);
> -		if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> -			percpu_counter_sub(&sbi->s_freeclusters_counter,
> -					   grp->bb_free);
> -		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> +		ext4_mark_group_corrupted(sbi, grp);
>  		return;
>  	}
>  	if (unlikely(!ext4_block_bitmap_csum_verify(sb, block_group,
>  			desc, bh))) {
>  		ext4_unlock_group(sb, block_group);
>  		ext4_error(sb, "bg %u: bad block bitmap checksum", block_group);
> -		if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> -			percpu_counter_sub(&sbi->s_freeclusters_counter,
> -					   grp->bb_free);
> -		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> +		ext4_mark_group_corrupted(sbi, grp);
>  		return;
>  	}
>  	set_buffer_verified(bh);
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index f63c3d5..163afe2 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2535,6 +2535,15 @@ static inline spinlock_t *ext4_group_lock_ptr(struct super_block *sb,
>  	return bgl_lock_ptr(EXT4_SB(sb)->s_blockgroup_lock, group);
>  }
>
> +static inline
> +void ext4_mark_group_corrupted(struct ext4_sb_info *sbi,
> +			       struct ext4_group_info *grp)
> +{
> +	if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> +		percpu_counter_sub(&sbi->s_freeclusters_counter, grp->bb_free);
> +	set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> +}
> +
>  /*
>   * Returns true if the filesystem is busy enough that attempts to
>   * access the block group locks has run into contention.
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index ac644c3..ebe0499 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -79,10 +79,7 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
>  	if (!ext4_group_desc_csum_verify(sb, block_group, gdp)) {
>  		ext4_error(sb, "Checksum bad for group %u", block_group);
>  		grp = ext4_get_group_info(sb, block_group);
> -		if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> -			percpu_counter_sub(&sbi->s_freeclusters_counter,
> -					   grp->bb_free);
> -		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> +		ext4_mark_group_corrupted(sbi, grp);
>  		if (!EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) {
>  			int count;
>  			count = ext4_free_inodes_count(sb, gdp);
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8d1e602..24a4b6d 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -760,10 +760,7 @@ void ext4_mb_generate_buddy(struct super_block *sb,
>  		 * corrupt and update bb_free using bitmap value
>  		 */
>  		grp->bb_free = free;
> -		if (!EXT4_MB_GRP_BBITMAP_CORRUPT(grp))
> -			percpu_counter_sub(&sbi->s_freeclusters_counter,
> -					   grp->bb_free);
> -		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, &grp->bb_state);
> +		ext4_mark_group_corrupted(sbi, grp);
>  	}
>  	mb_set_largest_free_order(sb, grp);
>
> @@ -1448,12 +1445,8 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
>  				      "freeing already freed block "
>  				      "(bit %u); block bitmap corrupt.",
>  				      block);
> -		if (!EXT4_MB_GRP_BBITMAP_CORRUPT(e4b->bd_info))
> -			percpu_counter_sub(&sbi->s_freeclusters_counter,
> -					   e4b->bd_info->bb_free);
>  		/* Mark the block group as corrupt. */
> -		set_bit(EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT,
> -			&e4b->bd_info->bb_state);
> +		ext4_mark_group_corrupted(sbi, e4b->bd_info);
>  		mb_regenerate_buddy(e4b);
>  		goto done;
>  	}

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934622AbbGVP4I (ORCPT ); Wed, 22 Jul 2015 11:56:08 -0400
Received: from mailout3.samsung.com ([203.254.224.33]:54668 "EHLO mailout3.samsung.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934454AbbGVP4A (ORCPT ); Wed, 22 Jul 2015 11:56:00 -0400
X-AuditID: cbfee61b-f79416d0000014c0-9f-55afbd0ddb45
From: Bartlomiej Zolnierkiewicz
To: Beata Michalska
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org
Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications
Date: Wed, 22 Jul 2015 17:55:34 +0200
Message-id: <6913836.Rhse3j9PM4@amdc1976>
User-Agent: KMail/4.13.3 (Linux/3.13.0-57-generic; KDE/4.13.3; x86_64; ; )
In-reply-to: <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com>
MIME-version: 1.0
Content-transfer-encoding: 7Bit
Content-type: text/plain; charset=us-ascii
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

Some comments below.

On Tuesday, June 16, 2015 03:09:30 PM Beata Michalska wrote:
> Introduce configurable generic interface for file
> system-wide event notifications, to provide file
> systems with a common way of reporting any potential
> issues as they emerge.
>
> The notifications are to be issued through generic
> netlink interface by newly introduced multicast group.
>
> Threshold notifications have been included, allowing
> triggering an event whenever the amount of free space drops
> below a certain level - or levels to be more precise as two
> of them are being supported: the lower and the upper range.
> The notifications work both ways: once the threshold level
> has been reached, an event shall be generated whenever
> the number of available blocks goes up again re-activating
> the threshold.
>
> The interface has been exposed through a vfs. Once mounted,
> it serves as an entry point for the set-up where one can
> register for particular file system events.
> > Signed-off-by: Beata Michalska > --- > Documentation/filesystems/events.txt | 232 ++++++++++ > fs/Kconfig | 2 + > fs/Makefile | 1 + > fs/events/Kconfig | 7 + > fs/events/Makefile | 5 + > fs/events/fs_event.c | 809 ++++++++++++++++++++++++++++++++++ > fs/events/fs_event.h | 22 + > fs/events/fs_event_netlink.c | 104 +++++ > fs/namespace.c | 1 + > include/linux/fs.h | 6 +- > include/linux/fs_event.h | 72 +++ > include/uapi/linux/Kbuild | 1 + > include/uapi/linux/fs_event.h | 58 +++ > 13 files changed, 1319 insertions(+), 1 deletion(-) > create mode 100644 Documentation/filesystems/events.txt > create mode 100644 fs/events/Kconfig > create mode 100644 fs/events/Makefile > create mode 100644 fs/events/fs_event.c > create mode 100644 fs/events/fs_event.h > create mode 100644 fs/events/fs_event_netlink.c > create mode 100644 include/linux/fs_event.h > create mode 100644 include/uapi/linux/fs_event.h > > diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt > new file mode 100644 > index 0000000..c2e6227 > --- /dev/null > +++ b/Documentation/filesystems/events.txt > @@ -0,0 +1,232 @@ > + > + Generic file system event notification interface > + > +Document created 23 April 2015 by Beata Michalska > + > +1. The reason behind: > +===================== > + > +There are many corner cases when things might get messy with the filesystems. > +And it is not always obvious what and when went wrong. Sometimes you might > +get some subtle hints that there is something going on - but by the time > +you realise it, it might be too late as you are already out-of-space > +or the filesystem has been remounted as read-only (i.e.). The generic > +interface for the filesystem events fills the gap by providing a rather > +easy way of real-time notifications triggered whenever something interesting > +happens, allowing filesystems to report events in a common way, as they occur. > + > +2. 
How does it work:
> +====================
> +
> +The interface itself has been exposed as fstrace-type Virtual File System,
> +primarily to ease the process of setting up the configuration for the
> +notifications. So for starters, it needs to get mounted (obviously):
> +
> +	mount -t fstrace none /sys/fs/events
> +
> +This will unveil the single fstrace filesystem entry - the 'config' file,
> +through which the notification are being set-up.

The patch creates a separate virtual filesystem for a single file. This is
overkill IMHO; a new sysfs or debugfs entry should be sufficient.

> +
> +Activating notifications for particular filesystem is as straightforward
> +as writing into the 'config' file. Note that by default all events, despite
> +the actual filesystem type, are being disregarded.
> +
> +Synopsis of config:
> +------------------
> +
> +	MOUNT EVENT_TYPE [L1] [L2]

OTOH, why not use the advantages of having a separate virtual filesystem:
create separate directories for each mount point (plus maybe even extra
parent directories for mount namespaces) and put separate entries for each
event type in these directories? This would also allow use of the eventfd()
notification interface on such files. Please take a look at
tools/cgroup/cgroup_event_listener.c and
Documentation/cgroups/memcg_test.txt (point 9.10) to see how much easier it
is to observe memory usage thresholds on memory cgroups than available
blocks on filesystems using fs events.

Also, while at it, please add your example user-space code (posted on
request in some other mail) to tools/fs_events/ (preferably in a separate
patch).
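
For context, the eventfd() mechanics that the suggestion above relies on can
be sketched in plain userspace C. This is only a sketch under stated
assumptions: post_events() and wait_events() are hypothetical helper names,
the signalling side would be done by the kernel in the proposed design, and
nothing below is part of the posted patch set or touches /sys/fs/events.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: add n to the eventfd counter.
 * In the design discussed above the kernel would do this whenever
 * a watched event fires; here userspace stands in for it.
 */
static int post_events(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Hypothetical helper: block until the counter is non-zero, then
 * return its value. Without EFD_SEMAPHORE a read also resets the
 * counter to zero, so pending notifications are coalesced.
 */
static int64_t wait_events(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}
```

A listener would create the descriptor with eventfd(0, 0), register it with
the kernel through whatever file the interface ends up providing
(cgroup_event_listener.c does this by writing to cgroup.event_control), and
then block in wait_events().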
> +
> + MOUNT      : the filesystem's mount point
> + EVENT_TYPE : event types - currently two of them are being supported:
> +
> +	* generic events ("G") covering most common warnings
> +	  and errors that might be reported by any filesystem;
> +	  this option does not take any arguments;

fs_event.h in the uapi dir allows the following events:

/*
 * Supported set of FS events
 */
enum {
	FS_EVENT_NONE,
	FS_WARN_ENOSPC,		/* No space left to reserve data blks */
	FS_WARN_ENOSPC_META,	/* No space left for metadata */
	FS_THR_LRBELOW,		/* The threshold lower range has been reached */
	FS_THR_LRABOVE,		/* The threshold lower range re-activcated*/
	FS_THR_URBELOW,
	FS_THR_URABOVE,
	FS_ERR_REMOUNT_RO,	/* The file system has been remounted as RO */
	FS_ERR_CORRUPTED	/* Critical error - fs corrupted */
};

For non-threshold events the current interface only allows enabling either
all events or none, i.e. you cannot selectively enable notification on
FS_WARN_ENOSPC but not on FS_ERR_REMOUNT_RO.

I also think that the configuration interface should be made to match the
notification interface when it comes to event types.

> +
> +	* threshold notifications ("T") - events sent whenever
> +	  the amount of available space drops below certain level;
> +	  it is possible to specify two threshold levels though
> +	  only one is required to properly setup the notifications;
> +	  as those refer to the number of available blocks, the lower
> +	  level [L1] needs to be higher than the upper one [L2]

Why is there a limitation of only two thresholds? It should be relatively
easy to make the code support an unlimited number of thresholds.

> +
> +Sample request could look like the following:
> +
> + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config
> +
> +Multiple request might be specified provided they are separated with semicolon.

s/request/requests/

I think that allowing multiple event types and requests in one
configuration request is not a good idea.
Currently parsing code is relatively simple but once somebody decides to enhance the interface with new event types the parsing code may get complex & ugly. > + > +The configuration itself might be modified at any time. One can add/remove > +particular event types for given fielsystem, modify the threshold levels, s/fielsystem/filesystem/ > +and remove single or all entries from the 'config' file. > + > + - Adding new event type: > + > + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config > + > +(Note that is is enough to provide the event type to be enabled without s/is is/is/ > +the already set ones.) > + > + - Removing event type: > + > + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config > + > + - Updating threshold limits: > + > + $ echo MOUNT T L1 L2 > /sys/fs/events/config > + > + - Removing single entry: > + > + $ echo '!MOUNT' > /sys/fs/events/config > + > + - Removing all entries: > + > + $ echo > /sys/fs/events/config > + > +Reading the file will list all registered entries with their current set-up > +along with some additional info like the filesystem type and the backing device > +name if available. > + > +Final, though a very important note on the configuration: when and if the > +actual events are being triggered falls way beyond the scope of the generic > +filesystem events interface. It is up to a particular filesystem > +implementation which events are to be supported - if any at all. So if > +given filesystem does not support the event notifications, an attempt to > +enable those through 'config' file will fail. > + > + > +3. The generic netlink interface support: > +========================================= > + > +Whenever an event notification is triggered (by given filesystem) the current > +configuration is being validated to decide whether a userpsace notification s/userpsace/userspace/ > +should be launched. If there has been no request (in a mean of 'config' file > +entry) for given event, one will be silently disregarded. 
If, on the other > +hand, someone is 'watching' given filesystem for specific events, a generic > +netlink message will be sent. A dedicated multicast group has been provided > +solely for this purpose so in order to receive such notifications, one should > +subscribe to this new multicast group. As for now only the init network > +namespace is being supported. > + > +3.1 Message format > + > +The FS_NL_C_EVENT shall be stored within the generic netlink message header > +as the command field. The message payload will provide more detailed info: > +the backing device major and minor numbers, the event code and the id of > +the process which action led to the event occurrence. In case of threshold > +notifications, the current number of available blocks will be included > +in the payload as well. > + > + > + 0 1 2 3 > + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | NETLINK MESSAGE HEADER | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | GENERIC NETLINK MESSAGE HEADER | > + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | Optional user specific message header | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + | GENERIC MESSAGE PAYLOAD: | > + +---------------------------------------------------------------+ > + | FS_NL_A_EVENT_ID (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DEV_MAJOR (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DEV_MINOR (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_CAUSED_ID (NLA_U32) | > + +---------------------------------------------------------------+ > + | FS_NL_A_DATA (NLA_U64) | > + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > + > + > +The above figure 
is based on: > + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format > + > + > +4. API Reference: > +================= > + > + 4.1 Generic file system event interface data & operations > + > + #include > + > + struct fs_trace_info { > + void __rcu *e_priv /* READ ONLY */ It should be marked as private to the fs events core code, not for use by filesystems' code. If possible, it would be best to move it out of this struct. > + unsigned int events_cap_mask; /* Supported notifications */ > + const struct fs_trace_operations *ops; > + }; > + > + struct fs_trace_operations { > + void (*query)(struct super_block *, u64 *); > + }; > + > + In order to get the fireworks and stuff, each filesystem needs to setup > + the events_cap_mask field of the fs_trace_info structure, which has been > + embedded within the super_block structure. This should reflect the type of > + events the filesystem wants to support. In case of threshold notifications, > + apart from setting the FS_EVENT_THRESH flag, the 'query' callback should > + be provided as this enables the events interface to get the up-to-date > + state of the number of available blocks whenever those notifications are > + being requested. > + > + The 'e_priv' field of the fs_trace_info structure should be completely ignored > + as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you > + do not want to get yourself into some real trouble. If still, you are tempted > + to do so - feel free, it's gonna be pure fun. Consider yourself warned. > + > + > + 4.2 Event notification: > + > + #include > + void fs_event_notify(struct super_block *sb, unsigned int event_id); > + > + Notify the generic FS event interface of an occurring event. > + This shall be used by any file system that wishes to inform any potential > + listeners/watchers of a particular event.
> + - sb: the filesystem's super block > + - event_id: an event identifier > + > + 4.3 Threshold notifications: > + > + #include > + void fs_event_alloc_space(struct super_block *sb, u64 ncount); > + void fs_event_free_space(struct super_block *sb, u64 ncount); > + > + Each filesystme supporting the threshold notifications should call > + fs_event_alloc_space/fs_event_free_space respectively whenever the > + amount of available blocks changes. > + - sb: the filesystem's super block > + - ncount: number of blocks being acquired/released > + > + Note that to properly handle the threshold notifications the fs events > + interface needs to be kept up to date by the filesystems. Each should > + register fs_trace_operations to enable querying the current number of > + available blocks. > + > + 4.4 Sending message through generic netlink interface > + > + #include > + > + int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata); > + > + Although the fs event interface is fully responsible for sending the messages > + over the netlink, filesystems might use the FS_EVENT multicast group to send > + their own custom messages. > + - size: the size of the message payload > + - event_id: the event identifier > + - compose_msg: a callback responsible for filling-in the message payload > + - cbdata: message custom data > + > + Calling fs_netlink_send_event will result in a message being sent by > + the FS_EVENT multicast group. Note that the body of the message should be > + prepared (set-up )by the caller - through compose_msg callback. The message's (set-up) > + sk_buff will be allocated on behalf of the caller (thus the size parameter). > + The compose_msg should only fill the payload with proper data. Unless > + the event id is specified as FS_EVENT_NONE, it's value shall be added > + to the payload prior to calling the compose_msg. 
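It might help readers of section 4.4 to see how a caller would arrive at the 'size' argument. Below is a minimal userspace sketch, assuming the usual netlink attribute layout (a 4-byte attribute header, with header plus payload padded to a 4-byte boundary). The helpers attr_align(), attr_total_size() and common_payload_size() are hypothetical names made up for illustration; in the kernel the counterpart of attr_total_size() is nla_total_size(). It is not meant as the way the documentation has to present this, just one possible illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed netlink attribute layout: 4-byte header, 4-byte alignment. */
#define ATTR_HDRLEN  4u
#define ATTR_ALIGNTO 4u

/* Round len up to the next multiple of ATTR_ALIGNTO. */
static size_t attr_align(size_t len)
{
	return (len + ATTR_ALIGNTO - 1) & ~(size_t)(ATTR_ALIGNTO - 1);
}

/* Hypothetical userspace stand-in for the kernel's nla_total_size(). */
static size_t attr_total_size(size_t payload)
{
	/* Guard against wrap-around before adding header and padding. */
	assert(payload <= SIZE_MAX - ATTR_HDRLEN - ATTR_ALIGNTO);
	return attr_align(ATTR_HDRLEN + payload);
}

/*
 * Size a caller would pass for a message carrying two u32 attributes
 * (device major and minor) plus one u64 attribute, mirroring the
 * expression used by fs_event_send() in this patch.
 */
static size_t common_payload_size(void)
{
	return attr_total_size(sizeof(uint32_t)) * 2 +
	       attr_total_size(sizeof(uint64_t));
}
```

Under these assumptions a u32 attribute occupies 8 bytes and a u64 attribute 12 bytes, so the common message needs 28 bytes of payload and the threshold variant (one more u64) needs 40. Note that the documentation lists FS_NL_A_CAUSED_ID as NLA_U32 while the code further down adds it with nla_put_u64(); the sketch follows the code.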
> + > + > diff --git a/fs/Kconfig b/fs/Kconfig > index ec35851..a89e678 100644 > --- a/fs/Kconfig > +++ b/fs/Kconfig > @@ -69,6 +69,8 @@ config FILE_LOCKING > for filesystems like NFS and for the flock() system > call. Disabling this option saves about 11k. > > +source "fs/events/Kconfig" > + > source "fs/notify/Kconfig" > > source "fs/quota/Kconfig" > diff --git a/fs/Makefile b/fs/Makefile > index a88ac48..bcb3048 100644 > --- a/fs/Makefile > +++ b/fs/Makefile > @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules > obj-$(CONFIG_CEPH_FS) += ceph/ > obj-$(CONFIG_PSTORE) += pstore/ > obj-$(CONFIG_EFIVAR_FS) += efivarfs/ > +obj-$(CONFIG_FS_EVENTS) += events/ > diff --git a/fs/events/Kconfig b/fs/events/Kconfig > new file mode 100644 > index 0000000..1c60195 > --- /dev/null > +++ b/fs/events/Kconfig > @@ -0,0 +1,7 @@ > +# Generic Files System events interface > +config FS_EVENTS > + bool "Generic filesystem events" > + select NET > + default y Do we really want to default to yes? [ If so then maybe we want to make the config option visible only when EXPERT mode is enabled? ] > + help > + Enable generic filesystem events interface Please enhance the help entry. > diff --git a/fs/events/Makefile b/fs/events/Makefile > new file mode 100644 > index 0000000..9c98337 > --- /dev/null > +++ b/fs/events/Makefile > @@ -0,0 +1,5 @@ > +# > +# Makefile for the Linux Generic File System Event Interface > +# > + > +obj-y := fs_event.o fs_event_netlink.o > diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c > new file mode 100644 > index 0000000..1037311 > --- /dev/null > +++ b/fs/events/fs_event.c > @@ -0,0 +1,809 @@ > +/* > + * Generic File System Evens Interface > + * > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. 
> + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "../pnode.h" > +#include "fs_event.h" > + > +static LIST_HEAD(fs_trace_list); > +static DEFINE_MUTEX(fs_trace_lock); > + > +static struct kmem_cache *fs_trace_cachep __read_mostly; > + > +static atomic_t stray_traces = ATOMIC_INIT(0); > +static DECLARE_WAIT_QUEUE_HEAD(trace_wq); > +/* > + * Threshold notification state bits. > + * Note the reverse as this refers to the number > + * of available blocks. > + */ > +#define THRESH_LR_BELOW 0x0001 /* Falling below the lower range */ > +#define THRESH_LR_BEYOND 0x0002 > +#define THRESH_UR_BELOW 0x0004 > +#define THRESH_UR_BEYOND 0x0008 /* Going beyond the upper range */ > + > +#define THRESH_LR_ON (THRESH_LR_BELOW | THRESH_LR_BEYOND) > +#define THRESH_UR_ON (THRESH_UR_BELOW | THRESH_UR_BEYOND) > + > +#define FS_TRACE_ADD 0x100000 > + > +struct fs_trace_entry { > + struct kref count; > + atomic_t active; > + struct super_block *sb; > + unsigned int notify; > + struct path mnt_path; > + struct list_head node; > + > + struct fs_event_thresh { > + u64 avail_space; > + u64 lrange; > + u64 urange; > + unsigned int state; > + } th; > + struct rcu_head rcu_head; > + spinlock_t lock; > +}; > + > +static const match_table_t fs_etypes = { > + { FS_EVENT_GENERIC, "G" }, > + { FS_EVENT_THRESH, "T" }, > + { 0, NULL }, > +}; > + > +static inline int fs_trace_query_data(struct super_block *sb, > + struct fs_trace_entry *en) > +{ > + if (sb->s_etrace.ops && sb->s_etrace.ops->query) { > + sb->s_etrace.ops->query(sb, 
&en->th.avail_space); > + return 0; > + } > + > + return -EINVAL; > +} > + > +static inline void fs_trace_entry_free(struct fs_trace_entry *en) I don't see a real need for this wrapper (it is used only once). > +{ > + kmem_cache_free(fs_trace_cachep, en); > +} > + > +static void fs_destroy_trace_entry(struct kref *en_ref) > +{ > + struct fs_trace_entry *en = container_of(en_ref, > + struct fs_trace_entry, count); > + > + /* Last reference has been dropped */ > + fs_trace_entry_free(en); > + atomic_dec(&stray_traces); > +} > + > +static void fs_trace_entry_put(struct fs_trace_entry *en) > +{ > + kref_put(&en->count, fs_destroy_trace_entry); > +} > + > +static void fs_release_trace_entry(struct rcu_head *rcu_head) > +{ > + struct fs_trace_entry *en = container_of(rcu_head, > + struct fs_trace_entry, > + rcu_head); > + /* > + * As opposed to typical reference drop, this one is being > + * called from the rcu callback. This is to make sure all > + * readers have managed to safely grab the reference before > + * the change to rcu pointer is visible to all and before > + * the reference is dropped here. > + */ > + fs_trace_entry_put(en); > +} > + > +static void fs_drop_trace_entry(struct fs_trace_entry *en) > +{ > + struct super_block *sb; > + > + lockdep_assert_held(&fs_trace_lock); > + /* > + * The trace entry might have already been removed > + * from the list of active traces with the proper > + * ref drop, though it was still in use handling > + * one of the fs events. This means that the object > + * has been already scheduled for being released. > + * So leave... > + */ > + > + if (!atomic_add_unless(&en->active, -1, 0)) > + return; > + /* > + * At this point the trace entry is being marked as inactive > + * so no new references will be allowed. > + * Still it might be floating around somewhere > + * so drop the reference when the rcu readers are done. 
> + */ > + spin_lock(&en->lock); > + list_del(&en->node); > + sb = en->sb; > + en->sb = NULL; > + spin_unlock(&en->lock); > + > + rcu_assign_pointer(sb->s_etrace.e_priv, NULL); > + call_rcu(&en->rcu_head, fs_release_trace_entry); > + /* It's safe now to drop the reference to the super */ > + deactivate_super(sb); > + atomic_inc(&stray_traces); > +} > + > +static inline > +struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en) > +{ > + if (en) { > + if (!kref_get_unless_zero(&en->count)) > + return NULL; > + /* Don't allow referencing inactive object */ > + if (!atomic_read(&en->active)) { > + fs_trace_entry_put(en); > + return NULL; > + } > + } > + return en; > +} > + > +static struct fs_trace_entry *fs_trace_entry_get_rcu(struct super_block *sb) > +{ > + struct fs_trace_entry *en; > + > + if (!sb) > + return NULL; > + > + rcu_read_lock(); > + en = rcu_dereference(sb->s_etrace.e_priv); > + en = fs_trace_entry_get(en); > + rcu_read_unlock(); > + > + return en; > +} > + > +static int fs_remove_trace_entry(struct super_block *sb) > +{ > + struct fs_trace_entry *en; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return -EINVAL; > + > + mutex_lock(&fs_trace_lock); > + fs_drop_trace_entry(en); > + mutex_unlock(&fs_trace_lock); > + fs_trace_entry_put(en); > + return 0; > +} > + > +static void fs_remove_all_traces(void) > +{ > + struct fs_trace_entry *en, *guard; > + > + mutex_lock(&fs_trace_lock); > + list_for_each_entry_safe(en, guard, &fs_trace_list, node) > + fs_drop_trace_entry(en); > + mutex_unlock(&fs_trace_lock); > +} > + > +static int create_common_msg(struct sk_buff *skb, void *data) > +{ > + struct fs_trace_entry *en = (struct fs_trace_entry *)data; > + struct super_block *sb = en->sb; > + > + if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev)) > + || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev))) > + return -EINVAL; > + > + if (nla_put_u64(skb, FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current)))) > + return -EINVAL; > + > + 
return 0; > +} > + > +static int create_thresh_msg(struct sk_buff *skb, void *data) > +{ > + struct fs_trace_entry *en = (struct fs_trace_entry *)data; > + int ret; > + > + ret = create_common_msg(skb, data); > + if (!ret) > + ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space); > + return ret; > +} > + > +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id) > +{ > + size_t size = nla_total_size(sizeof(u32)) * 2 + > + nla_total_size(sizeof(u64)); > + > + fs_netlink_send_event(size, event_id, create_common_msg, en); > +} > + > +static void fs_event_send_thresh(struct fs_trace_entry *en, > + unsigned int event_id) > +{ > + size_t size = nla_total_size(sizeof(u32)) * 2 + > + nla_total_size(sizeof(u64)) * 2; > + > + fs_netlink_send_event(size, event_id, create_thresh_msg, en); > +} > + > +void fs_event_notify(struct super_block *sb, unsigned int event_id) > +{ > + struct fs_trace_entry *en; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; > + > + spin_lock(&en->lock); > + if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC)) > + fs_event_send(en, event_id); > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > +} > +EXPORT_SYMBOL(fs_event_notify); > + > +void fs_event_alloc_space(struct super_block *sb, u64 ncount) > +{ > + struct fs_trace_entry *en; > + s64 count; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; > + > + spin_lock(&en->lock); > + > + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) > + goto leave; > + /* > + * we shouldn't drop below 0 here, > + * unless there is a sync issue somewhere (?) > + */ > + count = en->th.avail_space - ncount; > + en->th.avail_space = count < 0 ? 
0 : count; > + > + if (en->th.avail_space > en->th.lrange) > + /* Not 'even' close - leave */ > + goto leave; > + > + if (en->th.avail_space > en->th.urange) { > + /* Close enough - the lower range has been reached */ > + if (!(en->th.state & THRESH_LR_BEYOND)) { > + /* Send notification */ > + fs_event_send_thresh(en, FS_THR_LRBELOW); > + en->th.state &= ~THRESH_LR_BELOW; > + en->th.state |= THRESH_LR_BEYOND; > + } > + goto leave; > + } > + if (!(en->th.state & THRESH_UR_BEYOND)) { > + fs_event_send_thresh(en, FS_THR_URBELOW); > + en->th.state &= ~THRESH_UR_BELOW; > + en->th.state |= THRESH_UR_BEYOND; > + } > + > +leave: > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > +} > +EXPORT_SYMBOL(fs_event_alloc_space); > + > +void fs_event_free_space(struct super_block *sb, u64 ncount) > +{ > + struct fs_trace_entry *en; > + > + en = fs_trace_entry_get_rcu(sb); > + if (!en) > + return; > + > + spin_lock(&en->lock); > + > + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) > + goto leave; > + > + en->th.avail_space += ncount; > + > + if (en->th.avail_space > en->th.lrange) { > + if (!(en->th.state & THRESH_LR_BELOW) > + && en->th.state & THRESH_LR_BEYOND) { > + /* Send notification */ > + fs_event_send_thresh(en, FS_THR_LRABOVE); > + en->th.state &= ~(THRESH_LR_BEYOND|THRESH_UR_BEYOND); > + en->th.state |= THRESH_LR_BELOW; > + goto leave; > + } > + } > + if (en->th.avail_space > en->th.urange) { > + if (!(en->th.state & THRESH_UR_BELOW) > + && en->th.state & THRESH_UR_BEYOND) { > + /* Notify */ > + fs_event_send_thresh(en, FS_THR_URABOVE); > + en->th.state &= ~THRESH_UR_BEYOND; > + en->th.state |= THRESH_UR_BELOW; > + } > + } > +leave: > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > +} > +EXPORT_SYMBOL(fs_event_free_space); > + > +void fs_event_mount_dropped(struct vfsmount *mnt) > +{ > + /* > + * The mount is dropped but the super might not get released > + * at once so there is very small chance some notifications > + * will come 
through. > + * Note that the mount being dropped here might belong to a different > + * namespace - if this is the case, just ignore it. > + */ > + struct fs_trace_entry *en = fs_trace_entry_get_rcu(mnt->mnt_sb); > + struct vfsmount *en_mnt; > + > + if (!en || !atomic_read(&en->active)) > + return; > + /* > + * The entry once set, does not change the mountpoint it's being > + * pinned to, so no need to take the lock here. > + */ > + en_mnt = en->mnt_path.mnt; > + if (!(real_mount(mnt)->mnt_ns != (real_mount(en_mnt))->mnt_ns)) > + fs_remove_trace_entry(mnt->mnt_sb); > + fs_trace_entry_put(en); > +} > + > +static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh, > + unsigned int nmask) > +{ > + struct fs_trace_entry *en; > + struct super_block *sb; > + struct mount *r_mnt; > + > + en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL); > + if (unlikely(!en)) > + return -ENOMEM; > + /* > + * Note that no reference is being taken here for the path as it would > + * make the unmount unnecessarily puzzling (due to an extra 'valid' > + * reference for the mnt). > + * This is *rather* safe as the notification on mount being dropped > + * will get called prior to releasing the super block - so right > + * in time to perform appropriate clean-up > + */ > + r_mnt = real_mount(path->mnt); > + > + en->mnt_path.dentry = r_mnt->mnt.mnt_root; > + en->mnt_path.mnt = &r_mnt->mnt; > + > + sb = path->mnt->mnt_sb; > + en->sb = sb; > + /* > + * Increase the refcount for sb to mark it's being relied on. > + * Note that the reference to path is taken by the caller, so it > + * is safe to assume there is at least single active reference > + * to super as well. 
> + */ > + atomic_inc(&sb->s_active); > + > + nmask &= sb->s_etrace.events_cap_mask; > + if (!nmask) > + goto leave; > + > + spin_lock_init(&en->lock); > + INIT_LIST_HEAD(&en->node); > + > + en->notify = nmask; > + memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state)); > + if (nmask & FS_EVENT_THRESH) > + fs_trace_query_data(sb, en); > + > + kref_init(&en->count); > + > + if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) { > + struct fs_trace_entry *prev_en; > + > + prev_en = fs_trace_entry_get_rcu(sb); > + if (prev_en) { > + WARN_ON(prev_en); > + fs_trace_entry_put(prev_en); > + goto leave; > + } > + } > + atomic_set(&en->active, 1); > + > + mutex_lock(&fs_trace_lock); > + list_add(&en->node, &fs_trace_list); > + mutex_unlock(&fs_trace_lock); > + > + rcu_assign_pointer(sb->s_etrace.e_priv, en); > + synchronize_rcu(); > + > + return 0; > +leave: > + deactivate_super(sb); > + kmem_cache_free(fs_trace_cachep, en); > + return -EINVAL; > +} > + > +static int fs_update_trace_entry(struct path *path, > + struct fs_event_thresh *thresh, > + unsigned int nmask) > +{ > + struct fs_trace_entry *en; > + struct super_block *sb; > + int extend = nmask & FS_TRACE_ADD; > + int ret = -EINVAL; > + > + en = fs_trace_entry_get_rcu(path->mnt->mnt_sb); > + if (!en) > + return (extend) ? 
fs_new_trace_entry(path, thresh, nmask) > + : -EINVAL; > + > + if (!atomic_read(&en->active)) > + return -EINVAL; > + > + nmask &= ~FS_TRACE_ADD; > + > + spin_lock(&en->lock); > + sb = en->sb; > + if (!sb || !(nmask & sb->s_etrace.events_cap_mask)) > + goto leave; > + > + if (nmask & FS_EVENT_THRESH) { > + if (extend) { > + /* Get the current state */ > + if (!(en->notify & FS_EVENT_THRESH)) > + if (fs_trace_query_data(sb, en)) > + goto leave; > + > + if (thresh->state & THRESH_LR_ON) { > + en->th.lrange = thresh->lrange; > + en->th.state &= ~THRESH_LR_ON; > + } > + > + if (thresh->state & THRESH_UR_ON) { > + en->th.urange = thresh->urange; > + en->th.state &= ~THRESH_UR_ON; > + } > + } else { > + memset(&en->th, 0, sizeof(en->th)); > + } > + } > + > + if (extend) > + en->notify |= nmask; > + else > + en->notify &= ~nmask; > + ret = 0; > +leave: > + spin_unlock(&en->lock); > + fs_trace_entry_put(en); > + return ret; > +} > + > +static int fs_parse_trace_request(int argc, char **argv) > +{ > + struct fs_event_thresh thresh = {0}; > + struct path path; > + substring_t args[MAX_OPT_ARGS]; > + unsigned int nmask = FS_TRACE_ADD; > + int token; > + char *s; > + int ret = -EINVAL; > + > + if (!argc) { > + fs_remove_all_traces(); > + return 0; > + } > + > + s = *(argv); > + if (*s == '!') { > + /* Clear the trace entry */ > + nmask &= ~FS_TRACE_ADD; > + ++s; > + } > + > + if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW)) > + return -EINVAL; > + > + if (!(--argc)) { > + if (!(nmask & FS_TRACE_ADD)) > + ret = fs_remove_trace_entry(path.mnt->mnt_sb); > + goto leave; > + } > + > +repeat: > + args[0].to = args[0].from = NULL; > + token = match_token(*(++argv), fs_etypes, args); > + if (!token && !nmask) > + goto leave; > + > + nmask |= token & FS_EVENTS_ALL; > + --argc; > + if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) { > + /* > + * Get the threshold config data: > + * lower range > + * upper range > + */ > + if (!argc) > + goto leave; > + > + ret = 
kstrtoull(*(++argv), 10, &thresh.lrange); > + if (ret) > + goto leave; > + thresh.state |= THRESH_LR_ON; > + if ((--argc)) { > + ret = kstrtoull(*(++argv), 10, &thresh.urange); > + if (ret) > + goto leave; > + thresh.state |= THRESH_UR_ON; > + --argc; > + } > + /* The thresholds are based on number of available blocks */ > + if (thresh.lrange < thresh.urange) { > + ret = -EINVAL; > + goto leave; > + } > + } > + if (argc) > + goto repeat; > + > + ret = fs_update_trace_entry(&path, &thresh, nmask); > +leave: > + path_put(&path); > + return ret; > +} > + > +#define DEFAULT_BUF_SIZE PAGE_SIZE > + > +static ssize_t fs_trace_write(struct file *file, const char __user *buffer, > + size_t count, loff_t *ppos) > +{ > + char **argv; > + char *kern_buf, *next, *cfg; > + size_t size, dcount = 0; > + int argc; > + > + if (!count) > + return 0; > + > + kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL); > + if (!kern_buf) > + return -ENOMEM; > + > + while (dcount < count) { > + > + size = count - dcount; > + if (size >= DEFAULT_BUF_SIZE) > + size = DEFAULT_BUF_SIZE - 1; > + if (copy_from_user(kern_buf, buffer + dcount, size)) { > + dcount = -EINVAL; > + goto leave; > + } > + > + kern_buf[size] = '\0'; > + > + next = cfg = kern_buf; > + > + do { > + next = strchr(cfg, ';'); > + if (next) > + *next = '\0'; > + > + argv = argv_split(GFP_KERNEL, cfg, &argc); > + if (!argv) { > + dcount = -ENOMEM; > + goto leave; > + } > + > + if (fs_parse_trace_request(argc, argv)) { > + dcount = -EINVAL; > + argv_free(argv); > + goto leave; > + } > + > + argv_free(argv); > + if (next) > + cfg = ++next; > + > + } while (next); > + dcount += size; > + } > +leave: > + kfree(kern_buf); > + return dcount; > +} > + > +static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos) > +{ > + mutex_lock(&fs_trace_lock); > + return seq_list_start(&fs_trace_list, *pos); > +} > + > +static void *fs_trace_seq_next(struct seq_file *m, void *v, loff_t *pos) > +{ > + return seq_list_next(v, &fs_trace_list, pos); > 
+} > + > +static void fs_trace_seq_stop(struct seq_file *m, void *v) > +{ > + mutex_unlock(&fs_trace_lock); > +} > + > +static int fs_trace_seq_show(struct seq_file *m, void *v) > +{ > + struct fs_trace_entry *en; > + struct super_block *sb; > + struct mount *r_mnt; > + const struct match_token *match; > + unsigned int nmask; > + > + en = list_entry(v, struct fs_trace_entry, node); > + /* Do not show the entries outside current mount namespace */ > + r_mnt = real_mount(en->mnt_path.mnt); > + if (r_mnt->mnt_ns != current->nsproxy->mnt_ns) { > + if (!__is_local_mountpoint(r_mnt->mnt_mountpoint)) > + return 0; > + } > + > + sb = en->sb; > + > + seq_path(m, &en->mnt_path, "\t\n\\"); > + seq_putc(m, ' '); > + > + seq_escape(m, sb->s_type->name, " \t\n\\"); > + if (sb->s_subtype && sb->s_subtype[0]) { > + seq_putc(m, '.'); > + seq_escape(m, sb->s_subtype, " \t\n\\"); > + } > + > + seq_putc(m, ' '); > + if (sb->s_op->show_devname) { > + sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root); > + } else { > + seq_escape(m, r_mnt->mnt_devname ? 
r_mnt->mnt_devname : "none", > + " \t\n\\"); > + } > + seq_puts(m, " ("); > + > + nmask = en->notify; > + for (match = fs_etypes; match->pattern; ++match) { > + if (match->token & nmask) { > + seq_puts(m, match->pattern); > + nmask &= ~match->token; > + if (nmask) > + seq_putc(m, ','); > + } > + } > + seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); > + seq_puts(m, ")\n"); > + return 0; > +} > + > +static const struct seq_operations fs_trace_seq_ops = { > + .start = fs_trace_seq_start, > + .next = fs_trace_seq_next, > + .stop = fs_trace_seq_stop, > + .show = fs_trace_seq_show, > +}; > + > +static int fs_trace_open(struct inode *inode, struct file *file) > +{ > + return seq_open(file, &fs_trace_seq_ops); > +} > + > +static const struct file_operations fs_trace_fops = { > + .owner = THIS_MODULE, > + .open = fs_trace_open, > + .write = fs_trace_write, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = seq_release, > +}; > + > +static int fs_trace_init(void) > +{ > + fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0); > + if (!fs_trace_cachep) > + return -EINVAL; > + init_waitqueue_head(&trace_wq); > + return 0; > +} > + > +/* VFS support */ > +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen) > +{ > + int ret; > + static struct tree_descr desc[] = { > + [2] = { > + .name = "config", > + .ops = &fs_trace_fops, > + .mode = S_IWUSR | S_IRUGO, > + }, > + {""}, > + }; > + > + ret = simple_fill_super(sb, 0x7246332, desc); Please use a define for a magic number. > + return !ret ? 
fs_trace_init() : ret; > +} > + > +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, > + int ntype, const char *dev_name, void *data) > +{ > + return mount_single(fs_type, ntype, data, fs_trace_fill_super); > +} > + > +static void fs_trace_kill_super(struct super_block *sb) > +{ > + /* > + * The rcu_barrier here will/should make sure all call_rcu > + * callbacks are completed - still there might be some active > + * trace objects in use which can make calling the > + * kmem_cache_destroy unsafe. So we wait until all traces > + * are finally released. > + */ > + fs_remove_all_traces(); > + rcu_barrier(); > + wait_event(trace_wq, !atomic_read(&stray_traces)); > + > + kmem_cache_destroy(fs_trace_cachep); > + kill_litter_super(sb); > +} > + > +static struct kset *fs_trace_kset; > + > +static struct file_system_type fs_trace_fstype = { > + .name = "fstrace", > + .mount = fs_trace_do_mount, > + .kill_sb = fs_trace_kill_super, > +}; > + > +static void __init fs_trace_vfs_init(void) > +{ > + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); > + > + if (!fs_trace_kset) > + return; > + > + if (!register_filesystem(&fs_trace_fstype)) { > + if (!fs_event_netlink_register()) > + return; > + unregister_filesystem(&fs_trace_fstype); > + } > + kset_unregister(fs_trace_kset); > +} > + > +static int __init fs_trace_evens_init(void) > +{ > + fs_trace_vfs_init(); > + return 0; > +}; > +module_init(fs_trace_evens_init); > + > diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h > new file mode 100644 > index 0000000..23f24c8 > --- /dev/null > +++ b/fs/events/fs_event.h > @@ -0,0 +1,22 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. 
> + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > + > +#ifndef __GENERIC_FS_EVENTS_H > +#define __GENERIC_FS_EVENTS_H > + > +int fs_event_netlink_register(void); > +void fs_event_netlink_unregister(void); > + > +#endif /* __GENERIC_FS_EVENTS_H */ > diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c > new file mode 100644 > index 0000000..0c97eb7 > --- /dev/null > +++ b/fs/events/fs_event_netlink.c > @@ -0,0 +1,104 @@ > +/* > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. 
> + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "fs_event.h" > + > +static const struct genl_multicast_group fs_event_mcgroups[] = { > + { .name = FS_EVENTS_MCAST_GRP_NAME, }, > +}; > + > +static struct genl_family fs_event_family = { > + .id = GENL_ID_GENERATE, > + .name = FS_EVENTS_FAMILY_NAME, > + .version = 1, > + .maxattr = FS_NL_A_MAX, > + .mcgrps = fs_event_mcgroups, > + .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups), > +}; > + > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), > + void *cbdata) > +{ > + static atomic_t seq; > + struct sk_buff *skb; > + void *msg_head; > + int ret = 0; > + > + if (!size || !compose_msg) > + return -EINVAL; > + > + /* Skip if there are no listeners */ > + if (!genl_has_listeners(&fs_event_family, &init_net, 0)) > + return 0; > + > + if (event_id != FS_EVENT_NONE) > + size += nla_total_size(sizeof(u32)); > + size += nla_total_size(sizeof(u64)); > + skb = genlmsg_new(size, GFP_NOWAIT); > + > + if (!skb) { > + pr_debug("Failed to allocate new FS generic netlink message\n"); > + return -ENOMEM; > + } > + > + msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq), > + &fs_event_family, 0, FS_NL_C_EVENT); > + if (!msg_head) > + goto cleanup; > + > + if (event_id != FS_EVENT_NONE) > + if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id)) > + goto cancel; > + > + ret = compose_msg(skb, cbdata); > + if (ret) > + goto cancel; > + > + genlmsg_end(skb, msg_head); > + ret = genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT); > + if (ret && ret != -ENOBUFS && ret != -ESRCH) > + goto cleanup; > + > + return ret; > + > +cancel: > + genlmsg_cancel(skb, msg_head); > +cleanup: > + nlmsg_free(skb); > + return ret; > +} > +EXPORT_SYMBOL(fs_netlink_send_event); > + > +int fs_event_netlink_register(void) > +{ > + int ret; > + > + ret = genl_register_family(&fs_event_family); > + if (ret) > + pr_err("Failed to register FS 
netlink interface\n"); > + return ret; > +} > + > +void fs_event_netlink_unregister(void) > +{ > + genl_unregister_family(&fs_event_family); > +} > diff --git a/fs/namespace.c b/fs/namespace.c > index 82ef140..ec6e2ef 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt) > if (unlikely(mnt->mnt_pins.first)) > mnt_pin_kill(mnt); > fsnotify_vfsmount_delete(&mnt->mnt); > + fs_event_mount_dropped(&mnt->mnt); > dput(mnt->mnt.mnt_root); > deactivate_super(mnt->mnt.mnt_sb); > mnt_free_id(mnt); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index b4d71b5..b7dadd9 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -263,6 +263,10 @@ struct iattr { > * Includes for diskquotas. > */ > #include > +/* > + * Include for Generic File System Events Interface > + */ > +#include > > /* > * Maximum number of layers of fs stack. Needs to be limited to > @@ -1253,7 +1257,7 @@ struct super_block { > struct hlist_node s_instances; > unsigned int s_quota_types; /* Bitmask of supported quota types */ > struct quota_info s_dquot; /* Diskquota specific options */ > - > + struct fs_trace_info s_etrace; > struct sb_writers s_writers; > > char s_id[32]; /* Informational name */ > diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h > new file mode 100644 > index 0000000..83e22dd > --- /dev/null > +++ b/include/linux/fs_event.h > @@ -0,0 +1,72 @@ > +/* > + * Generic File System Events Interface > + * > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. 
> + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > +#ifndef _LINUX_GENERIC_FS_EVETS_ > +#define _LINUX_GENERIC_FS_EVETS_ EVETS? Also the define name usually corresponds to the header filename. > +#include > +#include > + > +/* > + * Currently supported event types > + */ > +#define FS_EVENT_GENERIC 0x001 > +#define FS_EVENT_THRESH 0x002 > + > +#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH) > + > +struct fs_trace_operations { > + void (*query)(struct super_block *, u64 *); > +}; > + > +struct fs_trace_info { > + void __rcu *e_priv; /* READ ONLY */ > + unsigned int events_cap_mask; /* Supported notifications */ > + const struct fs_trace_operations *ops; > +}; > + > +#ifdef CONFIG_FS_EVENTS > + > +void fs_event_notify(struct super_block *sb, unsigned int event_id); > +void fs_event_alloc_space(struct super_block *sb, u64 ncount); > +void fs_event_free_space(struct super_block *sb, u64 ncount); > +void fs_event_mount_dropped(struct vfsmount *mnt); > + > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msg)(struct sk_buff *skb, void *data), > + void *cbdata); > + > +#else /* CONFIG_FS_EVENTS */ > + > +static inline > +void fs_event_notify(struct super_block *sb, unsigned int event_id) {}; > +static inline > +void fs_event_alloc_space(struct super_block *sb, u64 ncount) {}; > +static inline > +void fs_event_free_space(struct super_block *sb, u64 ncount) {}; > +static inline > +void fs_event_mount_dropped(struct vfsmount *mnt) {}; > + > +static inline > +int fs_netlink_send_event(size_t size, unsigned int event_id, > + int (*compose_msig)(struct sk_buff *skb, void *data), > + void *cbdata) > +{ > + return -ENOSYS; > +} > +#endif /* CONFIG_FS_EVENTS */ > + > +#endif /* _LINUX_GENERIC_FS_EVENTS_ */ > + > diff 
--git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild > index 68ceb97..dae0fab 100644 > --- a/include/uapi/linux/Kbuild > +++ b/include/uapi/linux/Kbuild > @@ -129,6 +129,7 @@ header-y += firewire-constants.h > header-y += flat.h > header-y += fou.h > header-y += fs.h > +header-y += fs_event.h > header-y += fsl_hypervisor.h > header-y += fuse.h > header-y += futex.h > diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h > new file mode 100644 > index 0000000..d8b07da > --- /dev/null > +++ b/include/uapi/linux/fs_event.h > @@ -0,0 +1,58 @@ > +/* > + * Generic netlink support for Generic File System Events Interface > + * > + * Copyright(c) 2015 Samsung Electronics. All rights reserved. > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2. > + * > + * The full GNU General Public License is included in this distribution in the > + * file called COPYING. > + * > + * This program is distributed in the hope that it will be useful, but WITHOUT > + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or > + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for > + * more details. > + */ > +#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_ > +#define _UAPI_LINUX_GENERIC_FS_EVENTS_ Define name usually corresponds to the header filename. 
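For illustration, a guard that follows the filename include/uapi/linux/fs_event.h could be spelled as in the sketch below. The guard name _UAPI_LINUX_FS_EVENT_H is only an assumed spelling of the usual convention, not something the patch defines; the two string macros are copied from the patch.

```c
/*
 * Sketch only: include guard named after include/uapi/linux/fs_event.h.
 * The guard spelling is an assumption about the usual convention.
 */
#ifndef _UAPI_LINUX_FS_EVENT_H
#define _UAPI_LINUX_FS_EVENT_H

/* Family and multicast group names, as they appear in the patch. */
#define FS_EVENTS_FAMILY_NAME		"fs_event"
#define FS_EVENTS_MCAST_GRP_NAME	"fs_event_mc_grp"

#endif /* _UAPI_LINUX_FS_EVENT_H */
```

The same pattern would apply to include/linux/fs_event.h, with the closing comment kept in sync with the guard name.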
> +#define FS_EVENTS_FAMILY_NAME "fs_event" > +#define FS_EVENTS_MCAST_GRP_NAME "fs_event_mc_grp" > + > +/* > + * Generic netlink attribute types > + */ > +enum { > + FS_NL_A_NONE, > + FS_NL_A_EVENT_ID, > + FS_NL_A_DEV_MAJOR, > + FS_NL_A_DEV_MINOR, > + FS_NL_A_CAUSED_ID, > + FS_NL_A_DATA, > + __FS_NL_A_MAX, > +}; > +#define FS_NL_A_MAX (__FS_NL_A_MAX - 1) > +/* > + * Generic netlink commands > + */ > +#define FS_NL_C_EVENT 1 > + > +/* > + * Supported set of FS events > + */ > +enum { > + FS_EVENT_NONE, > + FS_WARN_ENOSPC, /* No space left to reserve data blks */ > + FS_WARN_ENOSPC_META, /* No space left for metadata */ > + FS_THR_LRBELOW, /* The threshold lower range has been reached */ > + FS_THR_LRABOVE, /* The threshold lower range re-activcated*/ > + FS_THR_URBELOW, > + FS_THR_URABOVE, > + FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */ > + FS_ERR_CORRUPTED /* Critical error - fs corrupted */ > + > +}; > + > +#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */ > + Best regards, -- Bartlomiej Zolnierkiewicz Samsung R&D Institute Poland Samsung Electronics From mboxrd@z Thu Jan 1 00:00:00 1970 Message-id: <55B9DEC9.8020506@samsung.com> Date: Thu, 30 Jul 2015 10:22:33 +0200 From: Beata Michalska To: Bartlomiej Zolnierkiewicz Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, greg@kroah.com, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, hughd@google.com, lczerner@redhat.com, hch@infradead.org,
linux-ext4@vger.kernel.org, linux-mm@kvack.org, kyungmin.park@samsung.com, kmpark@infradead.org Subject: Re: [RFC v3 1/4] fs: Add generic file system event notifications References: <1434460173-18427-1-git-send-email-b.michalska@samsung.com> <1434460173-18427-2-git-send-email-b.michalska@samsung.com> <6913836.Rhse3j9PM4@amdc1976> In-reply-to: <6913836.Rhse3j9PM4@amdc1976> Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On 07/22/2015 05:55 PM, Bartlomiej Zolnierkiewicz wrote: > > Hi, > > Some comments below. > > On Tuesday, June 16, 2015 03:09:30 PM Beata Michalska wrote: >> Introduce configurable generic interface for file >> system-wide event notifications, to provide file >> systems with a common way of reporting any potential >> issues as they emerge. >> >> The notifications are to be issued through generic >> netlink interface by newly introduced multicast group.
>> >> Threshold notifications have been included, allowing >> triggering an event whenever the amount of free space drops >> below a certain level - or levels to be more precise as two >> of them are being supported: the lower and the upper range. >> The notifications work both ways: once the threshold level >> has been reached, an event shall be generated whenever >> the number of available blocks goes up again re-activating >> the threshold. >> >> The interface has been exposed through a vfs. Once mounted, >> it serves as an entry point for the set-up where one can >> register for particular file system events. >> >> Signed-off-by: Beata Michalska >> --- >> Documentation/filesystems/events.txt | 232 ++++++++++ >> fs/Kconfig | 2 + >> fs/Makefile | 1 + >> fs/events/Kconfig | 7 + >> fs/events/Makefile | 5 + >> fs/events/fs_event.c | 809 ++++++++++++++++++++++++++++++++++ >> fs/events/fs_event.h | 22 + >> fs/events/fs_event_netlink.c | 104 +++++ >> fs/namespace.c | 1 + >> include/linux/fs.h | 6 +- >> include/linux/fs_event.h | 72 +++ >> include/uapi/linux/Kbuild | 1 + >> include/uapi/linux/fs_event.h | 58 +++ >> 13 files changed, 1319 insertions(+), 1 deletion(-) >> create mode 100644 Documentation/filesystems/events.txt >> create mode 100644 fs/events/Kconfig >> create mode 100644 fs/events/Makefile >> create mode 100644 fs/events/fs_event.c >> create mode 100644 fs/events/fs_event.h >> create mode 100644 fs/events/fs_event_netlink.c >> create mode 100644 include/linux/fs_event.h >> create mode 100644 include/uapi/linux/fs_event.h >> >> diff --git a/Documentation/filesystems/events.txt b/Documentation/filesystems/events.txt >> new file mode 100644 >> index 0000000..c2e6227 >> --- /dev/null >> +++ b/Documentation/filesystems/events.txt >> @@ -0,0 +1,232 @@ >> + >> + Generic file system event notification interface >> + >> +Document created 23 April 2015 by Beata Michalska >> + >> +1. 
The reason behind: >> +===================== >> + >> +There are many corner cases when things might get messy with the filesystems. >> +And it is not always obvious what and when went wrong. Sometimes you might >> +get some subtle hints that there is something going on - but by the time >> +you realise it, it might be too late as you are already out-of-space >> +or the filesystem has been remounted as read-only (i.e.). The generic >> +interface for the filesystem events fills the gap by providing a rather >> +easy way of real-time notifications triggered whenever something interesting >> +happens, allowing filesystems to report events in a common way, as they occur. >> + >> +2. How does it work: >> +==================== >> + >> +The interface itself has been exposed as fstrace-type Virtual File System, >> +primarily to ease the process of setting up the configuration for the >> +notifications. So for starters, it needs to get mounted (obviously): >> + >> + mount -t fstrace none /sys/fs/events >> + >> +This will unveil the single fstrace filesystem entry - the 'config' file, >> +through which the notification are being set-up. > > The patch creates a separate virtual filesystem for single file, > this is an overkill IMHO and a new sysfs or debugfs entry should > be sufficient. > >> + >> +Activating notifications for particular filesystem is as straightforward >> +as writing into the 'config' file. Note that by default all events, despite >> +the actual filesystem type, are being disregarded. >> + >> +Synopsis of config: >> +------------------ >> + >> + MOUNT EVENT_TYPE [L1] [L2] > > OTOH Why not use the advantages of having a separate virtual > filesystem and create separate directories for each mount point > (+ maybe even extra parent directories for mount namespaces) and > put separate entries for each event type in these directories. > > This would also allow usage of eventfd() notification interface > on such files. 
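For reference, the eventfd() mechanics mentioned above look roughly like this from userspace. This is a minimal sketch of the eventfd(2) counter only: the helper names (efd_signal, efd_wait, efd_demo) are made up for illustration, and the kernel side that would tie an eventfd to a per-mount, per-event file is not part of the patch and is not shown here.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: add n to the eventfd counter. 0 on success, -1 on error. */
int efd_signal(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Hypothetical helper: block until the counter is non-zero, then return
 * its value. Without EFD_SEMAPHORE the read also resets the counter to 0.
 * Returns 0 on error.
 */
uint64_t efd_wait(int efd)
{
	uint64_t val = 0;

	if (read(efd, &val, sizeof(val)) != (ssize_t)sizeof(val))
		return 0;
	return val;
}

/* Round trip: two signals are accumulated and collected by a single read. */
uint64_t efd_demo(void)
{
	int efd = eventfd(0, EFD_CLOEXEC);
	uint64_t got;

	if (efd < 0)
		return 0;
	efd_signal(efd, 1);
	efd_signal(efd, 2);
	got = efd_wait(efd);
	close(efd);
	return got;
}
```

A listener would normally poll() or select() on the descriptor instead of blocking in read(), which is what makes this attractive for watching several mounts at once.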
> > Please take a look at: > > tools/cgroup/cgroup_event_listener.c > > and > > Documentation/cgroups/memcg_test.txt (point 9.10) > > to see how much easier it is to observe memory usage thresholds > on memory cgroups compared to available blocks on filesystems > using fs events. > I'll give it some thought, as the solution you are proposing eliminates some issues related to generic netlink (mostly the one concerning network namespaces), though I'd rather avoid creating numerous entries for each mount/mount namespace. I guess the best option is to meet halfway. > Also while at it please add your example user-space code (posted > on request in some other mail) to tools/fs_events/ (preferably > in a separate patch). > Will do, once there is an overall agreement on the form of the events interface. >> + >> + MOUNT : the filesystem's mount point >> + EVENT_TYPE : event types - currently two of them are being supported: >> + >> + * generic events ("G") covering most common warnings >> + and errors that might be reported by any filesystem; >> + this option does not take any arguments; > > fs_event.h in uapi dir allows the following events: > > /* > * Supported set of FS events > */ > enum { > FS_EVENT_NONE, > FS_WARN_ENOSPC, /* No space left to reserve data blks */ > FS_WARN_ENOSPC_META, /* No space left for metadata */ > FS_THR_LRBELOW, /* The threshold lower range has been reached */ > FS_THR_LRABOVE, /* The threshold lower range re-activcated*/ > FS_THR_URBELOW, > FS_THR_URABOVE, > FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */ > FS_ERR_CORRUPTED /* Critical error - fs corrupted */ > > }; > > For non-threshold related events the current interface allows > only all or none of the events to be enabled, i.e. > you cannot selectively enable notification on FS_WARN_ENOSPC > but not on FS_ERR_REMOUNT_RO. > > I also think that the configuration interface should be made to > match the notification interface when it comes to event types.
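One way to picture the per-event selection suggested above is a mask with one bit per event id. The sketch below is hypothetical: the enum is copied from the uapi header in the patch, while FS_EVENT_BIT and the fs_mask_* helpers are invented here for illustration and exist neither in the patch nor in the kernel.

```c
#include <stdbool.h>

/* Event ids as listed in the patch's include/uapi/linux/fs_event.h. */
enum {
	FS_EVENT_NONE,
	FS_WARN_ENOSPC,
	FS_WARN_ENOSPC_META,
	FS_THR_LRBELOW,
	FS_THR_LRABOVE,
	FS_THR_URBELOW,
	FS_THR_URABOVE,
	FS_ERR_REMOUNT_RO,
	FS_ERR_CORRUPTED
};

/* Hypothetical: one bit per event id, so events can be enabled one by one. */
#define FS_EVENT_BIT(id)	(1U << (id))

/* Return the mask with the given event enabled. */
unsigned int fs_mask_enable(unsigned int mask, unsigned int id)
{
	return mask | FS_EVENT_BIT(id);
}

/* Return the mask with the given event disabled. */
unsigned int fs_mask_disable(unsigned int mask, unsigned int id)
{
	return mask & ~FS_EVENT_BIT(id);
}

/* True if a notification for this event id was requested. */
bool fs_mask_wants(unsigned int mask, unsigned int id)
{
	return (mask & FS_EVENT_BIT(id)) != 0;
}
```

With such a mask a watcher could ask for FS_WARN_ENOSPC alone, and the configuration would use the same event ids that later show up in FS_NL_A_EVENT_ID.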
> Will take it into consideration - thanks. >> + >> + * threshold notifications ("T") - events sent whenever >> + the amount of available space drops below certain level; >> + it is possible to specify two threshold levels though >> + only one is required to properly setup the notifications; >> + as those refer to the number of available blocks, the lower >> + level [L1] needs to be higher than the upper one [L2] > > Why is there a limitation of only two thresholds? > > It should be relatively easy to make the code support > unlimited number of thresholds. > >> + >> +Sample request could look like the following: >> + >> + echo /sample/mount/point G T 710000 500000 > /sys/fs/events/config >> + >> +Multiple request might be specified provided they are separated with semicolon. > > s/request/requests/ > > I think that allowing multiple event types and requests in one > configuration request is not a good idea. Currently parsing > code is relatively simple but once somebody decides to enhance > the interface with new event types the parsing code may get > complex & ugly. > Noted. >> + >> +The configuration itself might be modified at any time. One can add/remove >> +particular event types for given fielsystem, modify the threshold levels, > > s/fielsystem/filesystem/ > >> +and remove single or all entries from the 'config' file. >> + >> + - Adding new event type: >> + >> + $ echo MOUNT EVENT_TYPE > /sys/fs/events/config >> + >> +(Note that is is enough to provide the event type to be enabled without > > s/is is/is/ > >> +the already set ones.) 
>> + >> + - Removing event type: >> + >> + $ echo '!MOUNT EVENT_TYPE' > /sys/fs/events/config >> + >> + - Updating threshold limits: >> + >> + $ echo MOUNT T L1 L2 > /sys/fs/events/config >> + >> + - Removing single entry: >> + >> + $ echo '!MOUNT' > /sys/fs/events/config >> + >> + - Removing all entries: >> + >> + $ echo > /sys/fs/events/config >> + >> +Reading the file will list all registered entries with their current set-up >> +along with some additional info like the filesystem type and the backing device >> +name if available. >> + >> +Final, though a very important note on the configuration: when and if the >> +actual events are being triggered falls way beyond the scope of the generic >> +filesystem events interface. It is up to a particular filesystem >> +implementation which events are to be supported - if any at all. So if >> +given filesystem does not support the event notifications, an attempt to >> +enable those through 'config' file will fail. >> + >> + >> +3. The generic netlink interface support: >> +========================================= >> + >> +Whenever an event notification is triggered (by given filesystem) the current >> +configuration is being validated to decide whether a userpsace notification > > s/userpsace/userspace/ > >> +should be launched. If there has been no request (in a mean of 'config' file >> +entry) for given event, one will be silently disregarded. If, on the other >> +hand, someone is 'watching' given filesystem for specific events, a generic >> +netlink message will be sent. A dedicated multicast group has been provided >> +solely for this purpose so in order to receive such notifications, one should >> +subscribe to this new multicast group. As for now only the init network >> +namespace is being supported. >> + >> +3.1 Message format >> + >> +The FS_NL_C_EVENT shall be stored within the generic netlink message header >> +as the command field. 
The message payload will provide more detailed info: >> +the backing device major and minor numbers, the event code and the id of >> +the process which action led to the event occurrence. In case of threshold >> +notifications, the current number of available blocks will be included >> +in the payload as well. >> + >> + >> + 0 1 2 3 >> + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | NETLINK MESSAGE HEADER | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | GENERIC NETLINK MESSAGE HEADER | >> + | (with FS_NL_C_EVENT as genlmsghdr cdm field) | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | Optional user specific message header | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + | GENERIC MESSAGE PAYLOAD: | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_EVENT_ID (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DEV_MAJOR (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DEV_MINOR (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_CAUSED_ID (NLA_U32) | >> + +---------------------------------------------------------------+ >> + | FS_NL_A_DATA (NLA_U64) | >> + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ >> + >> + >> +The above figure is based on: >> + http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto#Message_Format >> + >> + >> +4. API Reference: >> +================= >> + >> + 4.1 Generic file system event interface data & operations >> + >> + #include >> + >> + struct fs_trace_info { >> + void __rcu *e_priv /* READ ONLY */ > > It should be marked as private for fs events core code and > not for use by filesystems' code. 
If possible it would be > the best to move it out of this struct, > >> + unsigned int events_cap_mask; /* Supported notifications */ >> + const struct fs_trace_operations *ops; >> + }; >> + >> + struct fs_trace_operations { >> + void (*query)(struct super_block *, u64 *); >> + }; >> + >> + In order to get the fireworks and stuff, each filesystem needs to setup >> + the events_cap_mask field of the fs_trace_info structure, which has been >> + embedded within the super_block structure. This should reflect the type of >> + events the filesystem wants to support. In case of threshold notifications, >> + apart from setting the FS_EVENT_THRESH flag, the 'query' callback should >> + be provided as this enables the events interface to get the up-to-date >> + state of the number of available blocks whenever those notifications are >> + being requested. >> + >> + The 'e_priv' field of the fs_trace_info structure should be completely ignored >> + as it's for INTERNAL USE ONLY. So don't even think of messing with it, if you >> + do not want to get yourself into some real trouble. If still, you are tempted >> + to do so - feel free, it's gonna be pure fun. Consider yourself warned. >> + >> + >> + 4.2 Event notification: >> + >> + #include >> + void fs_event_notify(struct super_block *sb, unsigned int event_id); >> + >> + Notify the generic FS event interface of an occurring event. >> + This shall be used by any file system that wishes to inform any potential >> + listeners/watchers of a particular event. >> + - sb: the filesystem's super block >> + - event_id: an event identifier >> + >> + 4.3 Threshold notifications: >> + >> + #include >> + void fs_event_alloc_space(struct super_block *sb, u64 ncount); >> + void fs_event_free_space(struct super_block *sb, u64 ncount); >> + >> + Each filesystme supporting the threshold notifications should call >> + fs_event_alloc_space/fs_event_free_space respectively whenever the >> + amount of available blocks changes. 
>> + - sb: the filesystem's super block >> + - ncount: number of blocks being acquired/released >> + >> + Note that to properly handle the threshold notifications the fs events >> + interface needs to be kept up to date by the filesystems. Each should >> + register fs_trace_operations to enable querying the current number of >> + available blocks. >> + >> + 4.4 Sending message through generic netlink interface >> + >> + #include >> + >> + int fs_netlink_send_event(size_t size, unsigned int event_id, >> + int (*compose_msg)(struct sk_buff *skb, void *data), void *cbdata); >> + >> + Although the fs event interface is fully responsible for sending the messages >> + over the netlink, filesystems might use the FS_EVENT multicast group to send >> + their own custom messages. >> + - size: the size of the message payload >> + - event_id: the event identifier >> + - compose_msg: a callback responsible for filling-in the message payload >> + - cbdata: message custom data >> + >> + Calling fs_netlink_send_event will result in a message being sent by >> + the FS_EVENT multicast group. Note that the body of the message should be >> + prepared (set-up )by the caller - through compose_msg callback. The message's > > (set-up) > >> + sk_buff will be allocated on behalf of the caller (thus the size parameter). >> + The compose_msg should only fill the payload with proper data. Unless >> + the event id is specified as FS_EVENT_NONE, it's value shall be added >> + to the payload prior to calling the compose_msg. >> + >> + >> diff --git a/fs/Kconfig b/fs/Kconfig >> index ec35851..a89e678 100644 >> --- a/fs/Kconfig >> +++ b/fs/Kconfig >> @@ -69,6 +69,8 @@ config FILE_LOCKING >> for filesystems like NFS and for the flock() system >> call. Disabling this option saves about 11k. 
>> >> +source "fs/events/Kconfig" >> + >> source "fs/notify/Kconfig" >> >> source "fs/quota/Kconfig" >> diff --git a/fs/Makefile b/fs/Makefile >> index a88ac48..bcb3048 100644 >> --- a/fs/Makefile >> +++ b/fs/Makefile >> @@ -126,3 +126,4 @@ obj-y += exofs/ # Multiple modules >> obj-$(CONFIG_CEPH_FS) += ceph/ >> obj-$(CONFIG_PSTORE) += pstore/ >> obj-$(CONFIG_EFIVAR_FS) += efivarfs/ >> +obj-$(CONFIG_FS_EVENTS) += events/ >> diff --git a/fs/events/Kconfig b/fs/events/Kconfig >> new file mode 100644 >> index 0000000..1c60195 >> --- /dev/null >> +++ b/fs/events/Kconfig >> @@ -0,0 +1,7 @@ >> +# Generic Files System events interface >> +config FS_EVENTS >> + bool "Generic filesystem events" >> + select NET >> + default y > > Do we really want to default to yes? > > [ If so then maybe we want to make the config option visible > only when EXPERT mode is enabled? ] > >> + help >> + Enable generic filesystem events interface > > Please enhance the help entry. > >> diff --git a/fs/events/Makefile b/fs/events/Makefile >> new file mode 100644 >> index 0000000..9c98337 >> --- /dev/null >> +++ b/fs/events/Makefile >> @@ -0,0 +1,5 @@ >> +# >> +# Makefile for the Linux Generic File System Event Interface >> +# >> + >> +obj-y := fs_event.o fs_event_netlink.o >> diff --git a/fs/events/fs_event.c b/fs/events/fs_event.c >> new file mode 100644 >> index 0000000..1037311 >> --- /dev/null >> +++ b/fs/events/fs_event.c >> @@ -0,0 +1,809 @@ >> +/* >> + * Generic File System Evens Interface >> + * >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. >> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. 
>> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + */ >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include "../pnode.h" >> +#include "fs_event.h" >> + >> +static LIST_HEAD(fs_trace_list); >> +static DEFINE_MUTEX(fs_trace_lock); >> + >> +static struct kmem_cache *fs_trace_cachep __read_mostly; >> + >> +static atomic_t stray_traces = ATOMIC_INIT(0); >> +static DECLARE_WAIT_QUEUE_HEAD(trace_wq); >> +/* >> + * Threshold notification state bits. >> + * Note the reverse as this refers to the number >> + * of available blocks. >> + */ >> +#define THRESH_LR_BELOW 0x0001 /* Falling below the lower range */ >> +#define THRESH_LR_BEYOND 0x0002 >> +#define THRESH_UR_BELOW 0x0004 >> +#define THRESH_UR_BEYOND 0x0008 /* Going beyond the upper range */ >> + >> +#define THRESH_LR_ON (THRESH_LR_BELOW | THRESH_LR_BEYOND) >> +#define THRESH_UR_ON (THRESH_UR_BELOW | THRESH_UR_BEYOND) >> + >> +#define FS_TRACE_ADD 0x100000 >> + >> +struct fs_trace_entry { >> + struct kref count; >> + atomic_t active; >> + struct super_block *sb; >> + unsigned int notify; >> + struct path mnt_path; >> + struct list_head node; >> + >> + struct fs_event_thresh { >> + u64 avail_space; >> + u64 lrange; >> + u64 urange; >> + unsigned int state; >> + } th; >> + struct rcu_head rcu_head; >> + spinlock_t lock; >> +}; >> + >> +static const match_table_t fs_etypes = { >> + { FS_EVENT_GENERIC, "G" }, >> + { FS_EVENT_THRESH, "T" }, >> + { 0, NULL }, >> +}; >> + >> +static inline int fs_trace_query_data(struct super_block *sb, >> + struct fs_trace_entry *en) >> +{ >> + if (sb->s_etrace.ops && sb->s_etrace.ops->query) { >> + sb->s_etrace.ops->query(sb, &en->th.avail_space); >> + return 0; >> + } >> + >> + return 
-EINVAL; >> +} >> + >> +static inline void fs_trace_entry_free(struct fs_trace_entry *en) > > I don't see a real need for this wrapper (it is used only once). > >> +{ >> + kmem_cache_free(fs_trace_cachep, en); >> +} >> + >> +static void fs_destroy_trace_entry(struct kref *en_ref) >> +{ >> + struct fs_trace_entry *en = container_of(en_ref, >> + struct fs_trace_entry, count); >> + >> + /* Last reference has been dropped */ >> + fs_trace_entry_free(en); >> + atomic_dec(&stray_traces); >> +} >> + >> +static void fs_trace_entry_put(struct fs_trace_entry *en) >> +{ >> + kref_put(&en->count, fs_destroy_trace_entry); >> +} >> + >> +static void fs_release_trace_entry(struct rcu_head *rcu_head) >> +{ >> + struct fs_trace_entry *en = container_of(rcu_head, >> + struct fs_trace_entry, >> + rcu_head); >> + /* >> + * As opposed to typical reference drop, this one is being >> + * called from the rcu callback. This is to make sure all >> + * readers have managed to safely grab the reference before >> + * the change to rcu pointer is visible to all and before >> + * the reference is dropped here. >> + */ >> + fs_trace_entry_put(en); >> +} >> + >> +static void fs_drop_trace_entry(struct fs_trace_entry *en) >> +{ >> + struct super_block *sb; >> + >> + lockdep_assert_held(&fs_trace_lock); >> + /* >> + * The trace entry might have already been removed >> + * from the list of active traces with the proper >> + * ref drop, though it was still in use handling >> + * one of the fs events. This means that the object >> + * has been already scheduled for being released. >> + * So leave... >> + */ >> + >> + if (!atomic_add_unless(&en->active, -1, 0)) >> + return; >> + /* >> + * At this point the trace entry is being marked as inactive >> + * so no new references will be allowed. >> + * Still it might be floating around somewhere >> + * so drop the reference when the rcu readers are done. 
>> + */ >> + spin_lock(&en->lock); >> + list_del(&en->node); >> + sb = en->sb; >> + en->sb = NULL; >> + spin_unlock(&en->lock); >> + >> + rcu_assign_pointer(sb->s_etrace.e_priv, NULL); >> + call_rcu(&en->rcu_head, fs_release_trace_entry); >> + /* It's safe now to drop the reference to the super */ >> + deactivate_super(sb); >> + atomic_inc(&stray_traces); >> +} >> + >> +static inline >> +struct fs_trace_entry *fs_trace_entry_get(struct fs_trace_entry *en) >> +{ >> + if (en) { >> + if (!kref_get_unless_zero(&en->count)) >> + return NULL; >> + /* Don't allow referencing inactive object */ >> + if (!atomic_read(&en->active)) { >> + fs_trace_entry_put(en); >> + return NULL; >> + } >> + } >> + return en; >> +} >> + >> +static struct fs_trace_entry *fs_trace_entry_get_rcu(struct super_block *sb) >> +{ >> + struct fs_trace_entry *en; >> + >> + if (!sb) >> + return NULL; >> + >> + rcu_read_lock(); >> + en = rcu_dereference(sb->s_etrace.e_priv); >> + en = fs_trace_entry_get(en); >> + rcu_read_unlock(); >> + >> + return en; >> +} >> + >> +static int fs_remove_trace_entry(struct super_block *sb) >> +{ >> + struct fs_trace_entry *en; >> + >> + en = fs_trace_entry_get_rcu(sb); >> + if (!en) >> + return -EINVAL; >> + >> + mutex_lock(&fs_trace_lock); >> + fs_drop_trace_entry(en); >> + mutex_unlock(&fs_trace_lock); >> + fs_trace_entry_put(en); >> + return 0; >> +} >> + >> +static void fs_remove_all_traces(void) >> +{ >> + struct fs_trace_entry *en, *guard; >> + >> + mutex_lock(&fs_trace_lock); >> + list_for_each_entry_safe(en, guard, &fs_trace_list, node) >> + fs_drop_trace_entry(en); >> + mutex_unlock(&fs_trace_lock); >> +} >> + >> +static int create_common_msg(struct sk_buff *skb, void *data) >> +{ >> + struct fs_trace_entry *en = (struct fs_trace_entry *)data; >> + struct super_block *sb = en->sb; >> + >> + if (nla_put_u32(skb, FS_NL_A_DEV_MAJOR, MAJOR(sb->s_dev)) >> + || nla_put_u32(skb, FS_NL_A_DEV_MINOR, MINOR(sb->s_dev))) >> + return -EINVAL; >> + >> + if (nla_put_u64(skb, 
FS_NL_A_CAUSED_ID, pid_vnr(task_pid(current)))) >> + return -EINVAL; >> + >> + return 0; >> +} >> + >> +static int create_thresh_msg(struct sk_buff *skb, void *data) >> +{ >> + struct fs_trace_entry *en = (struct fs_trace_entry *)data; >> + int ret; >> + >> + ret = create_common_msg(skb, data); >> + if (!ret) >> + ret = nla_put_u64(skb, FS_NL_A_DATA, en->th.avail_space); >> + return ret; >> +} >> + >> +static void fs_event_send(struct fs_trace_entry *en, unsigned int event_id) >> +{ >> + size_t size = nla_total_size(sizeof(u32)) * 2 + >> + nla_total_size(sizeof(u64)); >> + >> + fs_netlink_send_event(size, event_id, create_common_msg, en); >> +} >> + >> +static void fs_event_send_thresh(struct fs_trace_entry *en, >> + unsigned int event_id) >> +{ >> + size_t size = nla_total_size(sizeof(u32)) * 2 + >> + nla_total_size(sizeof(u64)) * 2; >> + >> + fs_netlink_send_event(size, event_id, create_thresh_msg, en); >> +} >> + >> +void fs_event_notify(struct super_block *sb, unsigned int event_id) >> +{ >> + struct fs_trace_entry *en; >> + >> + en = fs_trace_entry_get_rcu(sb); >> + if (!en) >> + return; >> + >> + spin_lock(&en->lock); >> + if (atomic_read(&en->active) && (en->notify & FS_EVENT_GENERIC)) >> + fs_event_send(en, event_id); >> + spin_unlock(&en->lock); >> + fs_trace_entry_put(en); >> +} >> +EXPORT_SYMBOL(fs_event_notify); >> + >> +void fs_event_alloc_space(struct super_block *sb, u64 ncount) >> +{ >> + struct fs_trace_entry *en; >> + s64 count; >> + >> + en = fs_trace_entry_get_rcu(sb); >> + if (!en) >> + return; >> + >> + spin_lock(&en->lock); >> + >> + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) >> + goto leave; >> + /* >> + * we shouldn't drop below 0 here, >> + * unless there is a sync issue somewhere (?) >> + */ >> + count = en->th.avail_space - ncount; >> + en->th.avail_space = count < 0 ? 
0 : count; >> + >> + if (en->th.avail_space > en->th.lrange) >> + /* Not 'even' close - leave */ >> + goto leave; >> + >> + if (en->th.avail_space > en->th.urange) { >> + /* Close enough - the lower range has been reached */ >> + if (!(en->th.state & THRESH_LR_BEYOND)) { >> + /* Send notification */ >> + fs_event_send_thresh(en, FS_THR_LRBELOW); >> + en->th.state &= ~THRESH_LR_BELOW; >> + en->th.state |= THRESH_LR_BEYOND; >> + } >> + goto leave; >> + } >> + if (!(en->th.state & THRESH_UR_BEYOND)) { >> + fs_event_send_thresh(en, FS_THR_URBELOW); >> + en->th.state &= ~THRESH_UR_BELOW; >> + en->th.state |= THRESH_UR_BEYOND; >> + } >> + >> +leave: >> + spin_unlock(&en->lock); >> + fs_trace_entry_put(en); >> +} >> +EXPORT_SYMBOL(fs_event_alloc_space); >> + >> +void fs_event_free_space(struct super_block *sb, u64 ncount) >> +{ >> + struct fs_trace_entry *en; >> + >> + en = fs_trace_entry_get_rcu(sb); >> + if (!en) >> + return; >> + >> + spin_lock(&en->lock); >> + >> + if (!atomic_read(&en->active) || !(en->notify & FS_EVENT_THRESH)) >> + goto leave; >> + >> + en->th.avail_space += ncount; >> + >> + if (en->th.avail_space > en->th.lrange) { >> + if (!(en->th.state & THRESH_LR_BELOW) >> + && en->th.state & THRESH_LR_BEYOND) { >> + /* Send notification */ >> + fs_event_send_thresh(en, FS_THR_LRABOVE); >> + en->th.state &= ~(THRESH_LR_BEYOND|THRESH_UR_BEYOND); >> + en->th.state |= THRESH_LR_BELOW; >> + goto leave; >> + } >> + } >> + if (en->th.avail_space > en->th.urange) { >> + if (!(en->th.state & THRESH_UR_BELOW) >> + && en->th.state & THRESH_UR_BEYOND) { >> + /* Notify */ >> + fs_event_send_thresh(en, FS_THR_URABOVE); >> + en->th.state &= ~THRESH_UR_BEYOND; >> + en->th.state |= THRESH_UR_BELOW; >> + } >> + } >> +leave: >> + spin_unlock(&en->lock); >> + fs_trace_entry_put(en); >> +} >> +EXPORT_SYMBOL(fs_event_free_space); >> + >> +void fs_event_mount_dropped(struct vfsmount *mnt) >> +{ >> + /* >> + * The mount is dropped but the super might not get released >> + * at once 
so there is very small chance some notifications >> + * will come through. >> + * Note that the mount being dropped here might belong to a different >> + * namespace - if this is the case, just ignore it. >> + */ >> + struct fs_trace_entry *en = fs_trace_entry_get_rcu(mnt->mnt_sb); >> + struct vfsmount *en_mnt; >> + >> + if (!en || !atomic_read(&en->active)) >> + return; >> + /* >> + * The entry once set, does not change the mountpoint it's being >> + * pinned to, so no need to take the lock here. >> + */ >> + en_mnt = en->mnt_path.mnt; >> + if (!(real_mount(mnt)->mnt_ns != (real_mount(en_mnt))->mnt_ns)) >> + fs_remove_trace_entry(mnt->mnt_sb); >> + fs_trace_entry_put(en); >> +} >> + >> +static int fs_new_trace_entry(struct path *path, struct fs_event_thresh *thresh, >> + unsigned int nmask) >> +{ >> + struct fs_trace_entry *en; >> + struct super_block *sb; >> + struct mount *r_mnt; >> + >> + en = kmem_cache_zalloc(fs_trace_cachep, GFP_KERNEL); >> + if (unlikely(!en)) >> + return -ENOMEM; >> + /* >> + * Note that no reference is being taken here for the path as it would >> + * make the unmount unnecessarily puzzling (due to an extra 'valid' >> + * reference for the mnt). >> + * This is *rather* safe as the notification on mount being dropped >> + * will get called prior to releasing the super block - so right >> + * in time to perform appropriate clean-up >> + */ >> + r_mnt = real_mount(path->mnt); >> + >> + en->mnt_path.dentry = r_mnt->mnt.mnt_root; >> + en->mnt_path.mnt = &r_mnt->mnt; >> + >> + sb = path->mnt->mnt_sb; >> + en->sb = sb; >> + /* >> + * Increase the refcount for sb to mark it's being relied on. >> + * Note that the reference to path is taken by the caller, so it >> + * is safe to assume there is at least single active reference >> + * to super as well. 
>> + */ >> + atomic_inc(&sb->s_active); >> + >> + nmask &= sb->s_etrace.events_cap_mask; >> + if (!nmask) >> + goto leave; >> + >> + spin_lock_init(&en->lock); >> + INIT_LIST_HEAD(&en->node); >> + >> + en->notify = nmask; >> + memcpy(&en->th, thresh, offsetof(struct fs_event_thresh, state)); >> + if (nmask & FS_EVENT_THRESH) >> + fs_trace_query_data(sb, en); >> + >> + kref_init(&en->count); >> + >> + if (rcu_access_pointer(sb->s_etrace.e_priv) != NULL) { >> + struct fs_trace_entry *prev_en; >> + >> + prev_en = fs_trace_entry_get_rcu(sb); >> + if (prev_en) { >> + WARN_ON(prev_en); >> + fs_trace_entry_put(prev_en); >> + goto leave; >> + } >> + } >> + atomic_set(&en->active, 1); >> + >> + mutex_lock(&fs_trace_lock); >> + list_add(&en->node, &fs_trace_list); >> + mutex_unlock(&fs_trace_lock); >> + >> + rcu_assign_pointer(sb->s_etrace.e_priv, en); >> + synchronize_rcu(); >> + >> + return 0; >> +leave: >> + deactivate_super(sb); >> + kmem_cache_free(fs_trace_cachep, en); >> + return -EINVAL; >> +} >> + >> +static int fs_update_trace_entry(struct path *path, >> + struct fs_event_thresh *thresh, >> + unsigned int nmask) >> +{ >> + struct fs_trace_entry *en; >> + struct super_block *sb; >> + int extend = nmask & FS_TRACE_ADD; >> + int ret = -EINVAL; >> + >> + en = fs_trace_entry_get_rcu(path->mnt->mnt_sb); >> + if (!en) >> + return (extend) ? 
fs_new_trace_entry(path, thresh, nmask) >> + : -EINVAL; >> + >> + if (!atomic_read(&en->active)) >> + return -EINVAL; >> + >> + nmask &= ~FS_TRACE_ADD; >> + >> + spin_lock(&en->lock); >> + sb = en->sb; >> + if (!sb || !(nmask & sb->s_etrace.events_cap_mask)) >> + goto leave; >> + >> + if (nmask & FS_EVENT_THRESH) { >> + if (extend) { >> + /* Get the current state */ >> + if (!(en->notify & FS_EVENT_THRESH)) >> + if (fs_trace_query_data(sb, en)) >> + goto leave; >> + >> + if (thresh->state & THRESH_LR_ON) { >> + en->th.lrange = thresh->lrange; >> + en->th.state &= ~THRESH_LR_ON; >> + } >> + >> + if (thresh->state & THRESH_UR_ON) { >> + en->th.urange = thresh->urange; >> + en->th.state &= ~THRESH_UR_ON; >> + } >> + } else { >> + memset(&en->th, 0, sizeof(en->th)); >> + } >> + } >> + >> + if (extend) >> + en->notify |= nmask; >> + else >> + en->notify &= ~nmask; >> + ret = 0; >> +leave: >> + spin_unlock(&en->lock); >> + fs_trace_entry_put(en); >> + return ret; >> +} >> + >> +static int fs_parse_trace_request(int argc, char **argv) >> +{ >> + struct fs_event_thresh thresh = {0}; >> + struct path path; >> + substring_t args[MAX_OPT_ARGS]; >> + unsigned int nmask = FS_TRACE_ADD; >> + int token; >> + char *s; >> + int ret = -EINVAL; >> + >> + if (!argc) { >> + fs_remove_all_traces(); >> + return 0; >> + } >> + >> + s = *(argv); >> + if (*s == '!') { >> + /* Clear the trace entry */ >> + nmask &= ~FS_TRACE_ADD; >> + ++s; >> + } >> + >> + if (kern_path_mountpoint(AT_FDCWD, s, &path, LOOKUP_FOLLOW)) >> + return -EINVAL; >> + >> + if (!(--argc)) { >> + if (!(nmask & FS_TRACE_ADD)) >> + ret = fs_remove_trace_entry(path.mnt->mnt_sb); >> + goto leave; >> + } >> + >> +repeat: >> + args[0].to = args[0].from = NULL; >> + token = match_token(*(++argv), fs_etypes, args); >> + if (!token && !nmask) >> + goto leave; >> + >> + nmask |= token & FS_EVENTS_ALL; >> + --argc; >> + if ((token & FS_EVENT_THRESH) && (nmask & FS_TRACE_ADD)) { >> + /* >> + * Get the threshold config data: >> + * 
lower range >> + * upper range >> + */ >> + if (!argc) >> + goto leave; >> + >> + ret = kstrtoull(*(++argv), 10, &thresh.lrange); >> + if (ret) >> + goto leave; >> + thresh.state |= THRESH_LR_ON; >> + if ((--argc)) { >> + ret = kstrtoull(*(++argv), 10, &thresh.urange); >> + if (ret) >> + goto leave; >> + thresh.state |= THRESH_UR_ON; >> + --argc; >> + } >> + /* The thresholds are based on number of available blocks */ >> + if (thresh.lrange < thresh.urange) { >> + ret = -EINVAL; >> + goto leave; >> + } >> + } >> + if (argc) >> + goto repeat; >> + >> + ret = fs_update_trace_entry(&path, &thresh, nmask); >> +leave: >> + path_put(&path); >> + return ret; >> +} >> + >> +#define DEFAULT_BUF_SIZE PAGE_SIZE >> + >> +static ssize_t fs_trace_write(struct file *file, const char __user *buffer, >> + size_t count, loff_t *ppos) >> +{ >> + char **argv; >> + char *kern_buf, *next, *cfg; >> + size_t size, dcount = 0; >> + int argc; >> + >> + if (!count) >> + return 0; >> + >> + kern_buf = kmalloc(DEFAULT_BUF_SIZE, GFP_KERNEL); >> + if (!kern_buf) >> + return -ENOMEM; >> + >> + while (dcount < count) { >> + >> + size = count - dcount; >> + if (size >= DEFAULT_BUF_SIZE) >> + size = DEFAULT_BUF_SIZE - 1; >> + if (copy_from_user(kern_buf, buffer + dcount, size)) { >> + dcount = -EINVAL; >> + goto leave; >> + } >> + >> + kern_buf[size] = '\0'; >> + >> + next = cfg = kern_buf; >> + >> + do { >> + next = strchr(cfg, ';'); >> + if (next) >> + *next = '\0'; >> + >> + argv = argv_split(GFP_KERNEL, cfg, &argc); >> + if (!argv) { >> + dcount = -ENOMEM; >> + goto leave; >> + } >> + >> + if (fs_parse_trace_request(argc, argv)) { >> + dcount = -EINVAL; >> + argv_free(argv); >> + goto leave; >> + } >> + >> + argv_free(argv); >> + if (next) >> + cfg = ++next; >> + >> + } while (next); >> + dcount += size; >> + } >> +leave: >> + kfree(kern_buf); >> + return dcount; >> +} >> + >> +static void *fs_trace_seq_start(struct seq_file *m, loff_t *pos) >> +{ >> + mutex_lock(&fs_trace_lock); >> + return 
seq_list_start(&fs_trace_list, *pos); >> +} >> + >> +static void *fs_trace_seq_next(struct seq_file *m, void *v, loff_t *pos) >> +{ >> + return seq_list_next(v, &fs_trace_list, pos); >> +} >> + >> +static void fs_trace_seq_stop(struct seq_file *m, void *v) >> +{ >> + mutex_unlock(&fs_trace_lock); >> +} >> + >> +static int fs_trace_seq_show(struct seq_file *m, void *v) >> +{ >> + struct fs_trace_entry *en; >> + struct super_block *sb; >> + struct mount *r_mnt; >> + const struct match_token *match; >> + unsigned int nmask; >> + >> + en = list_entry(v, struct fs_trace_entry, node); >> + /* Do not show the entries outside current mount namespace */ >> + r_mnt = real_mount(en->mnt_path.mnt); >> + if (r_mnt->mnt_ns != current->nsproxy->mnt_ns) { >> + if (!__is_local_mountpoint(r_mnt->mnt_mountpoint)) >> + return 0; >> + } >> + >> + sb = en->sb; >> + >> + seq_path(m, &en->mnt_path, "\t\n\\"); >> + seq_putc(m, ' '); >> + >> + seq_escape(m, sb->s_type->name, " \t\n\\"); >> + if (sb->s_subtype && sb->s_subtype[0]) { >> + seq_putc(m, '.'); >> + seq_escape(m, sb->s_subtype, " \t\n\\"); >> + } >> + >> + seq_putc(m, ' '); >> + if (sb->s_op->show_devname) { >> + sb->s_op->show_devname(m, en->mnt_path.mnt->mnt_root); >> + } else { >> + seq_escape(m, r_mnt->mnt_devname ? 
r_mnt->mnt_devname : "none", >> + " \t\n\\"); >> + } >> + seq_puts(m, " ("); >> + >> + nmask = en->notify; >> + for (match = fs_etypes; match->pattern; ++match) { >> + if (match->token & nmask) { >> + seq_puts(m, match->pattern); >> + nmask &= ~match->token; >> + if (nmask) >> + seq_putc(m, ','); >> + } >> + } >> + seq_printf(m, " %llu %llu", en->th.lrange, en->th.urange); >> + seq_puts(m, ")\n"); >> + return 0; >> +} >> + >> +static const struct seq_operations fs_trace_seq_ops = { >> + .start = fs_trace_seq_start, >> + .next = fs_trace_seq_next, >> + .stop = fs_trace_seq_stop, >> + .show = fs_trace_seq_show, >> +}; >> + >> +static int fs_trace_open(struct inode *inode, struct file *file) >> +{ >> + return seq_open(file, &fs_trace_seq_ops); >> +} >> + >> +static const struct file_operations fs_trace_fops = { >> + .owner = THIS_MODULE, >> + .open = fs_trace_open, >> + .write = fs_trace_write, >> + .read = seq_read, >> + .llseek = seq_lseek, >> + .release = seq_release, >> +}; >> + >> +static int fs_trace_init(void) >> +{ >> + fs_trace_cachep = KMEM_CACHE(fs_trace_entry, 0); >> + if (!fs_trace_cachep) >> + return -EINVAL; >> + init_waitqueue_head(&trace_wq); >> + return 0; >> +} >> + >> +/* VFS support */ >> +static int fs_trace_fill_super(struct super_block *sb, void *data, int silen) >> +{ >> + int ret; >> + static struct tree_descr desc[] = { >> + [2] = { >> + .name = "config", >> + .ops = &fs_trace_fops, >> + .mode = S_IWUSR | S_IRUGO, >> + }, >> + {""}, >> + }; >> + >> + ret = simple_fill_super(sb, 0x7246332, desc); > > Please use a define for a magic number. > >> + return !ret ? 
fs_trace_init() : ret; >> +} >> + >> +static struct dentry *fs_trace_do_mount(struct file_system_type *fs_type, >> + int ntype, const char *dev_name, void *data) >> +{ >> + return mount_single(fs_type, ntype, data, fs_trace_fill_super); >> +} >> + >> +static void fs_trace_kill_super(struct super_block *sb) >> +{ >> + /* >> + * The rcu_barrier here will/should make sure all call_rcu >> + * callbacks are completed - still there might be some active >> + * trace objects in use which can make calling the >> + * kmem_cache_destroy unsafe. So we wait until all traces >> + * are finally released. >> + */ >> + fs_remove_all_traces(); >> + rcu_barrier(); >> + wait_event(trace_wq, !atomic_read(&stray_traces)); >> + >> + kmem_cache_destroy(fs_trace_cachep); >> + kill_litter_super(sb); >> +} >> + >> +static struct kset *fs_trace_kset; >> + >> +static struct file_system_type fs_trace_fstype = { >> + .name = "fstrace", >> + .mount = fs_trace_do_mount, >> + .kill_sb = fs_trace_kill_super, >> +}; >> + >> +static void __init fs_trace_vfs_init(void) >> +{ >> + fs_trace_kset = kset_create_and_add("events", NULL, fs_kobj); >> + >> + if (!fs_trace_kset) >> + return; >> + >> + if (!register_filesystem(&fs_trace_fstype)) { >> + if (!fs_event_netlink_register()) >> + return; >> + unregister_filesystem(&fs_trace_fstype); >> + } >> + kset_unregister(fs_trace_kset); >> +} >> + >> +static int __init fs_trace_evens_init(void) >> +{ >> + fs_trace_vfs_init(); >> + return 0; >> +}; >> +module_init(fs_trace_evens_init); >> + >> diff --git a/fs/events/fs_event.h b/fs/events/fs_event.h >> new file mode 100644 >> index 0000000..23f24c8 >> --- /dev/null >> +++ b/fs/events/fs_event.h >> @@ -0,0 +1,22 @@ >> +/* >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. 
>> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. >> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + */ >> + >> +#ifndef __GENERIC_FS_EVENTS_H >> +#define __GENERIC_FS_EVENTS_H >> + >> +int fs_event_netlink_register(void); >> +void fs_event_netlink_unregister(void); >> + >> +#endif /* __GENERIC_FS_EVENTS_H */ >> diff --git a/fs/events/fs_event_netlink.c b/fs/events/fs_event_netlink.c >> new file mode 100644 >> index 0000000..0c97eb7 >> --- /dev/null >> +++ b/fs/events/fs_event_netlink.c >> @@ -0,0 +1,104 @@ >> +/* >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. >> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. >> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. 
>> + */ >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include "fs_event.h" >> + >> +static const struct genl_multicast_group fs_event_mcgroups[] = { >> + { .name = FS_EVENTS_MCAST_GRP_NAME, }, >> +}; >> + >> +static struct genl_family fs_event_family = { >> + .id = GENL_ID_GENERATE, >> + .name = FS_EVENTS_FAMILY_NAME, >> + .version = 1, >> + .maxattr = FS_NL_A_MAX, >> + .mcgrps = fs_event_mcgroups, >> + .n_mcgrps = ARRAY_SIZE(fs_event_mcgroups), >> +}; >> + >> +int fs_netlink_send_event(size_t size, unsigned int event_id, >> + int (*compose_msg)(struct sk_buff *skb, void *data), >> + void *cbdata) >> +{ >> + static atomic_t seq; >> + struct sk_buff *skb; >> + void *msg_head; >> + int ret = 0; >> + >> + if (!size || !compose_msg) >> + return -EINVAL; >> + >> + /* Skip if there are no listeners */ >> + if (!genl_has_listeners(&fs_event_family, &init_net, 0)) >> + return 0; >> + >> + if (event_id != FS_EVENT_NONE) >> + size += nla_total_size(sizeof(u32)); >> + size += nla_total_size(sizeof(u64)); >> + skb = genlmsg_new(size, GFP_NOWAIT); >> + >> + if (!skb) { >> + pr_debug("Failed to allocate new FS generic netlink message\n"); >> + return -ENOMEM; >> + } >> + >> + msg_head = genlmsg_put(skb, 0, atomic_add_return(1, &seq), >> + &fs_event_family, 0, FS_NL_C_EVENT); >> + if (!msg_head) >> + goto cleanup; >> + >> + if (event_id != FS_EVENT_NONE) >> + if (nla_put_u32(skb, FS_NL_A_EVENT_ID, event_id)) >> + goto cancel; >> + >> + ret = compose_msg(skb, cbdata); >> + if (ret) >> + goto cancel; >> + >> + genlmsg_end(skb, msg_head); >> + ret = genlmsg_multicast(&fs_event_family, skb, 0, 0, GFP_NOWAIT); >> + if (ret && ret != -ENOBUFS && ret != -ESRCH) >> + goto cleanup; >> + >> + return ret; >> + >> +cancel: >> + genlmsg_cancel(skb, msg_head); >> +cleanup: >> + nlmsg_free(skb); >> + return ret; >> +} >> +EXPORT_SYMBOL(fs_netlink_send_event); >> + >> +int fs_event_netlink_register(void) >> +{ >> + int ret; >> + >> + ret = 
genl_register_family(&fs_event_family); >> + if (ret) >> + pr_err("Failed to register FS netlink interface\n"); >> + return ret; >> +} >> + >> +void fs_event_netlink_unregister(void) >> +{ >> + genl_unregister_family(&fs_event_family); >> +} >> diff --git a/fs/namespace.c b/fs/namespace.c >> index 82ef140..ec6e2ef 100644 >> --- a/fs/namespace.c >> +++ b/fs/namespace.c >> @@ -1031,6 +1031,7 @@ static void cleanup_mnt(struct mount *mnt) >> if (unlikely(mnt->mnt_pins.first)) >> mnt_pin_kill(mnt); >> fsnotify_vfsmount_delete(&mnt->mnt); >> + fs_event_mount_dropped(&mnt->mnt); >> dput(mnt->mnt.mnt_root); >> deactivate_super(mnt->mnt.mnt_sb); >> mnt_free_id(mnt); >> diff --git a/include/linux/fs.h b/include/linux/fs.h >> index b4d71b5..b7dadd9 100644 >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -263,6 +263,10 @@ struct iattr { >> * Includes for diskquotas. >> */ >> #include >> +/* >> + * Include for Generic File System Events Interface >> + */ >> +#include >> >> /* >> * Maximum number of layers of fs stack. Needs to be limited to >> @@ -1253,7 +1257,7 @@ struct super_block { >> struct hlist_node s_instances; >> unsigned int s_quota_types; /* Bitmask of supported quota types */ >> struct quota_info s_dquot; /* Diskquota specific options */ >> - >> + struct fs_trace_info s_etrace; >> struct sb_writers s_writers; >> >> char s_id[32]; /* Informational name */ >> diff --git a/include/linux/fs_event.h b/include/linux/fs_event.h >> new file mode 100644 >> index 0000000..83e22dd >> --- /dev/null >> +++ b/include/linux/fs_event.h >> @@ -0,0 +1,72 @@ >> +/* >> + * Generic File System Events Interface >> + * >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. >> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. 
>> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + */ >> +#ifndef _LINUX_GENERIC_FS_EVETS_ >> +#define _LINUX_GENERIC_FS_EVETS_ > > EVETS? > > Also the define name usually corresponds to the header filename. > >> +#include >> +#include >> + >> +/* >> + * Currently supported event types >> + */ >> +#define FS_EVENT_GENERIC 0x001 >> +#define FS_EVENT_THRESH 0x002 >> + >> +#define FS_EVENTS_ALL (FS_EVENT_GENERIC | FS_EVENT_THRESH) >> + >> +struct fs_trace_operations { >> + void (*query)(struct super_block *, u64 *); >> +}; >> + >> +struct fs_trace_info { >> + void __rcu *e_priv; /* READ ONLY */ >> + unsigned int events_cap_mask; /* Supported notifications */ >> + const struct fs_trace_operations *ops; >> +}; >> + >> +#ifdef CONFIG_FS_EVENTS >> + >> +void fs_event_notify(struct super_block *sb, unsigned int event_id); >> +void fs_event_alloc_space(struct super_block *sb, u64 ncount); >> +void fs_event_free_space(struct super_block *sb, u64 ncount); >> +void fs_event_mount_dropped(struct vfsmount *mnt); >> + >> +int fs_netlink_send_event(size_t size, unsigned int event_id, >> + int (*compose_msg)(struct sk_buff *skb, void *data), >> + void *cbdata); >> + >> +#else /* CONFIG_FS_EVENTS */ >> + >> +static inline >> +void fs_event_notify(struct super_block *sb, unsigned int event_id) {}; >> +static inline >> +void fs_event_alloc_space(struct super_block *sb, u64 ncount) {}; >> +static inline >> +void fs_event_free_space(struct super_block *sb, u64 ncount) {}; >> +static inline >> +void fs_event_mount_dropped(struct vfsmount *mnt) {}; >> + >> +static inline >> +int fs_netlink_send_event(size_t size, unsigned int event_id, >> + int (*compose_msig)(struct sk_buff *skb, void *data), >> + void *cbdata) >> +{ >> + return -ENOSYS; >> +} >> +#endif /* 
CONFIG_FS_EVENTS */ >> + >> +#endif /* _LINUX_GENERIC_FS_EVENTS_ */ >> + >> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild >> index 68ceb97..dae0fab 100644 >> --- a/include/uapi/linux/Kbuild >> +++ b/include/uapi/linux/Kbuild >> @@ -129,6 +129,7 @@ header-y += firewire-constants.h >> header-y += flat.h >> header-y += fou.h >> header-y += fs.h >> +header-y += fs_event.h >> header-y += fsl_hypervisor.h >> header-y += fuse.h >> header-y += futex.h >> diff --git a/include/uapi/linux/fs_event.h b/include/uapi/linux/fs_event.h >> new file mode 100644 >> index 0000000..d8b07da >> --- /dev/null >> +++ b/include/uapi/linux/fs_event.h >> @@ -0,0 +1,58 @@ >> +/* >> + * Generic netlink support for Generic File System Events Interface >> + * >> + * Copyright(c) 2015 Samsung Electronics. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2. >> + * >> + * The full GNU General Public License is included in this distribution in the >> + * file called COPYING. >> + * >> + * This program is distributed in the hope that it will be useful, but WITHOUT >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or >> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for >> + * more details. >> + */ >> +#ifndef _UAPI_LINUX_GENERIC_FS_EVENTS_ >> +#define _UAPI_LINUX_GENERIC_FS_EVENTS_ > > Define name usually corresponds to the header filename. 
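For reference, a minimal sketch of a guard derived from the header's path (include/uapi/linux/fs_event.h). The macro name _UAPI_LINUX_FS_EVENT_H is an assumption based on the usual uapi naming pattern and has not been agreed on in this thread; the two name defines are copied from the patch only to give the guard something to wrap.

```c
/*
 * Sketch only: include/uapi/linux/fs_event.h with a filename-derived guard.
 * The guard name _UAPI_LINUX_FS_EVENT_H is an assumption, not taken from
 * the patch.
 */
#ifndef _UAPI_LINUX_FS_EVENT_H
#define _UAPI_LINUX_FS_EVENT_H

/* Family and multicast group names, as defined in the patch */
#define FS_EVENTS_FAMILY_NAME		"fs_event"
#define FS_EVENTS_MCAST_GRP_NAME	"fs_event_mc_grp"

#endif /* _UAPI_LINUX_FS_EVENT_H */
```

The same pattern would apply to include/linux/fs_event.h, where the closing #endif comment should also repeat the guard name exactly.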
>
>> +#define FS_EVENTS_FAMILY_NAME "fs_event"
>> +#define FS_EVENTS_MCAST_GRP_NAME "fs_event_mc_grp"
>> +
>> +/*
>> + * Generic netlink attribute types
>> + */
>> +enum {
>> +	FS_NL_A_NONE,
>> +	FS_NL_A_EVENT_ID,
>> +	FS_NL_A_DEV_MAJOR,
>> +	FS_NL_A_DEV_MINOR,
>> +	FS_NL_A_CAUSED_ID,
>> +	FS_NL_A_DATA,
>> +	__FS_NL_A_MAX,
>> +};
>> +#define FS_NL_A_MAX (__FS_NL_A_MAX - 1)
>> +/*
>> + * Generic netlink commands
>> + */
>> +#define FS_NL_C_EVENT 1
>> +
>> +/*
>> + * Supported set of FS events
>> + */
>> +enum {
>> +	FS_EVENT_NONE,
>> +	FS_WARN_ENOSPC, /* No space left to reserve data blks */
>> +	FS_WARN_ENOSPC_META, /* No space left for metadata */
>> +	FS_THR_LRBELOW, /* The threshold lower range has been reached */
>> +	FS_THR_LRABOVE, /* The threshold lower range re-activcated*/
>> +	FS_THR_URBELOW,
>> +	FS_THR_URABOVE,
>> +	FS_ERR_REMOUNT_RO, /* The file system has been remounted as RO */
>> +	FS_ERR_CORRUPTED /* Critical error - fs corrupted */
>> +
>> +};
>> +
>> +#endif /* _UAPI_LINUX_GENERIC_FS_EVENTS_ */
>> +
>
> Best regards,
> --
> Bartlomiej Zolnierkiewicz
> Samsung R&D Institute Poland
> Samsung Electronics
>
>

Thanks for your comments.

Best Regards
Beata