* [PATCH 0/2] Refactor snapshot vs nocow writers locking
@ 2020-01-30 12:59 Nikolay Borisov
2020-01-30 12:59 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov
` (4 more replies)
0 siblings, 5 replies; 26+ messages in thread
From: Nikolay Borisov @ 2020-01-30 12:59 UTC (permalink / raw)
To: linux-btrfs; +Cc: Nikolay Borisov
This patchset first factors out the open-coded logic that essentially
implements a lock allowing either multiple readers or multiple writers,
but not both, to hold it at the same time. Patch 2 then converts the
existing code to use the newly introduced lock.
The individual patch descriptions contain more information about the
technical details and the invariants that the lock provides.
I have also CC'ed a couple of the maintainers of the Linux memory model:
since my patches just factor out existing code, I would really like
someone proficient in the usage and semantics of memory barriers to
review it as well.
Nikolay Borisov (2):
btrfs: Implement DRW lock
btrfs: convert snapshot/nocow exlcusion to drw lock
fs/btrfs/Makefile | 2 +-
fs/btrfs/ctree.h | 10 ++----
fs/btrfs/disk-io.c | 39 ++---------------------
fs/btrfs/drw_lock.c | 71 ++++++++++++++++++++++++++++++++++++++++++
fs/btrfs/drw_lock.h | 23 ++++++++++++++
fs/btrfs/extent-tree.c | 35 ---------------------
fs/btrfs/file.c | 12 +++----
fs/btrfs/inode.c | 8 ++---
fs/btrfs/ioctl.c | 10 ++----
9 files changed, 114 insertions(+), 96 deletions(-)
create mode 100644 fs/btrfs/drw_lock.c
create mode 100644 fs/btrfs/drw_lock.h
--
2.17.1
^ permalink raw reply [flat|nested] 26+ messages in thread* [PATCH 1/2] btrfs: Implement DRW lock 2020-01-30 12:59 [PATCH 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov @ 2020-01-30 12:59 ` Nikolay Borisov 2020-01-30 13:41 ` Christoph Hellwig 2020-01-30 12:59 ` [PATCH 2/2] btrfs: convert snapshot/nocow exlcusion to drw lock Nikolay Borisov ` (3 subsequent siblings) 4 siblings, 1 reply; 26+ messages in thread From: Nikolay Borisov @ 2020-01-30 12:59 UTC (permalink / raw) To: linux-btrfs; +Cc: Nikolay Borisov A (D)ouble (R)eader (W)riter lock is a locking primitive that allows to have multiple readers or multiple writers but not multiple readers and writers holding it concurrently. The code is factored out from the existing open-coded locking scheme used to exclude pending snapshots from nocow writers and vice-versa. Current implementation actually favors Readers (that is snapshot creaters) to writers (nocow writers of the filesystem). Signed-off-by: Nikolay Borisov <nborisov@suse.com> --- fs/btrfs/Makefile | 2 +- fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/drw_lock.h | 23 +++++++++++++++ 3 files changed, 95 insertions(+), 1 deletion(-) create mode 100644 fs/btrfs/drw_lock.c create mode 100644 fs/btrfs/drw_lock.h diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index ca693dd554e9..dc60127791e6 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ - uuid-tree.o props.o free-space-tree.o tree-checker.o + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c 
new file mode 100644 index 000000000000..9681bf7544be --- /dev/null +++ b/fs/btrfs/drw_lock.c @@ -0,0 +1,71 @@ +#include "drw_lock.h" +#include "ctree.h" + +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) +{ + atomic_set(&lock->readers, 0); + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); + init_waitqueue_head(&lock->pending_readers); + init_waitqueue_head(&lock->pending_writers); +} + +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) +{ + percpu_counter_destroy(&lock->writers); +} + +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) +{ + if (atomic_read(&lock->readers)) + return false; + + percpu_counter_inc(&lock->writers); + + /* + * Ensure writers count is updated before we check for + * pending readers + */ + smp_mb(); + if (atomic_read(&lock->readers)) { + btrfs_drw_read_unlock(lock); + return false; + } + + return true; +} + +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) +{ + while(true) { + if (btrfs_drw_try_write_lock(lock)) + return; + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); + } +} + +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) +{ + percpu_counter_dec(&lock->writers); + cond_wake_up(&lock->pending_readers); +} + +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) +{ + atomic_inc(&lock->readers); + smp_mb__after_atomic(); + + wait_event(lock->pending_readers, + percpu_counter_sum(&lock->writers) == 0); +} + +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) +{ + /* + * Atomic RMW operations imply full barrier, so woken up writers + * are guaranteed to see the decrement + */ + if (atomic_dec_and_test(&lock->readers)) + wake_up(&lock->pending_writers); +} + + diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h new file mode 100644 index 000000000000..baff59561c06 --- /dev/null +++ b/fs/btrfs/drw_lock.h @@ -0,0 +1,23 @@ +#ifndef BTRFS_DRW_LOCK_H +#define BTRFS_DRW_LOCK_H + +#include <linux/atomic.h> +#include <linux/wait.h> +#include <linux/percpu_counter.h> + +struct 
btrfs_drw_lock { + atomic_t readers; + struct percpu_counter writers; + wait_queue_head_t pending_writers; + wait_queue_head_t pending_readers; +}; + +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); + +#endif -- 2.17.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock
  2020-01-30 12:59 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov
@ 2020-01-30 13:41 ` Christoph Hellwig
  2020-01-30 13:42   ` Nikolay Borisov
  0 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2020-01-30 13:41 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Thu, Jan 30, 2020 at 02:59:41PM +0200, Nikolay Borisov wrote:
> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows
> to have multiple readers or multiple writers but not multiple readers
> and writers holding it concurrently. The code is factored out from
> the existing open-coded locking scheme used to exclude pending
> snapshots from nocow writers and vice-versa. Current implementation
> actually favors Readers (that is snapshot creaters) to writers (nocow
> writers of the filesystem).

Any reason not to move it to lib/ under a new option so that other
users could reuse it as needed?

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock
  2020-01-30 13:41 ` Christoph Hellwig
@ 2020-01-30 13:42   ` Nikolay Borisov
  2020-01-30 13:43     ` Christoph Hellwig
  0 siblings, 1 reply; 26+ messages in thread
From: Nikolay Borisov @ 2020-01-30 13:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-btrfs

On 30.01.20 г. 15:41 ч., Christoph Hellwig wrote:
> On Thu, Jan 30, 2020 at 02:59:41PM +0200, Nikolay Borisov wrote:
>> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows
>> to have multiple readers or multiple writers but not multiple readers
>> and writers holding it concurrently. The code is factored out from
>> the existing open-coded locking scheme used to exclude pending
>> snapshots from nocow writers and vice-versa. Current implementation
>> actually favors Readers (that is snapshot creaters) to writers (nocow
>> writers of the filesystem).
>
> Any reason not to move it to lib/ under a new option so that other
> users could reuse it as needed?
>

Yes, I think it's rather specific to btrfs. I don't care honestly if
there is demand I wouldn't mind.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock
  2020-01-30 13:42 ` Nikolay Borisov
@ 2020-01-30 13:43   ` Christoph Hellwig
  0 siblings, 0 replies; 26+ messages in thread
From: Christoph Hellwig @ 2020-01-30 13:43 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Christoph Hellwig, linux-btrfs

On Thu, Jan 30, 2020 at 03:42:27PM +0200, Nikolay Borisov wrote:
>
> On 30.01.20 г. 15:41 ч., Christoph Hellwig wrote:
> > On Thu, Jan 30, 2020 at 02:59:41PM +0200, Nikolay Borisov wrote:
> >> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows
> >> to have multiple readers or multiple writers but not multiple readers
> >> and writers holding it concurrently. The code is factored out from
> >> the existing open-coded locking scheme used to exclude pending
> >> snapshots from nocow writers and vice-versa. Current implementation
> >> actually favors Readers (that is snapshot creaters) to writers (nocow
> >> writers of the filesystem).
> >
> > Any reason not to move it to lib/ under a new option so that other
> > users could reuse it as needed?
>
> Yes, I think it's rather specific to btrfs. I don't care honestly if
> there is demand I wouldn't mind.

Ok, just keep it in btrfs for now, we can always move it later. It
just looks like very generic code.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 2/2] btrfs: convert snapshot/nocow exlcusion to drw lock 2020-01-30 12:59 [PATCH 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov 2020-01-30 12:59 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov @ 2020-01-30 12:59 ` Nikolay Borisov 2020-01-30 12:59 ` [RESEND PATCH v2 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov ` (2 subsequent siblings) 4 siblings, 0 replies; 26+ messages in thread From: Nikolay Borisov @ 2020-01-30 12:59 UTC (permalink / raw) To: linux-btrfs; +Cc: Nikolay Borisov This patch removes all haphazard code implementing nocow writers exclusion from pending snapshot creation and switches to using the drw lock to ensure this invariant still holds. "Readers" are snapshot creators from create_snapshot and 'writers' are nocow writers from buffered write path or btrfs_setsize. This locking scheme allows for multiple snapshots to happen while any nocow writers are blocked, since writes to page cache in the nocow path will make snapshots inconsistent. So for performance reasons we'd like to have the ability to run multiple concurrent snapshots and also favors readers in this case. And in case there aren't pending snapshots (which will be the majority of the cases) we rely on the percpu's writers counter to avoid cacheline contention. The main gain from using the drw is it's now a lot easier to reason about the guarantees of the locking scheme and whether there is some silent breakage lurking. 
Signed-off-by: Nikolay Borisov <nborisov@suse.com> --- fs/btrfs/ctree.h | 10 +++------- fs/btrfs/disk-io.c | 39 +++------------------------------------ fs/btrfs/extent-tree.c | 35 ----------------------------------- fs/btrfs/file.c | 12 ++++++------ fs/btrfs/inode.c | 8 ++++---- fs/btrfs/ioctl.c | 10 +++------- 6 files changed, 19 insertions(+), 95 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index a66ed58058d9..fa8a2e15c77c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -32,6 +32,7 @@ #include "extent_io.h" #include "extent_map.h" #include "async-thread.h" +#include "drw_lock.h" struct btrfs_trans_handle; struct btrfs_transaction; @@ -1174,11 +1175,6 @@ static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb) return sb->s_fs_info; } -struct btrfs_subvolume_writers { - struct percpu_counter counter; - wait_queue_head_t wait; -}; - /* * The state of btrfs root */ @@ -1350,8 +1346,8 @@ struct btrfs_root { * root_item_lock. */ int dedupe_in_progress; - struct btrfs_subvolume_writers *subv_writers; - atomic_t will_be_snapshotted; + struct btrfs_drw_lock snapshot_lock; + atomic_t snapshot_force_cow; /* For qgroup metadata reserved space */ diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 05f215b4d060..ece45e606846 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1125,32 +1125,6 @@ void btrfs_clean_tree_block(struct extent_buffer *buf) } } -static struct btrfs_subvolume_writers *btrfs_alloc_subvolume_writers(void) -{ - struct btrfs_subvolume_writers *writers; - int ret; - - writers = kmalloc(sizeof(*writers), GFP_NOFS); - if (!writers) - return ERR_PTR(-ENOMEM); - - ret = percpu_counter_init(&writers->counter, 0, GFP_NOFS); - if (ret < 0) { - kfree(writers); - return ERR_PTR(ret); - } - - init_waitqueue_head(&writers->wait); - return writers; -} - -static void -btrfs_free_subvolume_writers(struct btrfs_subvolume_writers *writers) -{ - percpu_counter_destroy(&writers->counter); - kfree(writers); -} - static void 
__setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { @@ -1198,7 +1172,6 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, atomic_set(&root->log_writers, 0); atomic_set(&root->log_batch, 0); refcount_set(&root->refs, 1); - atomic_set(&root->will_be_snapshotted, 0); atomic_set(&root->snapshot_force_cow, 0); atomic_set(&root->nr_swapfiles, 0); root->log_transid = 0; @@ -1487,7 +1460,6 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root, int btrfs_init_fs_root(struct btrfs_root *root) { int ret; - struct btrfs_subvolume_writers *writers; root->free_ino_ctl = kzalloc(sizeof(*root->free_ino_ctl), GFP_NOFS); root->free_ino_pinned = kzalloc(sizeof(*root->free_ino_pinned), @@ -1497,12 +1469,8 @@ int btrfs_init_fs_root(struct btrfs_root *root) goto fail; } - writers = btrfs_alloc_subvolume_writers(); - if (IS_ERR(writers)) { - ret = PTR_ERR(writers); - goto fail; - } - root->subv_writers = writers; + + btrfs_drw_lock_init(&root->snapshot_lock); btrfs_init_free_ino_ctl(root); spin_lock_init(&root->ino_cache_lock); @@ -3873,8 +3841,7 @@ void btrfs_free_fs_root(struct btrfs_root *root) WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); if (root->anon_dev) free_anon_bdev(root->anon_dev); - if (root->subv_writers) - btrfs_free_subvolume_writers(root->subv_writers); + btrfs_drw_lock_destroy(&root->snapshot_lock); free_extent_buffer(root->node); free_extent_buffer(root->commit_root); kfree(root->free_ino_ctl); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 5a11e4988243..3564bae0434d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -11322,41 +11322,6 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range) * operations while snapshotting is ongoing and that cause the snapshot to be * inconsistent (writes followed by expanding truncates for example). 
*/ -void btrfs_end_write_no_snapshotting(struct btrfs_root *root) -{ - percpu_counter_dec(&root->subv_writers->counter); - cond_wake_up(&root->subv_writers->wait); -} - -int btrfs_start_write_no_snapshotting(struct btrfs_root *root) -{ - if (atomic_read(&root->will_be_snapshotted)) - return 0; - - percpu_counter_inc(&root->subv_writers->counter); - /* - * Make sure counter is updated before we check for snapshot creation. - */ - smp_mb(); - if (atomic_read(&root->will_be_snapshotted)) { - btrfs_end_write_no_snapshotting(root); - return 0; - } - return 1; -} - -void btrfs_wait_for_snapshot_creation(struct btrfs_root *root) -{ - while (true) { - int ret; - - ret = btrfs_start_write_no_snapshotting(root); - if (ret) - break; - wait_var_event(&root->will_be_snapshotted, - !atomic_read(&root->will_be_snapshotted)); - } -} void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg) { diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 5370152ea7e3..b9f01efff276 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -26,6 +26,7 @@ #include "volumes.h" #include "qgroup.h" #include "compression.h" +#include "drw_lock.h" static struct kmem_cache *btrfs_inode_defrag_cachep; /* @@ -1554,8 +1555,7 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos, u64 num_bytes; int ret; - ret = btrfs_start_write_no_snapshotting(root); - if (!ret) + if (!btrfs_drw_try_write_lock(&root->snapshot_lock)) return -EAGAIN; lockstart = round_down(pos, fs_info->sectorsize); @@ -1570,7 +1570,7 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos, NULL, NULL, NULL); if (ret <= 0) { ret = 0; - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); } else { *write_bytes = min_t(size_t, *write_bytes , num_bytes - pos + lockstart); @@ -1675,7 +1675,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, data_reserved, pos, write_bytes); else - btrfs_end_write_no_snapshotting(root); + 
btrfs_drw_write_unlock(&root->snapshot_lock); break; } @@ -1769,7 +1769,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, release_bytes = 0; if (only_release_metadata) - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); if (only_release_metadata && copied > 0) { lockstart = round_down(pos, @@ -1799,7 +1799,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, if (release_bytes) { if (only_release_metadata) { - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); btrfs_delalloc_release_metadata(BTRFS_I(inode), release_bytes, true); } else { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f5e19ba27bdc..00118805ef00 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5102,16 +5102,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr) * truncation, it must capture all writes that happened before * this truncation. */ - btrfs_wait_for_snapshot_creation(root); + btrfs_drw_write_lock(&root->snapshot_lock); ret = btrfs_cont_expand(inode, oldsize, newsize); if (ret) { - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); return ret; } trans = btrfs_start_transaction(root, 1); if (IS_ERR(trans)) { - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); return PTR_ERR(trans); } @@ -5119,7 +5119,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr) btrfs_ordered_update_i_size(inode, i_size_read(inode), NULL); pagecache_isize_extended(inode, oldsize, newsize); ret = btrfs_update_inode(trans, root, inode); - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); btrfs_end_transaction(trans); } else { diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 803577d42518..e35f1b14d772 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -791,11 +791,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, * 
possible. This is to avoid later writeback (running dealloc) to * fallback to COW mode and unexpectedly fail with ENOSPC. */ - atomic_inc(&root->will_be_snapshotted); - smp_mb__after_atomic(); - /* wait for no snapshot writes */ - wait_event(root->subv_writers->wait, - percpu_counter_sum(&root->subv_writers->counter) == 0); + btrfs_drw_read_lock(&root->snapshot_lock); ret = btrfs_start_delalloc_snapshot(root); if (ret) @@ -875,8 +871,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, dec_and_free: if (snapshot_force_cow) atomic_dec(&root->snapshot_force_cow); - if (atomic_dec_and_test(&root->will_be_snapshotted)) - wake_up_var(&root->will_be_snapshotted); + btrfs_drw_read_unlock(&root->snapshot_lock); + free_pending: kfree(pending_snapshot->root_item); btrfs_free_path(pending_snapshot->path); -- 2.17.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [RESEND PATCH v2 0/2] Refactor snapshot vs nocow writers locking
  2020-01-30 12:59 [PATCH 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov
  2020-01-30 12:59 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov
  2020-01-30 12:59 ` [PATCH 2/2] btrfs: convert snapshot/nocow exlcusion to drw lock Nikolay Borisov
@ 2020-01-30 12:59 ` Nikolay Borisov
  2020-01-30 14:21   ` David Sterba
  2020-01-30 12:59 ` [PATCH v2 1/2] btrfs: Implement DRW lock Nikolay Borisov
  2020-01-30 12:59 ` [PATCH v2 2/2] btrfs: convert snapshot/nocow exlcusion to drw lock Nikolay Borisov
  4 siblings, 1 reply; 26+ messages in thread
From: Nikolay Borisov @ 2020-01-30 12:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Nikolay Borisov

Hello,

Here is the second version of the DRW lock for btrfs. Main changes from v1:

* Fixed all checkpatch warnings (Andrea Parri)
* Properly call write_unlock in btrfs_drw_try_write_lock (Filipe Manana)
* Comment fix.
* Stress tested it via locktorture. Survived for 8 straight days on a
  4 socket 48 thread machine.

I'm resending this one since it's been quite a while. Additionally,
Valentin Schneider has created a more accurate spec of the algorithm
than the one I had initially submitted. It can be found here:

https://lore.kernel.org/lkml/2dcaf81c-f0d3-409e-cb29-733d8b3b4cc9@arm.com/

I've been running with those patches on my custom tree for a couple of
months and haven't observed any issues.

Nikolay Borisov (2):
  btrfs: Implement DRW lock
  btrfs: convert snapshot/nocow exlcusion to drw lock

 fs/btrfs/ctree.h       | 10 ++---
 fs/btrfs/disk-io.c     | 46 ++++++----------------
 fs/btrfs/extent-tree.c | 44 ---------------------
 fs/btrfs/file.c        | 11 +++---
 fs/btrfs/inode.c       |  8 ++--
 fs/btrfs/ioctl.c       | 10 ++---
 fs/btrfs/locking.c     | 87 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/locking.h     | 21 ++++++++++
 8 files changed, 134 insertions(+), 103 deletions(-)

--
2.17.1

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [RESEND PATCH v2 0/2] Refactor snapshot vs nocow writers locking
  2020-01-30 12:59 ` [RESEND PATCH v2 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov
@ 2020-01-30 14:21 ` David Sterba
  2020-02-21 14:51   ` David Sterba
  0 siblings, 1 reply; 26+ messages in thread
From: David Sterba @ 2020-01-30 14:21 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Thu, Jan 30, 2020 at 02:59:43PM +0200, Nikolay Borisov wrote:
> Hello,
>
> Here is the second version of the DRW lock for btrfs. Main changes from v1:

As the code is wrapping the existing code I think we're good to add it
to misc-next soon so we have more coverage.

Please add comments with the lock semantics overview and to all the
public functions. Having more assertions would be also good, like the
tree locks do. Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [RESEND PATCH v2 0/2] Refactor snapshot vs nocow writers locking
  2020-01-30 14:21 ` David Sterba
@ 2020-02-21 14:51   ` David Sterba
  0 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2020-02-21 14:51 UTC (permalink / raw)
  To: dsterba, Nikolay Borisov, linux-btrfs

On Thu, Jan 30, 2020 at 03:21:44PM +0100, David Sterba wrote:
> Please add comments with the lock semantics overview and to all the
> public functions. Having more assertions would be also good, like the
> tree locks do. Thanks.

The patches started to conflict with others so I've removed it from
for-next until the updated version with comments arrives.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH v2 1/2] btrfs: Implement DRW lock 2020-01-30 12:59 [PATCH 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov ` (2 preceding siblings ...) 2020-01-30 12:59 ` [RESEND PATCH v2 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov @ 2020-01-30 12:59 ` Nikolay Borisov 2020-01-30 13:37 ` Nikolay Borisov 2020-01-30 12:59 ` [PATCH v2 2/2] btrfs: convert snapshot/nocow exlcusion to drw lock Nikolay Borisov 4 siblings, 1 reply; 26+ messages in thread From: Nikolay Borisov @ 2020-01-30 12:59 UTC (permalink / raw) To: linux-btrfs; +Cc: Nikolay Borisov A (D)ouble (R)eader (W)riter lock is a locking primitive that allows to have multiple readers or multiple writers but not multiple readers and writers holding it concurrently. The code is factored out from the existing open-coded locking scheme used to exclude pending snapshots from nocow writers and vice-versa. Current implementation actually favors Readers (that is snapshot creaters) to writers (nocow writers of the filesystem). 
Signed-off-by: Nikolay Borisov <nborisov@suse.com> --- fs/btrfs/ctree.h | 1 + fs/btrfs/locking.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/locking.h | 21 +++++++++++ 3 files changed, 109 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index f90b82050d2d..908430f563fa 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -33,6 +33,7 @@ #include "extent_map.h" #include "async-thread.h" #include "block-rsv.h" +#include "locking.h" struct btrfs_trans_handle; struct btrfs_transaction; diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c index 571c4826c428..66d7d1279535 100644 --- a/fs/btrfs/locking.c +++ b/fs/btrfs/locking.c @@ -523,3 +523,90 @@ void btrfs_unlock_up_safe(struct btrfs_path *path, int level) path->locks[i] = 0; } } + +int btrfs_drw_lock_init(struct btrfs_drw_lock *lock) +{ + int ret; + + ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL); + if (ret) + return ret; + + atomic_set(&lock->readers, 0); + init_waitqueue_head(&lock->pending_readers); + init_waitqueue_head(&lock->pending_writers); + + return 0; +} +EXPORT_SYMBOL(btrfs_drw_lock_init); + +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) +{ + percpu_counter_destroy(&lock->writers); +} + +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) +{ + if (atomic_read(&lock->readers)) + return false; + + percpu_counter_inc(&lock->writers); + + /* + * Ensure writers count is updated before we check for + * pending readers + */ + smp_mb(); + if (atomic_read(&lock->readers)) { + btrfs_drw_write_unlock(lock); + return false; + } + + return true; +} +EXPORT_SYMBOL(btrfs_drw_try_write_lock); + +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) +{ + while (true) { + if (btrfs_drw_try_write_lock(lock)) + return; + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); + } +} +EXPORT_SYMBOL(btrfs_drw_write_lock); + +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) +{ + percpu_counter_dec(&lock->writers); + 
cond_wake_up(&lock->pending_readers); +} +EXPORT_SYMBOL(btrfs_drw_write_unlock); + +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) +{ + atomic_inc(&lock->readers); + + /* + * Ensure the pending reader count is perceieved BEFORE this reader + * goes to sleep in case of active writers. This guarantees new writers + * won't be allowed and that the current reader will be woken up when + * the last active writer finishes its jobs. + */ + smp_mb__after_atomic(); + + wait_event(lock->pending_readers, + percpu_counter_sum(&lock->writers) == 0); +} +EXPORT_SYMBOL(btrfs_drw_read_lock); + +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) +{ + /* + * atomic_dec_and_test implies a full barrier, so woken up writers + * are guaranteed to see the decrement + */ + if (atomic_dec_and_test(&lock->readers)) + wake_up(&lock->pending_writers); +} +EXPORT_SYMBOL(btrfs_drw_read_unlock); diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h index 21a285883e89..ba60318c53d5 100644 --- a/fs/btrfs/locking.h +++ b/fs/btrfs/locking.h @@ -7,12 +7,17 @@ #define BTRFS_LOCKING_H #include "extent_io.h" +#include <linux/atomic.h> +#include <linux/wait.h> +#include <linux/percpu_counter.h> #define BTRFS_WRITE_LOCK 1 #define BTRFS_READ_LOCK 2 #define BTRFS_WRITE_LOCK_BLOCKING 3 #define BTRFS_READ_LOCK_BLOCKING 4 +struct btrfs_path; + void btrfs_tree_lock(struct extent_buffer *eb); void btrfs_tree_unlock(struct extent_buffer *eb); @@ -48,4 +53,20 @@ static inline void btrfs_tree_unlock_rw(struct extent_buffer *eb, int rw) BUG(); } + +struct btrfs_drw_lock { + atomic_t readers; + struct percpu_counter writers; + wait_queue_head_t pending_writers; + wait_queue_head_t pending_readers; +}; + +int btrfs_drw_lock_init(struct btrfs_drw_lock *lock); +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); +void 
btrfs_drw_read_lock(struct btrfs_drw_lock *lock); +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); + #endif -- 2.17.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v2 1/2] btrfs: Implement DRW lock
  2020-01-30 12:59 ` [PATCH v2 1/2] btrfs: Implement DRW lock Nikolay Borisov
@ 2020-01-30 13:37 ` Nikolay Borisov
  2020-01-30 14:06   ` David Sterba
  0 siblings, 1 reply; 26+ messages in thread
From: Nikolay Borisov @ 2020-01-30 13:37 UTC (permalink / raw)
  To: linux-btrfs

On 30.01.20 г. 14:59 ч., Nikolay Borisov wrote:
> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows
> to have multiple readers or multiple writers but not multiple readers
> and writers holding it concurrently. The code is factored out from
> the existing open-coded locking scheme used to exclude pending
> snapshots from nocow writers and vice-versa. Current implementation
> actually favors Readers (that is snapshot creaters) to writers (nocow
> writers of the filesystem).
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>
> ---
>  fs/btrfs/ctree.h   |  1 +
>  fs/btrfs/locking.c | 87 ++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/locking.h | 21 +++++++++++
>  3 files changed, 109 insertions(+)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f90b82050d2d..908430f563fa 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -33,6 +33,7 @@
>  #include "extent_map.h"
>  #include "async-thread.h"
>  #include "block-rsv.h"
> +#include "locking.h"
>
>  struct btrfs_trans_handle;
>  struct btrfs_transaction;
> diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
> index 571c4826c428..66d7d1279535 100644
> --- a/fs/btrfs/locking.c
> +++ b/fs/btrfs/locking.c
> @@ -523,3 +523,90 @@ void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>  		path->locks[i] = 0;
>  	}
>  }
> +
> +int btrfs_drw_lock_init(struct btrfs_drw_lock *lock)
> +{
> +	int ret;
> +
> +	ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
> +	if (ret)
> +		return ret;
> +
> +	atomic_set(&lock->readers, 0);
> +	init_waitqueue_head(&lock->pending_readers);
> +	init_waitqueue_head(&lock->pending_writers);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(btrfs_drw_lock_init);

I have the functions EXPORT_SYMBOL since I have an internal patch which
is hooking this code to locktorture. SO they can be removed.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH v2 1/2] btrfs: Implement DRW lock
  2020-01-30 13:37 ` Nikolay Borisov
@ 2020-01-30 14:06   ` David Sterba
  2020-01-30 14:11     ` Nikolay Borisov
  0 siblings, 1 reply; 26+ messages in thread
From: David Sterba @ 2020-01-30 14:06 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Thu, Jan 30, 2020 at 03:37:26PM +0200, Nikolay Borisov wrote:
> > +int btrfs_drw_lock_init(struct btrfs_drw_lock *lock)
> > +{
> > +	int ret;
> > +
> > +	ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
> > +	if (ret)
> > +		return ret;
> > +
> > +	atomic_set(&lock->readers, 0);
> > +	init_waitqueue_head(&lock->pending_readers);
> > +	init_waitqueue_head(&lock->pending_writers);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(btrfs_drw_lock_init);
>
> I have the functions EXPORT_SYMBOL since I have an internal patch which
> is hooking this code to locktorture. SO they can be removed.

You can make the exports conditional, #ifdef LOCKTORTURE.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH v2 1/2] btrfs: Implement DRW lock
  2020-01-30 14:06 ` David Sterba
@ 2020-01-30 14:11   ` Nikolay Borisov
  0 siblings, 0 replies; 26+ messages in thread
From: Nikolay Borisov @ 2020-01-30 14:11 UTC (permalink / raw)
  To: dsterba, linux-btrfs

On 30.01.20 г. 16:06 ч., David Sterba wrote:
> On Thu, Jan 30, 2020 at 03:37:26PM +0200, Nikolay Borisov wrote:
>>> +int btrfs_drw_lock_init(struct btrfs_drw_lock *lock)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	atomic_set(&lock->readers, 0);
>>> +	init_waitqueue_head(&lock->pending_readers);
>>> +	init_waitqueue_head(&lock->pending_writers);
>>> +
>>> +	return 0;
>>> +}
>>> +EXPORT_SYMBOL(btrfs_drw_lock_init);
>>
>> I have the functions EXPORT_SYMBOL since I have an internal patch which
>> is hooking this code to locktorture. SO they can be removed.
>
> You can make the exports conditional, #ifdef LOCKTORTURE.
>

I don't think I will be submitting that patch unless the lock is moved
to lib/. I guess I will just remove it in v3 after review.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH v2 2/2] btrfs: convert snapshot/nocow exlcusion to drw lock 2020-01-30 12:59 [PATCH 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov ` (3 preceding siblings ...) 2020-01-30 12:59 ` [PATCH v2 1/2] btrfs: Implement DRW lock Nikolay Borisov @ 2020-01-30 12:59 ` Nikolay Borisov 4 siblings, 0 replies; 26+ messages in thread From: Nikolay Borisov @ 2020-01-30 12:59 UTC (permalink / raw) To: linux-btrfs; +Cc: Nikolay Borisov This patch removes all haphazard code implementing nocow writers exclusion from pending snapshot creation and switches to using the drw lock to ensure this invariant still holds. "Readers" are snapshot creators from create_snapshot and "writers" are nocow writers from the buffered write path or btrfs_setsize. This locking scheme allows for multiple snapshots to happen while any nocow writers are blocked, since writes to page cache in the nocow path will make snapshots inconsistent. So for performance reasons we'd like to have the ability to run multiple concurrent snapshots, and this scheme also favors readers in this case. And in case there aren't pending snapshots (which will be the majority of the cases) we rely on the percpu writers counter to avoid cacheline contention. The main gain from using the drw lock is that it's now a lot easier to reason about the guarantees of the locking scheme and whether there is some silent breakage lurking.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> --- fs/btrfs/ctree.h | 9 ++------- fs/btrfs/disk-io.c | 46 ++++++++++-------------------------------- fs/btrfs/extent-tree.c | 44 ---------------------------------------- fs/btrfs/file.c | 11 +++++----- fs/btrfs/inode.c | 8 ++++---- fs/btrfs/ioctl.c | 10 +++------ 6 files changed, 25 insertions(+), 103 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 908430f563fa..4c2ac346bb73 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -958,11 +958,6 @@ static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb) return sb->s_fs_info; } -struct btrfs_subvolume_writers { - struct percpu_counter counter; - wait_queue_head_t wait; -}; - /* * The state of btrfs root */ @@ -1134,8 +1129,8 @@ struct btrfs_root { * root_item_lock. */ int dedupe_in_progress; - struct btrfs_subvolume_writers *subv_writers; - atomic_t will_be_snapshotted; + struct btrfs_drw_lock snapshot_lock; + atomic_t snapshot_force_cow; /* For qgroup metadata reserved space */ diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index aea48d6ddc0c..699ea2461eb5 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1100,32 +1100,6 @@ void btrfs_clean_tree_block(struct extent_buffer *buf) } } -static struct btrfs_subvolume_writers *btrfs_alloc_subvolume_writers(void) -{ - struct btrfs_subvolume_writers *writers; - int ret; - - writers = kmalloc(sizeof(*writers), GFP_NOFS); - if (!writers) - return ERR_PTR(-ENOMEM); - - ret = percpu_counter_init(&writers->counter, 0, GFP_NOFS); - if (ret < 0) { - kfree(writers); - return ERR_PTR(ret); - } - - init_waitqueue_head(&writers->wait); - return writers; -} - -static void -btrfs_free_subvolume_writers(struct btrfs_subvolume_writers *writers) -{ - percpu_counter_destroy(&writers->counter); - kfree(writers); -} - static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info, u64 objectid) { @@ -1173,7 +1147,6 @@ static void __setup_root(struct btrfs_root *root, struct 
btrfs_fs_info *fs_info, atomic_set(&root->log_writers, 0); atomic_set(&root->log_batch, 0); refcount_set(&root->refs, 1); - atomic_set(&root->will_be_snapshotted, 0); atomic_set(&root->snapshot_force_cow, 0); atomic_set(&root->nr_swapfiles, 0); root->log_transid = 0; @@ -1462,7 +1435,7 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root, int btrfs_init_fs_root(struct btrfs_root *root) { int ret; - struct btrfs_subvolume_writers *writers; + unsigned int nofs_flag; root->free_ino_ctl = kzalloc(sizeof(*root->free_ino_ctl), GFP_NOFS); root->free_ino_pinned = kzalloc(sizeof(*root->free_ino_pinned), @@ -1472,12 +1445,16 @@ int btrfs_init_fs_root(struct btrfs_root *root) goto fail; } - writers = btrfs_alloc_subvolume_writers(); - if (IS_ERR(writers)) { - ret = PTR_ERR(writers); + /* + * We might be called under a transaction (e.g. indirect + * backref resolution) which could deadlock if it triggers memory + * reclaim + */ + nofs_flag = memalloc_nofs_save(); + ret = btrfs_drw_lock_init(&root->snapshot_lock); + memalloc_nofs_restore(nofs_flag); + if (ret) goto fail; - } - root->subv_writers = writers; btrfs_init_free_ino_ctl(root); spin_lock_init(&root->ino_cache_lock); @@ -3862,8 +3839,7 @@ void btrfs_free_fs_root(struct btrfs_root *root) WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree)); if (root->anon_dev) free_anon_bdev(root->anon_dev); - if (root->subv_writers) - btrfs_free_subvolume_writers(root->subv_writers); + btrfs_drw_lock_destroy(&root->snapshot_lock); free_extent_buffer(root->node); free_extent_buffer(root->commit_root); kfree(root->free_ino_ctl); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0163fdd59f8f..2b83e8c7da11 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5747,47 +5747,3 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range) return bg_ret; return dev_ret; } - -/* - * btrfs_{start,end}_write_no_snapshotting() are similar to - * mnt_{want,drop}_write(), they are used to 
prevent some tasks from writing - * data into the page cache through nocow before the subvolume is snapshoted, - * but flush the data into disk after the snapshot creation, or to prevent - * operations while snapshotting is ongoing and that cause the snapshot to be - * inconsistent (writes followed by expanding truncates for example). - */ -void btrfs_end_write_no_snapshotting(struct btrfs_root *root) -{ - percpu_counter_dec(&root->subv_writers->counter); - cond_wake_up(&root->subv_writers->wait); -} - -int btrfs_start_write_no_snapshotting(struct btrfs_root *root) -{ - if (atomic_read(&root->will_be_snapshotted)) - return 0; - - percpu_counter_inc(&root->subv_writers->counter); - /* - * Make sure counter is updated before we check for snapshot creation. - */ - smp_mb(); - if (atomic_read(&root->will_be_snapshotted)) { - btrfs_end_write_no_snapshotting(root); - return 0; - } - return 1; -} - -void btrfs_wait_for_snapshot_creation(struct btrfs_root *root) -{ - while (true) { - int ret; - - ret = btrfs_start_write_no_snapshotting(root); - if (ret) - break; - wait_var_event(&root->will_be_snapshotted, - !atomic_read(&root->will_be_snapshotted)); - } -} diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index a16da274c9aa..52f776383718 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1552,8 +1552,7 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos, u64 num_bytes; int ret; - ret = btrfs_start_write_no_snapshotting(root); - if (!ret) + if (!btrfs_drw_try_write_lock(&root->snapshot_lock)) return -EAGAIN; lockstart = round_down(pos, fs_info->sectorsize); @@ -1568,7 +1567,7 @@ static noinline int check_can_nocow(struct btrfs_inode *inode, loff_t pos, NULL, NULL, NULL); if (ret <= 0) { ret = 0; - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); } else { *write_bytes = min_t(size_t, *write_bytes , num_bytes - pos + lockstart); @@ -1674,7 +1673,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb 
*iocb, data_reserved, pos, write_bytes); else - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); break; } @@ -1778,7 +1777,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, release_bytes = 0; if (only_release_metadata) - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); if (only_release_metadata && copied > 0) { lockstart = round_down(pos, @@ -1807,7 +1806,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb, if (release_bytes) { if (only_release_metadata) { - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); btrfs_delalloc_release_metadata(BTRFS_I(inode), release_bytes, true); } else { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 581728c6bcb8..882cf977f9cb 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4594,16 +4594,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr) * truncation, it must capture all writes that happened before * this truncation. 
*/ - btrfs_wait_for_snapshot_creation(root); + btrfs_drw_write_lock(&root->snapshot_lock); ret = btrfs_cont_expand(inode, oldsize, newsize); if (ret) { - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); return ret; } trans = btrfs_start_transaction(root, 1); if (IS_ERR(trans)) { - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); return PTR_ERR(trans); } @@ -4611,7 +4611,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr) btrfs_ordered_update_i_size(inode, i_size_read(inode), NULL); pagecache_isize_extended(inode, oldsize, newsize); ret = btrfs_update_inode(trans, root, inode); - btrfs_end_write_no_snapshotting(root); + btrfs_drw_write_unlock(&root->snapshot_lock); btrfs_end_transaction(trans); } else { diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index b2396d72329d..c24f9be244c3 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -789,11 +789,7 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, * possible. This is to avoid later writeback (running dealloc) to * fallback to COW mode and unexpectedly fail with ENOSPC. */ - atomic_inc(&root->will_be_snapshotted); - smp_mb__after_atomic(); - /* wait for no snapshot writes */ - wait_event(root->subv_writers->wait, - percpu_counter_sum(&root->subv_writers->counter) == 0); + btrfs_drw_read_lock(&root->snapshot_lock); ret = btrfs_start_delalloc_snapshot(root); if (ret) @@ -873,8 +869,8 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir, dec_and_free: if (snapshot_force_cow) atomic_dec(&root->snapshot_force_cow); - if (atomic_dec_and_test(&root->will_be_snapshotted)) - wake_up_var(&root->will_be_snapshotted); + btrfs_drw_read_unlock(&root->snapshot_lock); + free_pending: kfree(pending_snapshot->root_item); btrfs_free_path(pending_snapshot->path); -- 2.17.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 0/2] Refactor snapshot vs nocow writers locking @ 2020-02-24 15:26 Nikolay Borisov 2020-02-24 15:32 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov 0 siblings, 1 reply; 26+ messages in thread From: Nikolay Borisov @ 2020-02-24 15:26 UTC (permalink / raw) To: linux-btrfs; +Cc: Nikolay Borisov Here is v3 of the DRW locking patches. Main changes in this version: * Removed EXPORT_SYMBOL from the functions since I do not intend to submit the locktorture patch * Added high-level comment as per David's request. V2: * Fixed all checkpatch warnings (Andrea Parri) * Properly call write_unlock in btrfs_drw_try_write_lock (Filipe Manana) * Comment fix. * Stress tested it via locktorture. Survived for 8 straight days on a 4-socket, 48-thread machine. Nikolay Borisov (2): btrfs: convert snapshot/nocow exlcusion to drw lock btrfs: Hook btrfs' DRW lock to locktorture infrastructure fs/btrfs/ctree.h | 9 +---- fs/btrfs/disk-io.c | 46 ++++++--------------- fs/btrfs/extent-tree.c | 44 --------------------- fs/btrfs/file.c | 11 +++--- fs/btrfs/inode.c | 8 ++-- fs/btrfs/ioctl.c | 10 ++--- fs/btrfs/locking.c | 5 +++ fs/btrfs/locking.h | 1 + kernel/locking/locktorture.c | 77 +++++++++++++++++++++++++++++++++++- 9 files changed, 107 insertions(+), 104 deletions(-) -- 2.17.1 ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 1/2] btrfs: Implement DRW lock 2020-02-24 15:26 [PATCH v3 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov @ 2020-02-24 15:32 ` Nikolay Borisov 0 siblings, 0 replies; 26+ messages in thread From: Nikolay Borisov @ 2020-02-24 15:32 UTC (permalink / raw) To: linux-btrfs; +Cc: Nikolay Borisov A (D)ouble (R)eader (W)riter lock is a locking primitive that allows to have multiple readers or multiple writers but not multiple readers and writers holding it concurrently. The code is factored out from the existing open-coded locking scheme used to exclude pending snapshots from nocow writers and vice-versa. The current implementation actually favors readers (that is, snapshot creators) over writers (nocow writers of the filesystem). Signed-off-by: Nikolay Borisov <nborisov@suse.com> --- fs/btrfs/ctree.h | 1 + fs/btrfs/locking.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/locking.h | 21 +++++++++++ 3 files changed, 112 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index ad275d06e95f..efc2cd147141 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -33,6 +33,7 @@ #include "extent_map.h" #include "async-thread.h" #include "block-rsv.h" +#include "locking.h" struct btrfs_trans_handle; struct btrfs_transaction; diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c index e713900f96b6..d890833694c9 100644 --- a/fs/btrfs/locking.c +++ b/fs/btrfs/locking.c @@ -565,3 +565,93 @@ struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root) } return eb; } + +/* + * DRW stands for double-reader-writer lock. It's used in situations where you + * want to provide A-B exclusion but not AA or BB. The current implementation + * gives more priority to readers. If a reader and a writer both race to acquire + * their respective sides of the lock, the writer would yield its lock as soon as + * it detects a concurrent reader. Additionally, if there are pending readers, no + * new writers would be allowed to come in and acquire the lock.
+ */ +int btrfs_drw_lock_init(struct btrfs_drw_lock *lock) +{ + int ret; + + ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL); + if (ret) + return ret; + + atomic_set(&lock->readers, 0); + init_waitqueue_head(&lock->pending_readers); + init_waitqueue_head(&lock->pending_writers); + + return 0; +} + +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) +{ + percpu_counter_destroy(&lock->writers); +} + +/* Return true if acquisition is successful, false otherwise */ +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) +{ + if (atomic_read(&lock->readers)) + return false; + + percpu_counter_inc(&lock->writers); + + /* + * Ensure writers count is updated before we check for + * pending readers + */ + smp_mb(); + if (atomic_read(&lock->readers)) { + btrfs_drw_write_unlock(lock); + return false; + } + + return true; +} + +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) +{ + while (true) { + if (btrfs_drw_try_write_lock(lock)) + return; + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); + } +} + +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) +{ + percpu_counter_dec(&lock->writers); + cond_wake_up(&lock->pending_readers); +} + +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) +{ + atomic_inc(&lock->readers); + + /* + * Ensure the pending reader count is perceived BEFORE this reader + * goes to sleep in case of active writers. This guarantees new writers + * won't be allowed and that the current reader will be woken up when + * the last active writer finishes its job.
+ */ + smp_mb__after_atomic(); + + wait_event(lock->pending_readers, + percpu_counter_sum(&lock->writers) == 0); +} + +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) +{ + /* + * atomic_dec_and_test implies a full barrier, so woken up writers + * are guaranteed to see the decrement + */ + if (atomic_dec_and_test(&lock->readers)) + wake_up(&lock->pending_writers); +} diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h index 21a285883e89..ba60318c53d5 100644 --- a/fs/btrfs/locking.h +++ b/fs/btrfs/locking.h @@ -7,12 +7,17 @@ #define BTRFS_LOCKING_H #include "extent_io.h" +#include <linux/atomic.h> +#include <linux/wait.h> +#include <linux/percpu_counter.h> #define BTRFS_WRITE_LOCK 1 #define BTRFS_READ_LOCK 2 #define BTRFS_WRITE_LOCK_BLOCKING 3 #define BTRFS_READ_LOCK_BLOCKING 4 +struct btrfs_path; + void btrfs_tree_lock(struct extent_buffer *eb); void btrfs_tree_unlock(struct extent_buffer *eb); @@ -48,4 +53,20 @@ static inline void btrfs_tree_unlock_rw(struct extent_buffer *eb, int rw) BUG(); } + +struct btrfs_drw_lock { + atomic_t readers; + struct percpu_counter writers; + wait_queue_head_t pending_writers; + wait_queue_head_t pending_readers; +}; + +int btrfs_drw_lock_init(struct btrfs_drw_lock *lock); +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); + #endif -- 2.17.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 0/2] Refactor snapshot vs nocow writers locking @ 2019-06-06 13:52 Nikolay Borisov 2019-06-06 13:52 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov 0 siblings, 1 reply; 26+ messages in thread From: Nikolay Borisov @ 2019-06-06 13:52 UTC (permalink / raw) To: linux-btrfs; +Cc: linux-kernel, andrea.parri, peterz, paulmck, Nikolay Borisov This patchset first factors out the open code which essentially implements a lock that allows to have either multiple readers or multiple writers but not both. Then patch 2 just converts the code to using the newly introduced lock. The individual patch descriptions contain more information about the technical details and invariants that the lock provides. I have also CC'ed a couple of the maintainers of the Linux memory model since my patches just factor out the code and I would really like someone proficient enough in the usage/semantics of memory barriers to review it as well. Nikolay Borisov (2): btrfs: Implement DRW lock btrfs: convert snapshot/nocow exlcusion to drw lock fs/btrfs/Makefile | 2 +- fs/btrfs/ctree.h | 10 ++---- fs/btrfs/disk-io.c | 39 ++--------------- fs/btrfs/drw_lock.c | 71 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/drw_lock.h | 23 ++++++++++++++ fs/btrfs/extent-tree.c | 35 --------------------- fs/btrfs/file.c | 12 +++---- fs/btrfs/inode.c | 8 ++--- fs/btrfs/ioctl.c | 10 ++---- 9 files changed, 114 insertions(+), 96 deletions(-) create mode 100644 fs/btrfs/drw_lock.c create mode 100644 fs/btrfs/drw_lock.h -- 2.17.1 ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 1/2] btrfs: Implement DRW lock 2019-06-06 13:52 [PATCH 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov @ 2019-06-06 13:52 ` Nikolay Borisov 2019-06-06 15:15 ` Filipe Manana ` (3 more replies) 0 siblings, 4 replies; 26+ messages in thread From: Nikolay Borisov @ 2019-06-06 13:52 UTC (permalink / raw) To: linux-btrfs; +Cc: linux-kernel, andrea.parri, peterz, paulmck, Nikolay Borisov A (D)ouble (R)eader (W)riter lock is a locking primitive that allows to have multiple readers or multiple writers but not multiple readers and writers holding it concurrently. The code is factored out from the existing open-coded locking scheme used to exclude pending snapshots from nocow writers and vice-versa. The current implementation actually favors readers (that is, snapshot creators) over writers (nocow writers of the filesystem). Signed-off-by: Nikolay Borisov <nborisov@suse.com> --- fs/btrfs/Makefile | 2 +- fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/drw_lock.h | 23 +++++++++++++++ 3 files changed, 95 insertions(+), 1 deletion(-) create mode 100644 fs/btrfs/drw_lock.c create mode 100644 fs/btrfs/drw_lock.h diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index ca693dd554e9..dc60127791e6 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ - uuid-tree.o props.o free-space-tree.o tree-checker.o + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c new file mode 100644 index 000000000000..9681bf7544be --- /dev/null +++ b/fs/btrfs/drw_lock.c @@ -0,0 +1,71 @@ +#include
"drw_lock.h" +#include "ctree.h" + +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) +{ + atomic_set(&lock->readers, 0); + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); + init_waitqueue_head(&lock->pending_readers); + init_waitqueue_head(&lock->pending_writers); +} + +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) +{ + percpu_counter_destroy(&lock->writers); +} + +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) +{ + if (atomic_read(&lock->readers)) + return false; + + percpu_counter_inc(&lock->writers); + + /* + * Ensure writers count is updated before we check for + * pending readers + */ + smp_mb(); + if (atomic_read(&lock->readers)) { + btrfs_drw_read_unlock(lock); + return false; + } + + return true; +} + +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) +{ + while(true) { + if (btrfs_drw_try_write_lock(lock)) + return; + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); + } +} + +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) +{ + percpu_counter_dec(&lock->writers); + cond_wake_up(&lock->pending_readers); +} + +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) +{ + atomic_inc(&lock->readers); + smp_mb__after_atomic(); + + wait_event(lock->pending_readers, + percpu_counter_sum(&lock->writers) == 0); +} + +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) +{ + /* + * Atomic RMW operations imply full barrier, so woken up writers + * are guaranteed to see the decrement + */ + if (atomic_dec_and_test(&lock->readers)) + wake_up(&lock->pending_writers); +} + + diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h new file mode 100644 index 000000000000..baff59561c06 --- /dev/null +++ b/fs/btrfs/drw_lock.h @@ -0,0 +1,23 @@ +#ifndef BTRFS_DRW_LOCK_H +#define BTRFS_DRW_LOCK_H + +#include <linux/atomic.h> +#include <linux/wait.h> +#include <linux/percpu_counter.h> + +struct btrfs_drw_lock { + atomic_t readers; + struct percpu_counter writers; + wait_queue_head_t pending_writers; + wait_queue_head_t 
pending_readers; +}; + +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); + +#endif -- 2.17.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-06 13:52 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov @ 2019-06-06 15:15 ` Filipe Manana 2019-06-07 10:52 ` Paul E. McKenney ` (2 subsequent siblings) 3 siblings, 0 replies; 26+ messages in thread From: Filipe Manana @ 2019-06-06 15:15 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz, paulmck On Thu, Jun 6, 2019 at 2:55 PM Nikolay Borisov <nborisov@suse.com> wrote: > > A (D)ouble (R)eader (W)riter lock is a locking primitive that allows > to have multiple readers or multiple writers but not multiple readers > and writers holding it concurrently. The code is factored out from > the existing open-coded locking scheme used to exclude pending > snapshots from nocow writers and vice-versa. Current implementation > actually favors Readers (that is snapshot creaters) to writers (nocow > writers of the filesystem). > > Signed-off-by: Nikolay Borisov <nborisov@suse.com> > --- > fs/btrfs/Makefile | 2 +- > fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ > fs/btrfs/drw_lock.h | 23 +++++++++++++++ > 3 files changed, 95 insertions(+), 1 deletion(-) > create mode 100644 fs/btrfs/drw_lock.c > create mode 100644 fs/btrfs/drw_lock.h > > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > index ca693dd554e9..dc60127791e6 100644 > --- a/fs/btrfs/Makefile > +++ b/fs/btrfs/Makefile > @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ > export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ > compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ > reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ > - uuid-tree.o props.o free-space-tree.o tree-checker.o > + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o > > btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o > btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o > diff --git a/fs/btrfs/drw_lock.c 
b/fs/btrfs/drw_lock.c > new file mode 100644 > index 000000000000..9681bf7544be > --- /dev/null > +++ b/fs/btrfs/drw_lock.c > @@ -0,0 +1,71 @@ > +#include "drw_lock.h" > +#include "ctree.h" > + > +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) > +{ > + atomic_set(&lock->readers, 0); > + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); > + init_waitqueue_head(&lock->pending_readers); > + init_waitqueue_head(&lock->pending_writers); > +} > + > +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) > +{ > + percpu_counter_destroy(&lock->writers); > +} > + > +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) > +{ > + if (atomic_read(&lock->readers)) > + return false; > + > + percpu_counter_inc(&lock->writers); > + > + /* > + * Ensure writers count is updated before we check for > + * pending readers > + */ > + smp_mb(); > + if (atomic_read(&lock->readers)) { > + btrfs_drw_read_unlock(lock); Should be btrfs_drw_write_unlock(lock) > + return false; > + } > + > + return true; > +} > + > +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) > +{ > + while(true) { Missing space after 'while'. Thanks. 
> + if (btrfs_drw_try_write_lock(lock)) > + return; > + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); > + } > +} > + > +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) > +{ > + percpu_counter_dec(&lock->writers); > + cond_wake_up(&lock->pending_readers); > +} > + > +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) > +{ > + atomic_inc(&lock->readers); > + smp_mb__after_atomic(); > + > + wait_event(lock->pending_readers, > + percpu_counter_sum(&lock->writers) == 0); > +} > + > +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) > +{ > + /* > + * Atomic RMW operations imply full barrier, so woken up writers > + * are guaranteed to see the decrement > + */ > + if (atomic_dec_and_test(&lock->readers)) > + wake_up(&lock->pending_writers); > +} > + > + > diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h > new file mode 100644 > index 000000000000..baff59561c06 > --- /dev/null > +++ b/fs/btrfs/drw_lock.h > @@ -0,0 +1,23 @@ > +#ifndef BTRFS_DRW_LOCK_H > +#define BTRFS_DRW_LOCK_H > + > +#include <linux/atomic.h> > +#include <linux/wait.h> > +#include <linux/percpu_counter.h> > + > +struct btrfs_drw_lock { > + atomic_t readers; > + struct percpu_counter writers; > + wait_queue_head_t pending_writers; > + wait_queue_head_t pending_readers; > +}; > + > +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); > +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); > +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); > +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); > +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); > +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); > +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); > + > +#endif > -- > 2.17.1 > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.” ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-06 13:52 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov 2019-06-06 15:15 ` Filipe Manana @ 2019-06-07 10:52 ` Paul E. McKenney 2019-06-07 11:59 ` Nikolay Borisov 2019-06-08 16:33 ` Andrea Parri 2019-06-12 14:05 ` David Sterba 3 siblings, 1 reply; 26+ messages in thread From: Paul E. McKenney @ 2019-06-07 10:52 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote: > A (D)ouble (R)eader (W)riter lock is a locking primitive that allows > to have multiple readers or multiple writers but not multiple readers > and writers holding it concurrently. The code is factored out from > the existing open-coded locking scheme used to exclude pending > snapshots from nocow writers and vice-versa. Current implementation > actually favors Readers (that is snapshot creaters) to writers (nocow > writers of the filesystem). > > Signed-off-by: Nikolay Borisov <nborisov@suse.com> A preliminary question... What prevents the following sequence of events from happening? o btrfs_drw_write_lock() invokes btrfs_drw_try_write_lock(), which sees that lock->readers is zero and thus executes percpu_counter_inc(&lock->writers). o btrfs_drw_read_lock() increments lock->readers, does the smp_mb__after_atomic(), and then does the wait_event(). Because btrfs_drw_try_write_lock() incremented its CPU's lock->writers, the sum is the value one, so it blocks. o btrfs_drw_try_write_lock() checks lock->readers, sees that it is now nonzero, and thus invokes btrfs_drw_read_unlock() (which decrements the current CPU's counter, so that a future sum would get zero), and returns false. o btrfs_drw_write_lock() therefore does its wait_event(). Because lock->readers is nonzero, it blocks. o Both tasks are now blocked. In the absence of future calls to these functions (and perhaps even given such future calls), we have deadlock. 
So what am I missing here? Thanx, Paul > --- > fs/btrfs/Makefile | 2 +- > fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ > fs/btrfs/drw_lock.h | 23 +++++++++++++++ > 3 files changed, 95 insertions(+), 1 deletion(-) > create mode 100644 fs/btrfs/drw_lock.c > create mode 100644 fs/btrfs/drw_lock.h > > diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > index ca693dd554e9..dc60127791e6 100644 > --- a/fs/btrfs/Makefile > +++ b/fs/btrfs/Makefile > @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ > export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ > compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ > reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ > - uuid-tree.o props.o free-space-tree.o tree-checker.o > + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o > > btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o > btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o > diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c > new file mode 100644 > index 000000000000..9681bf7544be > --- /dev/null > +++ b/fs/btrfs/drw_lock.c > @@ -0,0 +1,71 @@ > +#include "drw_lock.h" > +#include "ctree.h" > + > +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) > +{ > + atomic_set(&lock->readers, 0); > + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); > + init_waitqueue_head(&lock->pending_readers); > + init_waitqueue_head(&lock->pending_writers); > +} > + > +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) > +{ > + percpu_counter_destroy(&lock->writers); > +} > + > +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) > +{ > + if (atomic_read(&lock->readers)) > + return false; > + > + percpu_counter_inc(&lock->writers); > + > + /* > + * Ensure writers count is updated before we check for > + * pending readers > + */ > + smp_mb(); > + if (atomic_read(&lock->readers)) { > + btrfs_drw_read_unlock(lock); > + return false; > + } > + > + 
return true; > +} > + > +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) > +{ > + while(true) { > + if (btrfs_drw_try_write_lock(lock)) > + return; > + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); > + } > +} > + > +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) > +{ > + percpu_counter_dec(&lock->writers); > + cond_wake_up(&lock->pending_readers); > +} > + > +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) > +{ > + atomic_inc(&lock->readers); > + smp_mb__after_atomic(); > + > + wait_event(lock->pending_readers, > + percpu_counter_sum(&lock->writers) == 0); > +} > + > +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) > +{ > + /* > + * Atomic RMW operations imply full barrier, so woken up writers > + * are guaranteed to see the decrement > + */ > + if (atomic_dec_and_test(&lock->readers)) > + wake_up(&lock->pending_writers); > +} > + > + > diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h > new file mode 100644 > index 000000000000..baff59561c06 > --- /dev/null > +++ b/fs/btrfs/drw_lock.h > @@ -0,0 +1,23 @@ > +#ifndef BTRFS_DRW_LOCK_H > +#define BTRFS_DRW_LOCK_H > + > +#include <linux/atomic.h> > +#include <linux/wait.h> > +#include <linux/percpu_counter.h> > + > +struct btrfs_drw_lock { > + atomic_t readers; > + struct percpu_counter writers; > + wait_queue_head_t pending_writers; > + wait_queue_head_t pending_readers; > +}; > + > +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); > +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); > +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); > +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); > +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); > +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); > +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); > + > +#endif > -- > 2.17.1 > ^ permalink raw reply [flat|nested] 26+ messages in thread
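Paul's interleaving can be replayed deterministically with a small userspace model of the v1 lock state. This is only an illustrative sketch — plain integers and a boolean stand in for the atomic_t, the per-CPU counter sum, and the posted wakeup; it is not the kernel code:

```c
#include <stdbool.h>

/* Model state: ints stand in for atomic_t lock->readers and for
 * percpu_counter_sum(&lock->writers); the bool records whether a
 * wakeup was ever posted to lock->pending_readers. */
static int readers, writers;
static bool pending_readers_woken;

/* Replay the interleaving above against the v1 code; returns true if
 * it ends with both tasks parked, i.e. deadlocked. */
static bool replay_v1(void)
{
	readers = writers = 0;
	pending_readers_woken = false;

	/* W: btrfs_drw_try_write_lock() sees readers == 0 ... */
	if (readers == 0)
		writers++;	/* ... and bumps its per-CPU writer count */

	/* R: btrfs_drw_read_lock() bumps readers, then parks in
	 * wait_event() because the writers sum is nonzero. */
	readers++;
	bool reader_parked = (writers != 0);

	/* W: re-checks readers, sees it nonzero, and backs off.  The
	 * v1 bug: the back-off path drops the writer count but never
	 * wakes pending_readers (only btrfs_drw_write_unlock() does). */
	if (readers != 0)
		writers--;

	/* R re-checks its condition only when woken.  No wakeup was
	 * posted, so it stays parked even though writers is now 0.
	 * W then parks in its own wait_event(), since readers != 0. */
	bool writer_parked = (readers != 0);

	return reader_parked && !pending_readers_woken && writer_parked;
}
```

In this model replay_v1() comes back true: the reader missed its only wakeup and the writer is blocked behind the reader, matching the deadlock described above.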
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-07 10:52 ` Paul E. McKenney @ 2019-06-07 11:59 ` Nikolay Borisov 2019-06-08 15:13 ` Paul E. McKenney 0 siblings, 1 reply; 26+ messages in thread From: Nikolay Borisov @ 2019-06-07 11:59 UTC (permalink / raw) To: paulmck; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz On 7.06.19 г. 13:52 ч., Paul E. McKenney wrote: > On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote: >> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows >> to have multiple readers or multiple writers but not multiple readers >> and writers holding it concurrently. The code is factored out from >> the existing open-coded locking scheme used to exclude pending >> snapshots from nocow writers and vice-versa. Current implementation >> actually favors Readers (that is snapshot creaters) to writers (nocow >> writers of the filesystem). >> >> Signed-off-by: Nikolay Borisov <nborisov@suse.com> > > A preliminary question... > > What prevents the following sequence of events from happening? > > o btrfs_drw_write_lock() invokes btrfs_drw_try_write_lock(), > which sees that lock->readers is zero and thus executes > percpu_counter_inc(&lock->writers). > > o btrfs_drw_read_lock() increments lock->readers, does the > smp_mb__after_atomic(), and then does the wait_event(). > Because btrfs_drw_try_write_lock() incremented its CPU's > lock->writers, the sum is the value one, so it blocks. > > o btrfs_drw_try_write_lock() checks lock->readers, sees that > it is now nonzero, and thus invokes btrfs_drw_read_unlock() > (which decrements the current CPU's counter, so that a future > sum would get zero), and returns false. btrfs_drw_read_unlock is actually btrfs_drw_write_unlock, my bad, Filipe already pointed that out and I've fixed it. 
The idea here is that if a reader came after we've incremented our percpu counter then it would have blocked, the writer would see that and invoke btrfs_drw_write_unlock which will decrement the percpu counter and will wake up the reader that is now blocked on pending_readers. > > o btrfs_drw_write_lock() therefore does its wait_event(). > Because lock->readers is nonzero, it blocks. > > o Both tasks are now blocked. In the absence of future calls > to these functions (and perhaps even given such future calls), > we have deadlock. > > So what am I missing here? > > Thanx, Paul > >> --- >> fs/btrfs/Makefile | 2 +- >> fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ >> fs/btrfs/drw_lock.h | 23 +++++++++++++++ >> 3 files changed, 95 insertions(+), 1 deletion(-) >> create mode 100644 fs/btrfs/drw_lock.c >> create mode 100644 fs/btrfs/drw_lock.h >> >> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile >> index ca693dd554e9..dc60127791e6 100644 >> --- a/fs/btrfs/Makefile >> +++ b/fs/btrfs/Makefile >> @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ >> export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ >> compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ >> reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ >> - uuid-tree.o props.o free-space-tree.o tree-checker.o >> + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o >> >> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o >> btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o >> diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c >> new file mode 100644 >> index 000000000000..9681bf7544be >> --- /dev/null >> +++ b/fs/btrfs/drw_lock.c >> @@ -0,0 +1,71 @@ >> +#include "drw_lock.h" >> +#include "ctree.h" >> + >> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) >> +{ >> + atomic_set(&lock->readers, 0); >> + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); >> +
init_waitqueue_head(&lock->pending_readers); >> + init_waitqueue_head(&lock->pending_writers); >> +} >> + >> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) >> +{ >> + percpu_counter_destroy(&lock->writers); >> +} >> + >> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) >> +{ >> + if (atomic_read(&lock->readers)) >> + return false; >> + >> + percpu_counter_inc(&lock->writers); >> + >> + /* >> + * Ensure writers count is updated before we check for >> + * pending readers >> + */ >> + smp_mb(); >> + if (atomic_read(&lock->readers)) { >> + btrfs_drw_read_unlock(lock); >> + return false; >> + } >> + >> + return true; >> +} >> + >> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) >> +{ >> + while(true) { >> + if (btrfs_drw_try_write_lock(lock)) >> + return; >> + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); >> + } >> +} >> + >> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) >> +{ >> + percpu_counter_dec(&lock->writers); >> + cond_wake_up(&lock->pending_readers); >> +} >> + >> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) >> +{ >> + atomic_inc(&lock->readers); >> + smp_mb__after_atomic(); >> + >> + wait_event(lock->pending_readers, >> + percpu_counter_sum(&lock->writers) == 0); >> +} >> + >> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) >> +{ >> + /* >> + * Atomic RMW operations imply full barrier, so woken up writers >> + * are guaranteed to see the decrement >> + */ >> + if (atomic_dec_and_test(&lock->readers)) >> + wake_up(&lock->pending_writers); >> +} >> + >> + >> diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h >> new file mode 100644 >> index 000000000000..baff59561c06 >> --- /dev/null >> +++ b/fs/btrfs/drw_lock.h >> @@ -0,0 +1,23 @@ >> +#ifndef BTRFS_DRW_LOCK_H >> +#define BTRFS_DRW_LOCK_H >> + >> +#include <linux/atomic.h> >> +#include <linux/wait.h> >> +#include <linux/percpu_counter.h> >> + >> +struct btrfs_drw_lock { >> + atomic_t readers; >> + struct percpu_counter writers; 
>> + wait_queue_head_t pending_writers; >> + wait_queue_head_t pending_readers; >> +}; >> + >> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); >> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); >> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); >> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); >> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); >> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); >> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); >> + >> +#endif >> -- >> 2.17.1 >> > > ^ permalink raw reply [flat|nested] 26+ messages in thread
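The fix described here — backing off through btrfs_drw_write_unlock(), which both decrements the per-CPU counter and wakes pending_readers — can be checked against the same interleaving with the same kind of userspace sketch (again a state-machine model, not the kernel code):

```c
#include <stdbool.h>

/* Same stand-ins as before: plain ints for the counters, a bool for a
 * posted wakeup on lock->pending_readers. */
static int readers, writers;
static bool pending_readers_woken;

/* Replay the interleaving with the corrected back-off path. */
static bool replay_fixed(void)
{
	readers = writers = 0;
	pending_readers_woken = false;

	if (readers == 0)	/* W: try_write_lock, first check */
		writers++;

	readers++;		/* R: read_lock, parks on writers != 0 */

	/* W: second check fails; back off via write_unlock semantics. */
	if (readers != 0) {
		writers--;			/* percpu_counter_dec() */
		pending_readers_woken = true;	/* cond_wake_up(&pending_readers) */
	}

	/* R is woken, re-checks its condition (writers sum == 0), and
	 * proceeds with the read lock held -- no deadlock. */
	return pending_readers_woken && writers == 0;
}
```

replay_fixed() comes back true in this model: the wakeup reaches the parked reader, so the earlier interleaving no longer deadlocks.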
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-07 11:59 ` Nikolay Borisov @ 2019-06-08 15:13 ` Paul E. McKenney 2019-06-08 15:44 ` Nikolay Borisov 0 siblings, 1 reply; 26+ messages in thread From: Paul E. McKenney @ 2019-06-08 15:13 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz On Fri, Jun 07, 2019 at 02:59:34PM +0300, Nikolay Borisov wrote: > > > On 7.06.19 г. 13:52 ч., Paul E. McKenney wrote: > > On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote: > >> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows > >> to have multiple readers or multiple writers but not multiple readers > >> and writers holding it concurrently. The code is factored out from > >> the existing open-coded locking scheme used to exclude pending > >> snapshots from nocow writers and vice-versa. Current implementation > >> actually favors Readers (that is snapshot creaters) to writers (nocow > >> writers of the filesystem). > >> > >> Signed-off-by: Nikolay Borisov <nborisov@suse.com> > > > > A preliminary question... > > > > What prevents the following sequence of events from happening? > > > > o btrfs_drw_write_lock() invokes btrfs_drw_try_write_lock(), > > which sees that lock->readers is zero and thus executes > > percpu_counter_inc(&lock->writers). > > > > o btrfs_drw_read_lock() increments lock->readers, does the > > smp_mb__after_atomic(), and then does the wait_event(). > > Because btrfs_drw_try_write_lock() incremented its CPU's > > lock->writers, the sum is the value one, so it blocks. > > > > o btrfs_drw_try_write_lock() checks lock->readers, sees that > > it is now nonzero, and thus invokes btrfs_drw_read_unlock() > > (which decrements the current CPU's counter, so that a future > > sum would get zero), and returns false. > > btrfs_drw_read_unlock is actually btrfs_drw_write_unlock, my bad, Filipe > already pointed that out and I've fixed it. Ah! I must then ask what you are using to test this. 
kernel/locktorture.c? > The idea here is that if a reader came after we've incremented out > percpu counter then it would have blocked, the writer would see that and > invoke btrfs_drw_write_unlock which will decrement the percpu counter > and will wakeup the reader that is now blocked on pending_readers. OK, I will await your next version. Thanx, Paul > > o btrfs_drw_write_lock() therefore does its wait_event(). > > Because lock->readers is nonzero, it blocks. > > > > o Both tasks are now blocked. In the absence of future calls > > to these functions (and perhaps even given such future calls), > > we have deadlock. > > > > So what am I missing here? > > > > Thanx, Paul > > > >> --- > >> fs/btrfs/Makefile | 2 +- > >> fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ > >> fs/btrfs/drw_lock.h | 23 +++++++++++++++ > >> 3 files changed, 95 insertions(+), 1 deletion(-) > >> create mode 100644 fs/btrfs/drw_lock.c > >> create mode 100644 fs/btrfs/drw_lock.h > >> > >> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > >> index ca693dd554e9..dc60127791e6 100644 > >> --- a/fs/btrfs/Makefile > >> +++ b/fs/btrfs/Makefile > >> @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ > >> export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ > >> compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ > >> reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ > >> - uuid-tree.o props.o free-space-tree.o tree-checker.o > >> + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o > >> > >> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o > >> btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o > >> diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c > >> new file mode 100644 > >> index 000000000000..9681bf7544be > >> --- /dev/null > >> +++ b/fs/btrfs/drw_lock.c > >> @@ -0,0 +1,71 @@ > >> +#include "drw_lock.h" > >> +#include "ctree.h" > >> + > >> +void 
btrfs_drw_lock_init(struct btrfs_drw_lock *lock) > >> +{ > >> + atomic_set(&lock->readers, 0); > >> + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); > >> + init_waitqueue_head(&lock->pending_readers); > >> + init_waitqueue_head(&lock->pending_writers); > >> +} > >> + > >> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) > >> +{ > >> + percpu_counter_destroy(&lock->writers); > >> +} > >> + > >> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) > >> +{ > >> + if (atomic_read(&lock->readers)) > >> + return false; > >> + > >> + percpu_counter_inc(&lock->writers); > >> + > >> + /* > >> + * Ensure writers count is updated before we check for > >> + * pending readers > >> + */ > >> + smp_mb(); > >> + if (atomic_read(&lock->readers)) { > >> + btrfs_drw_read_unlock(lock); > >> + return false; > >> + } > >> + > >> + return true; > >> +} > >> + > >> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) > >> +{ > >> + while(true) { > >> + if (btrfs_drw_try_write_lock(lock)) > >> + return; > >> + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); > >> + } > >> +} > >> + > >> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) > >> +{ > >> + percpu_counter_dec(&lock->writers); > >> + cond_wake_up(&lock->pending_readers); > >> +} > >> + > >> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) > >> +{ > >> + atomic_inc(&lock->readers); > >> + smp_mb__after_atomic(); > >> + > >> + wait_event(lock->pending_readers, > >> + percpu_counter_sum(&lock->writers) == 0); > >> +} > >> + > >> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) > >> +{ > >> + /* > >> + * Atomic RMW operations imply full barrier, so woken up writers > >> + * are guaranteed to see the decrement > >> + */ > >> + if (atomic_dec_and_test(&lock->readers)) > >> + wake_up(&lock->pending_writers); > >> +} > >> + > >> + > >> diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h > >> new file mode 100644 > >> index 000000000000..baff59561c06 > >> --- /dev/null > >> 
+++ b/fs/btrfs/drw_lock.h > >> @@ -0,0 +1,23 @@ > >> +#ifndef BTRFS_DRW_LOCK_H > >> +#define BTRFS_DRW_LOCK_H > >> + > >> +#include <linux/atomic.h> > >> +#include <linux/wait.h> > >> +#include <linux/percpu_counter.h> > >> + > >> +struct btrfs_drw_lock { > >> + atomic_t readers; > >> + struct percpu_counter writers; > >> + wait_queue_head_t pending_writers; > >> + wait_queue_head_t pending_readers; > >> +}; > >> + > >> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); > >> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); > >> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); > >> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); > >> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); > >> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); > >> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); > >> + > >> +#endif > >> -- > >> 2.17.1 > >> > > > > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-08 15:13 ` Paul E. McKenney @ 2019-06-08 15:44 ` Nikolay Borisov 2019-06-08 16:06 ` Paul E. McKenney 0 siblings, 1 reply; 26+ messages in thread From: Nikolay Borisov @ 2019-06-08 15:44 UTC (permalink / raw) To: paulmck; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz On 8.06.19 г. 18:13 ч., Paul E. McKenney wrote: > On Fri, Jun 07, 2019 at 02:59:34PM +0300, Nikolay Borisov wrote: >> >> >> On 7.06.19 г. 13:52 ч., Paul E. McKenney wrote: >>> On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote: >>>> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows >>>> to have multiple readers or multiple writers but not multiple readers >>>> and writers holding it concurrently. The code is factored out from >>>> the existing open-coded locking scheme used to exclude pending >>>> snapshots from nocow writers and vice-versa. Current implementation >>>> actually favors Readers (that is snapshot creaters) to writers (nocow >>>> writers of the filesystem). >>>> >>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com> >>> >>> A preliminary question... >>> >>> What prevents the following sequence of events from happening? >>> >>> o btrfs_drw_write_lock() invokes btrfs_drw_try_write_lock(), >>> which sees that lock->readers is zero and thus executes >>> percpu_counter_inc(&lock->writers). >>> >>> o btrfs_drw_read_lock() increments lock->readers, does the >>> smp_mb__after_atomic(), and then does the wait_event(). >>> Because btrfs_drw_try_write_lock() incremented its CPU's >>> lock->writers, the sum is the value one, so it blocks. >>> >>> o btrfs_drw_try_write_lock() checks lock->readers, sees that >>> it is now nonzero, and thus invokes btrfs_drw_read_unlock() >>> (which decrements the current CPU's counter, so that a future >>> sum would get zero), and returns false. >> >> btrfs_drw_read_unlock is actually btrfs_drw_write_unlock, my bad, Filipe >> already pointed that out and I've fixed it. > > Ah! 
I must then ask what you are using to test this. kernel/locktorture.c? At the moment - nothing. I rely on the fact that the original code I extracted that from is bug-free (ha-ha). So perhaps hooking up locktorture seems like a good suggestion. From a quick look I guess I could mostly model that lock against the rwsem. The question is how do I model the trylock semantics as well as the "double" part? > >> The idea here is that if a reader came after we've incremented out >> percpu counter then it would have blocked, the writer would see that and >> invoke btrfs_drw_write_unlock which will decrement the percpu counter >> and will wakeup the reader that is now blocked on pending_readers. > > OK, I will await your next version. > > Thanx, Paul > >>> o btrfs_drw_write_lock() therefore does its wait_event(). >>> Because lock->readers is nonzero, it blocks. >>> >>> o Both tasks are now blocked. In the absence of future calls >>> to these functions (and perhaps even given such future calls), >>> we have deadlock. >>> >>> So what am I missing here?
>>> >>> Thanx, Paul >>> >>>> --- >>>> fs/btrfs/Makefile | 2 +- >>>> fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ >>>> fs/btrfs/drw_lock.h | 23 +++++++++++++++ >>>> 3 files changed, 95 insertions(+), 1 deletion(-) >>>> create mode 100644 fs/btrfs/drw_lock.c >>>> create mode 100644 fs/btrfs/drw_lock.h >>>> >>>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile >>>> index ca693dd554e9..dc60127791e6 100644 >>>> --- a/fs/btrfs/Makefile >>>> +++ b/fs/btrfs/Makefile >>>> @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ >>>> export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ >>>> compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ >>>> reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ >>>> - uuid-tree.o props.o free-space-tree.o tree-checker.o >>>> + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o >>>> >>>> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o >>>> btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o >>>> diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c >>>> new file mode 100644 >>>> index 000000000000..9681bf7544be >>>> --- /dev/null >>>> +++ b/fs/btrfs/drw_lock.c >>>> @@ -0,0 +1,71 @@ >>>> +#include "drw_lock.h" >>>> +#include "ctree.h" >>>> + >>>> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) >>>> +{ >>>> + atomic_set(&lock->readers, 0); >>>> + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); >>>> + init_waitqueue_head(&lock->pending_readers); >>>> + init_waitqueue_head(&lock->pending_writers); >>>> +} >>>> + >>>> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) >>>> +{ >>>> + percpu_counter_destroy(&lock->writers); >>>> +} >>>> + >>>> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) >>>> +{ >>>> + if (atomic_read(&lock->readers)) >>>> + return false; >>>> + >>>> + percpu_counter_inc(&lock->writers); >>>> + >>>> + /* >>>> + * Ensure writers count is updated before we check for >>>> + * 
pending readers >>>> + */ >>>> + smp_mb(); >>>> + if (atomic_read(&lock->readers)) { >>>> + btrfs_drw_read_unlock(lock); >>>> + return false; >>>> + } >>>> + >>>> + return true; >>>> +} >>>> + >>>> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) >>>> +{ >>>> + while(true) { >>>> + if (btrfs_drw_try_write_lock(lock)) >>>> + return; >>>> + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); >>>> + } >>>> +} >>>> + >>>> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) >>>> +{ >>>> + percpu_counter_dec(&lock->writers); >>>> + cond_wake_up(&lock->pending_readers); >>>> +} >>>> + >>>> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) >>>> +{ >>>> + atomic_inc(&lock->readers); >>>> + smp_mb__after_atomic(); >>>> + >>>> + wait_event(lock->pending_readers, >>>> + percpu_counter_sum(&lock->writers) == 0); >>>> +} >>>> + >>>> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) >>>> +{ >>>> + /* >>>> + * Atomic RMW operations imply full barrier, so woken up writers >>>> + * are guaranteed to see the decrement >>>> + */ >>>> + if (atomic_dec_and_test(&lock->readers)) >>>> + wake_up(&lock->pending_writers); >>>> +} >>>> + >>>> + >>>> diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h >>>> new file mode 100644 >>>> index 000000000000..baff59561c06 >>>> --- /dev/null >>>> +++ b/fs/btrfs/drw_lock.h >>>> @@ -0,0 +1,23 @@ >>>> +#ifndef BTRFS_DRW_LOCK_H >>>> +#define BTRFS_DRW_LOCK_H >>>> + >>>> +#include <linux/atomic.h> >>>> +#include <linux/wait.h> >>>> +#include <linux/percpu_counter.h> >>>> + >>>> +struct btrfs_drw_lock { >>>> + atomic_t readers; >>>> + struct percpu_counter writers; >>>> + wait_queue_head_t pending_writers; >>>> + wait_queue_head_t pending_readers; >>>> +}; >>>> + >>>> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); >>>> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); >>>> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); >>>> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); >>>> 
+void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); >>>> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); >>>> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); >>>> + >>>> +#endif >>>> -- >>>> 2.17.1 >>>> >>> >>> >> > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-08 15:44 ` Nikolay Borisov @ 2019-06-08 16:06 ` Paul E. McKenney 2019-06-08 16:21 ` Nikolay Borisov 0 siblings, 1 reply; 26+ messages in thread From: Paul E. McKenney @ 2019-06-08 16:06 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz On Sat, Jun 08, 2019 at 06:44:17PM +0300, Nikolay Borisov wrote: > On 8.06.19 г. 18:13 ч., Paul E. McKenney wrote: > > On Fri, Jun 07, 2019 at 02:59:34PM +0300, Nikolay Borisov wrote: > >> On 7.06.19 г. 13:52 ч., Paul E. McKenney wrote: > >>> On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote: > >>>> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows > >>>> to have multiple readers or multiple writers but not multiple readers > >>>> and writers holding it concurrently. The code is factored out from > >>>> the existing open-coded locking scheme used to exclude pending > >>>> snapshots from nocow writers and vice-versa. Current implementation > >>>> actually favors Readers (that is snapshot creaters) to writers (nocow > >>>> writers of the filesystem). > >>>> > >>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com> > >>> > >>> A preliminary question... > >>> > >>> What prevents the following sequence of events from happening? > >>> > >>> o btrfs_drw_write_lock() invokes btrfs_drw_try_write_lock(), > >>> which sees that lock->readers is zero and thus executes > >>> percpu_counter_inc(&lock->writers). > >>> > >>> o btrfs_drw_read_lock() increments lock->readers, does the > >>> smp_mb__after_atomic(), and then does the wait_event(). > >>> Because btrfs_drw_try_write_lock() incremented its CPU's > >>> lock->writers, the sum is the value one, so it blocks. > >>> > >>> o btrfs_drw_try_write_lock() checks lock->readers, sees that > >>> it is now nonzero, and thus invokes btrfs_drw_read_unlock() > >>> (which decrements the current CPU's counter, so that a future > >>> sum would get zero), and returns false. 
> >> > >> btrfs_drw_read_unlock is actually btrfs_drw_write_unlock, my bad, Filipe > >> already pointed that out and I've fixed it. > > > > Ah! I must then ask what you are using to test this. kernel/locktorture.c? Right... Make that kernel/locking/locktorture.c > At the moment - nothing. I rely on the fact that the original code I > extracted that from is bug-free (ha-ha). So perhahps hooking up > locktorture seems like a good suggestion. From a quick look I guess I > could mostly model that lock against the rwsem. The question is how do I > model the trylock semantics as well as the "double" part? Implementing a correct synchronization primitive is like committing the perfect crime. There are at least 50 things that can go wrong, and if you are a highly experienced genius, you -might- be able to anticipate and handle 25 of them. (With apologies to any Kathleen Turner fans who might still be alive.) Please note that this still applies to code ported from somewhere else because different environments likely have different assumptions and properties. Therefore, heavy-duty stress testing is not optional. In fact, formal verification is becoming non-optional as well -- please see Catalin Marinas's work on verifying the Linux kernel's queued spinlock for an example. You are right, current locktorture would get upset about having concurrent writers. To teach locktorture about this, I suggest adding a flag to the lock_torture_ops structure named something like concurrent_write, but hopefully shorter. Then this flag can be used to disable the "only one writer" check in lock_torture_writer(). Seem reasonable? Thanx, Paul > >> The idea here is that if a reader came after we've incremented out > >> percpu counter then it would have blocked, the writer would see that and > >> invoke btrfs_drw_write_unlock which will decrement the percpu counter > >> and will wakeup the reader that is now blocked on pending_readers. > > > > OK, I will await your next version. 
> > > > Thanx, Paul > > > >>> o btrfs_drw_write_lock() therefore does its wait_event(). > >>> Because lock->readers is nonzero, it blocks. > >>> > >>> o Both tasks are now blocked. In the absence of future calls > >>> to these functions (and perhaps even given such future calls), > >>> we have deadlock. > >>> > >>> So what am I missing here? > >>> > >>> Thanx, Paul > >>> > >>>> --- > >>>> fs/btrfs/Makefile | 2 +- > >>>> fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ > >>>> fs/btrfs/drw_lock.h | 23 +++++++++++++++ > >>>> 3 files changed, 95 insertions(+), 1 deletion(-) > >>>> create mode 100644 fs/btrfs/drw_lock.c > >>>> create mode 100644 fs/btrfs/drw_lock.h > >>>> > >>>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > >>>> index ca693dd554e9..dc60127791e6 100644 > >>>> --- a/fs/btrfs/Makefile > >>>> +++ b/fs/btrfs/Makefile > >>>> @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ > >>>> export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ > >>>> compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ > >>>> reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ > >>>> - uuid-tree.o props.o free-space-tree.o tree-checker.o > >>>> + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o > >>>> > >>>> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o > >>>> btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o > >>>> diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c > >>>> new file mode 100644 > >>>> index 000000000000..9681bf7544be > >>>> --- /dev/null > >>>> +++ b/fs/btrfs/drw_lock.c > >>>> @@ -0,0 +1,71 @@ > >>>> +#include "drw_lock.h" > >>>> +#include "ctree.h" > >>>> + > >>>> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) > >>>> +{ > >>>> + atomic_set(&lock->readers, 0); > >>>> + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); > >>>> + init_waitqueue_head(&lock->pending_readers); > >>>> + 
init_waitqueue_head(&lock->pending_writers); > >>>> +} > >>>> + > >>>> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) > >>>> +{ > >>>> + percpu_counter_destroy(&lock->writers); > >>>> +} > >>>> + > >>>> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) > >>>> +{ > >>>> + if (atomic_read(&lock->readers)) > >>>> + return false; > >>>> + > >>>> + percpu_counter_inc(&lock->writers); > >>>> + > >>>> + /* > >>>> + * Ensure writers count is updated before we check for > >>>> + * pending readers > >>>> + */ > >>>> + smp_mb(); > >>>> + if (atomic_read(&lock->readers)) { > >>>> + btrfs_drw_read_unlock(lock); > >>>> + return false; > >>>> + } > >>>> + > >>>> + return true; > >>>> +} > >>>> + > >>>> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) > >>>> +{ > >>>> + while(true) { > >>>> + if (btrfs_drw_try_write_lock(lock)) > >>>> + return; > >>>> + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); > >>>> + } > >>>> +} > >>>> + > >>>> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) > >>>> +{ > >>>> + percpu_counter_dec(&lock->writers); > >>>> + cond_wake_up(&lock->pending_readers); > >>>> +} > >>>> + > >>>> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) > >>>> +{ > >>>> + atomic_inc(&lock->readers); > >>>> + smp_mb__after_atomic(); > >>>> + > >>>> + wait_event(lock->pending_readers, > >>>> + percpu_counter_sum(&lock->writers) == 0); > >>>> +} > >>>> + > >>>> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) > >>>> +{ > >>>> + /* > >>>> + * Atomic RMW operations imply full barrier, so woken up writers > >>>> + * are guaranteed to see the decrement > >>>> + */ > >>>> + if (atomic_dec_and_test(&lock->readers)) > >>>> + wake_up(&lock->pending_writers); > >>>> +} > >>>> + > >>>> + > >>>> diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h > >>>> new file mode 100644 > >>>> index 000000000000..baff59561c06 > >>>> --- /dev/null > >>>> +++ b/fs/btrfs/drw_lock.h > >>>> @@ -0,0 +1,23 @@ > >>>> +#ifndef 
BTRFS_DRW_LOCK_H > >>>> +#define BTRFS_DRW_LOCK_H > >>>> + > >>>> +#include <linux/atomic.h> > >>>> +#include <linux/wait.h> > >>>> +#include <linux/percpu_counter.h> > >>>> + > >>>> +struct btrfs_drw_lock { > >>>> + atomic_t readers; > >>>> + struct percpu_counter writers; > >>>> + wait_queue_head_t pending_writers; > >>>> + wait_queue_head_t pending_readers; > >>>> +}; > >>>> + > >>>> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); > >>>> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); > >>>> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); > >>>> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); > >>>> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); > >>>> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); > >>>> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); > >>>> + > >>>> +#endif > >>>> -- > >>>> 2.17.1 > >>>> > >>> > >>> > >> > > > > > ^ permalink raw reply [flat|nested] 26+ messages in thread
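Paul's locktorture suggestion could look roughly like the sketch below. The field and function names only approximate kernel/locking/locktorture.c and are assumptions for illustration, not the real torture-test interface:

```c
#include <stdbool.h>

/* Hypothetical slice of the ops table, with the suggested flag. */
struct lock_torture_ops {
	int (*writelock)(void);
	void (*writeunlock)(void);
	bool concurrent_write;	/* allow >1 writer without reporting a bug */
	const char *name;
};

/* The "only one writer" sanity check in lock_torture_writer(), gated
 * by the new flag so DRW-style locks are not flagged as broken. */
static bool writer_count_is_buggy(const struct lock_torture_ops *ops,
				  int nested_writers)
{
	if (ops->concurrent_write)
		return false;
	return nested_writers > 1;
}
```

With a DRW ops entry setting concurrent_write = true, two simultaneous writers pass the check; with a plain mutex's ops (flag false), the same situation is still reported as a bug.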
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-08 16:06 ` Paul E. McKenney @ 2019-06-08 16:21 ` Nikolay Borisov 2019-06-08 16:39 ` Paul E. McKenney 0 siblings, 1 reply; 26+ messages in thread From: Nikolay Borisov @ 2019-06-08 16:21 UTC (permalink / raw) To: paulmck; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz On 8.06.19 г. 19:06 ч., Paul E. McKenney wrote: > On Sat, Jun 08, 2019 at 06:44:17PM +0300, Nikolay Borisov wrote: >> On 8.06.19 г. 18:13 ч., Paul E. McKenney wrote: >>> On Fri, Jun 07, 2019 at 02:59:34PM +0300, Nikolay Borisov wrote: >>>> On 7.06.19 г. 13:52 ч., Paul E. McKenney wrote: >>>>> On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote: >>>>>> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows >>>>>> to have multiple readers or multiple writers but not multiple readers >>>>>> and writers holding it concurrently. The code is factored out from >>>>>> the existing open-coded locking scheme used to exclude pending >>>>>> snapshots from nocow writers and vice-versa. Current implementation >>>>>> actually favors Readers (that is snapshot creaters) to writers (nocow >>>>>> writers of the filesystem). >>>>>> >>>>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com> >>>>> >>>>> A preliminary question... >>>>> >>>>> What prevents the following sequence of events from happening? >>>>> >>>>> o btrfs_drw_write_lock() invokes btrfs_drw_try_write_lock(), >>>>> which sees that lock->readers is zero and thus executes >>>>> percpu_counter_inc(&lock->writers). >>>>> >>>>> o btrfs_drw_read_lock() increments lock->readers, does the >>>>> smp_mb__after_atomic(), and then does the wait_event(). >>>>> Because btrfs_drw_try_write_lock() incremented its CPU's >>>>> lock->writers, the sum is the value one, so it blocks. 
>>>>> >>>>> o btrfs_drw_try_write_lock() checks lock->readers, sees that >>>>> it is now nonzero, and thus invokes btrfs_drw_read_unlock() >>>>> (which decrements the current CPU's counter, so that a future >>>>> sum would get zero), and returns false. >>>> >>>> btrfs_drw_read_unlock is actually btrfs_drw_write_unlock, my bad, Filipe >>>> already pointed that out and I've fixed it. >>> >>> Ah! I must then ask what you are using to test this. kernel/locktorture.c? > > Right... Make that kernel/locking/locktorture.c > >> At the moment - nothing. I rely on the fact that the original code I >> extracted that from is bug-free (ha-ha). So perhahps hooking up >> locktorture seems like a good suggestion. From a quick look I guess I >> could mostly model that lock against the rwsem. The question is how do I >> model the trylock semantics as well as the "double" part? > > Implementing a correct synchronization primitive is like committing the > perfect crime. There are at least 50 things that can go wrong, and if > you are a highly experienced genius, you -might- be able to anticipate > and handle 25 of them. (With apologies to any Kathleen Turner fans who > might still be alive.) Please note that this still applies to code > ported from somewhere else because different environments likely have > different assumptions and properties. I agree, I'm far from thinking that the locking scheme is actually bug free (hence the 'ha-ha') I'm not that arrogant (yet). > > Therefore, heavy-duty stress testing is not optional. In fact, formal > verification is becoming non-optional as well -- please see Catalin > Marinas's work on verifying the Linux kernel's queued spinlock for > an example. I assume you are referring to "Formal Methods for kernel hackers"? If so, TLA+ has been on my radar ever since https://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf . However I've yet to invest the time to be able to properly model a real protocol (be it locking or otherwise) in it. 
Perhaps I could use the DRW lock as a learning opportunity, we'll see.

> You are right, current locktorture would get upset about having concurrent
> writers. To teach locktorture about this, I suggest adding a flag to
> the lock_torture_ops structure named something like concurrent_write,
> but hopefully shorter. Then this flag can be used to disable the "only
> one writer" check in lock_torture_writer().
>
> Seem reasonable?

Good idea, I'll see to extending locktorture, but this will happen in a
week or so because I'm about to go on holiday.

> 							Thanx, Paul
>
>>>> The idea here is that if a reader came after we've incremented our
>>>> percpu counter then it would have blocked, the writer would see that and
>>>> invoke btrfs_drw_write_unlock, which will decrement the percpu counter
>>>> and wake up the reader that is now blocked on pending_readers.
>>>
>>> OK, I will await your next version.
>>>
>>> 							Thanx, Paul
>>>
>>>>> o	btrfs_drw_write_lock() therefore does its wait_event().
>>>>>	Because lock->readers is nonzero, it blocks.
>>>>>
>>>>> o	Both tasks are now blocked.  In the absence of future calls
>>>>>	to these functions (and perhaps even given such future calls),
>>>>>	we have deadlock.
>>>>>
>>>>> So what am I missing here?
>>>>> >>>>> Thanx, Paul >>>>> >>>>>> --- >>>>>> fs/btrfs/Makefile | 2 +- >>>>>> fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ >>>>>> fs/btrfs/drw_lock.h | 23 +++++++++++++++ >>>>>> 3 files changed, 95 insertions(+), 1 deletion(-) >>>>>> create mode 100644 fs/btrfs/drw_lock.c >>>>>> create mode 100644 fs/btrfs/drw_lock.h >>>>>> >>>>>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile >>>>>> index ca693dd554e9..dc60127791e6 100644 >>>>>> --- a/fs/btrfs/Makefile >>>>>> +++ b/fs/btrfs/Makefile >>>>>> @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ >>>>>> export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ >>>>>> compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ >>>>>> reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ >>>>>> - uuid-tree.o props.o free-space-tree.o tree-checker.o >>>>>> + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o >>>>>> >>>>>> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o >>>>>> btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o >>>>>> diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c >>>>>> new file mode 100644 >>>>>> index 000000000000..9681bf7544be >>>>>> --- /dev/null >>>>>> +++ b/fs/btrfs/drw_lock.c >>>>>> @@ -0,0 +1,71 @@ >>>>>> +#include "drw_lock.h" >>>>>> +#include "ctree.h" >>>>>> + >>>>>> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) >>>>>> +{ >>>>>> + atomic_set(&lock->readers, 0); >>>>>> + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); >>>>>> + init_waitqueue_head(&lock->pending_readers); >>>>>> + init_waitqueue_head(&lock->pending_writers); >>>>>> +} >>>>>> + >>>>>> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) >>>>>> +{ >>>>>> + percpu_counter_destroy(&lock->writers); >>>>>> +} >>>>>> + >>>>>> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) >>>>>> +{ >>>>>> + if (atomic_read(&lock->readers)) >>>>>> + return false; >>>>>> + >>>>>> + 
percpu_counter_inc(&lock->writers); >>>>>> + >>>>>> + /* >>>>>> + * Ensure writers count is updated before we check for >>>>>> + * pending readers >>>>>> + */ >>>>>> + smp_mb(); >>>>>> + if (atomic_read(&lock->readers)) { >>>>>> + btrfs_drw_read_unlock(lock); >>>>>> + return false; >>>>>> + } >>>>>> + >>>>>> + return true; >>>>>> +} >>>>>> + >>>>>> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) >>>>>> +{ >>>>>> + while(true) { >>>>>> + if (btrfs_drw_try_write_lock(lock)) >>>>>> + return; >>>>>> + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); >>>>>> + } >>>>>> +} >>>>>> + >>>>>> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) >>>>>> +{ >>>>>> + percpu_counter_dec(&lock->writers); >>>>>> + cond_wake_up(&lock->pending_readers); >>>>>> +} >>>>>> + >>>>>> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) >>>>>> +{ >>>>>> + atomic_inc(&lock->readers); >>>>>> + smp_mb__after_atomic(); >>>>>> + >>>>>> + wait_event(lock->pending_readers, >>>>>> + percpu_counter_sum(&lock->writers) == 0); >>>>>> +} >>>>>> + >>>>>> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) >>>>>> +{ >>>>>> + /* >>>>>> + * Atomic RMW operations imply full barrier, so woken up writers >>>>>> + * are guaranteed to see the decrement >>>>>> + */ >>>>>> + if (atomic_dec_and_test(&lock->readers)) >>>>>> + wake_up(&lock->pending_writers); >>>>>> +} >>>>>> + >>>>>> + >>>>>> diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h >>>>>> new file mode 100644 >>>>>> index 000000000000..baff59561c06 >>>>>> --- /dev/null >>>>>> +++ b/fs/btrfs/drw_lock.h >>>>>> @@ -0,0 +1,23 @@ >>>>>> +#ifndef BTRFS_DRW_LOCK_H >>>>>> +#define BTRFS_DRW_LOCK_H >>>>>> + >>>>>> +#include <linux/atomic.h> >>>>>> +#include <linux/wait.h> >>>>>> +#include <linux/percpu_counter.h> >>>>>> + >>>>>> +struct btrfs_drw_lock { >>>>>> + atomic_t readers; >>>>>> + struct percpu_counter writers; >>>>>> + wait_queue_head_t pending_writers; >>>>>> + wait_queue_head_t pending_readers; >>>>>> +}; >>>>>> 
+ >>>>>> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); >>>>>> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); >>>>>> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); >>>>>> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); >>>>>> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); >>>>>> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); >>>>>> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); >>>>>> + >>>>>> +#endif >>>>>> -- >>>>>> 2.17.1 >>>>>> >>>>> >>>>> >>>> >>> >>> >> > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-08 16:21 ` Nikolay Borisov @ 2019-06-08 16:39 ` Paul E. McKenney 0 siblings, 0 replies; 26+ messages in thread From: Paul E. McKenney @ 2019-06-08 16:39 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz On Sat, Jun 08, 2019 at 07:21:53PM +0300, Nikolay Borisov wrote: > On 8.06.19 г. 19:06 ч., Paul E. McKenney wrote: > > On Sat, Jun 08, 2019 at 06:44:17PM +0300, Nikolay Borisov wrote: > >> On 8.06.19 г. 18:13 ч., Paul E. McKenney wrote: > >>> On Fri, Jun 07, 2019 at 02:59:34PM +0300, Nikolay Borisov wrote: > >>>> On 7.06.19 г. 13:52 ч., Paul E. McKenney wrote: > >>>>> On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote: > >>>>>> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows > >>>>>> to have multiple readers or multiple writers but not multiple readers > >>>>>> and writers holding it concurrently. The code is factored out from > >>>>>> the existing open-coded locking scheme used to exclude pending > >>>>>> snapshots from nocow writers and vice-versa. Current implementation > >>>>>> actually favors Readers (that is snapshot creaters) to writers (nocow > >>>>>> writers of the filesystem). > >>>>>> > >>>>>> Signed-off-by: Nikolay Borisov <nborisov@suse.com> > >>>>> > >>>>> A preliminary question... > >>>>> > >>>>> What prevents the following sequence of events from happening? > >>>>> > >>>>> o btrfs_drw_write_lock() invokes btrfs_drw_try_write_lock(), > >>>>> which sees that lock->readers is zero and thus executes > >>>>> percpu_counter_inc(&lock->writers). > >>>>> > >>>>> o btrfs_drw_read_lock() increments lock->readers, does the > >>>>> smp_mb__after_atomic(), and then does the wait_event(). > >>>>> Because btrfs_drw_try_write_lock() incremented its CPU's > >>>>> lock->writers, the sum is the value one, so it blocks. 
> >>>>> > >>>>> o btrfs_drw_try_write_lock() checks lock->readers, sees that > >>>>> it is now nonzero, and thus invokes btrfs_drw_read_unlock() > >>>>> (which decrements the current CPU's counter, so that a future > >>>>> sum would get zero), and returns false. > >>>> > >>>> btrfs_drw_read_unlock is actually btrfs_drw_write_unlock, my bad, Filipe > >>>> already pointed that out and I've fixed it. > >>> > >>> Ah! I must then ask what you are using to test this. kernel/locktorture.c? > > > > Right... Make that kernel/locking/locktorture.c > > > >> At the moment - nothing. I rely on the fact that the original code I > >> extracted that from is bug-free (ha-ha). So perhahps hooking up > >> locktorture seems like a good suggestion. From a quick look I guess I > >> could mostly model that lock against the rwsem. The question is how do I > >> model the trylock semantics as well as the "double" part? > > > > Implementing a correct synchronization primitive is like committing the > > perfect crime. There are at least 50 things that can go wrong, and if > > you are a highly experienced genius, you -might- be able to anticipate > > and handle 25 of them. (With apologies to any Kathleen Turner fans who > > might still be alive.) Please note that this still applies to code > > ported from somewhere else because different environments likely have > > different assumptions and properties. > > I agree, I'm far from thinking that the locking scheme is actually bug > free (hence the 'ha-ha') I'm not that arrogant (yet). ;-) ;-) ;-) > > Therefore, heavy-duty stress testing is not optional. In fact, formal > > verification is becoming non-optional as well -- please see Catalin > > Marinas's work on verifying the Linux kernel's queued spinlock for > > an example. > > I assume you are referring to "Formal Methods for kernel hackers"? If > so, TLA+ has been on my radar ever since > https://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf . Yes, and good to hear. 
There are other options, including Promela/spin, cbmc, and so on. > However I've yet to invest the time to be able to properly model a real > protocol (be it locking or otherwise) in it. Perhahps I could use the > DRW lock as a learning opportunity, we'll see. The learning would likely be reusable, and might pay for itself in terms of bugs found more quickly and with less effort. Mileage can vary, as always, of course. > > You are right, current locktorture would get upset about having concurrent > > writers. To teach locktorture about this, I suggest adding a flag to > > the lock_torture_ops structure named something like concurrent_write, > > but hopefully shorter. Then this flag can be used to disable the "only > > one writer" check in lock_torture_writer(). > > > > Seem reasonable? > > Good idea, I'll see to extending lock-torture but this will happen in a > week or so because I'm about to go on a holiday. Have a great holiday, and looking forward to seeing your next version and locktorture modifications. Thanx, Paul > >>>> The idea here is that if a reader came after we've incremented out > >>>> percpu counter then it would have blocked, the writer would see that and > >>>> invoke btrfs_drw_write_unlock which will decrement the percpu counter > >>>> and will wakeup the reader that is now blocked on pending_readers. > >>> > >>> OK, I will await your next version. > >>> > >>> Thanx, Paul > >>> > >>>>> o btrfs_drw_write_lock() therefore does its wait_event(). > >>>>> Because lock->readers is nonzero, it blocks. > >>>>> > >>>>> o Both tasks are now blocked. In the absence of future calls > >>>>> to these functions (and perhaps even given such future calls), > >>>>> we have deadlock. > >>>>> > >>>>> So what am I missing here? 
> >>>>> > >>>>> Thanx, Paul > >>>>> > >>>>>> --- > >>>>>> fs/btrfs/Makefile | 2 +- > >>>>>> fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ > >>>>>> fs/btrfs/drw_lock.h | 23 +++++++++++++++ > >>>>>> 3 files changed, 95 insertions(+), 1 deletion(-) > >>>>>> create mode 100644 fs/btrfs/drw_lock.c > >>>>>> create mode 100644 fs/btrfs/drw_lock.h > >>>>>> > >>>>>> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile > >>>>>> index ca693dd554e9..dc60127791e6 100644 > >>>>>> --- a/fs/btrfs/Makefile > >>>>>> +++ b/fs/btrfs/Makefile > >>>>>> @@ -10,7 +10,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ > >>>>>> export.o tree-log.o free-space-cache.o zlib.o lzo.o zstd.o \ > >>>>>> compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ > >>>>>> reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ > >>>>>> - uuid-tree.o props.o free-space-tree.o tree-checker.o > >>>>>> + uuid-tree.o props.o free-space-tree.o tree-checker.o drw_lock.o > >>>>>> > >>>>>> btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o > >>>>>> btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o > >>>>>> diff --git a/fs/btrfs/drw_lock.c b/fs/btrfs/drw_lock.c > >>>>>> new file mode 100644 > >>>>>> index 000000000000..9681bf7544be > >>>>>> --- /dev/null > >>>>>> +++ b/fs/btrfs/drw_lock.c > >>>>>> @@ -0,0 +1,71 @@ > >>>>>> +#include "drw_lock.h" > >>>>>> +#include "ctree.h" > >>>>>> + > >>>>>> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock) > >>>>>> +{ > >>>>>> + atomic_set(&lock->readers, 0); > >>>>>> + percpu_counter_init(&lock->writers, 0, GFP_KERNEL); > >>>>>> + init_waitqueue_head(&lock->pending_readers); > >>>>>> + init_waitqueue_head(&lock->pending_writers); > >>>>>> +} > >>>>>> + > >>>>>> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock) > >>>>>> +{ > >>>>>> + percpu_counter_destroy(&lock->writers); > >>>>>> +} > >>>>>> + > >>>>>> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock) > >>>>>> +{ > 
>>>>>> + if (atomic_read(&lock->readers)) > >>>>>> + return false; > >>>>>> + > >>>>>> + percpu_counter_inc(&lock->writers); > >>>>>> + > >>>>>> + /* > >>>>>> + * Ensure writers count is updated before we check for > >>>>>> + * pending readers > >>>>>> + */ > >>>>>> + smp_mb(); > >>>>>> + if (atomic_read(&lock->readers)) { > >>>>>> + btrfs_drw_read_unlock(lock); > >>>>>> + return false; > >>>>>> + } > >>>>>> + > >>>>>> + return true; > >>>>>> +} > >>>>>> + > >>>>>> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock) > >>>>>> +{ > >>>>>> + while(true) { > >>>>>> + if (btrfs_drw_try_write_lock(lock)) > >>>>>> + return; > >>>>>> + wait_event(lock->pending_writers, !atomic_read(&lock->readers)); > >>>>>> + } > >>>>>> +} > >>>>>> + > >>>>>> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock) > >>>>>> +{ > >>>>>> + percpu_counter_dec(&lock->writers); > >>>>>> + cond_wake_up(&lock->pending_readers); > >>>>>> +} > >>>>>> + > >>>>>> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock) > >>>>>> +{ > >>>>>> + atomic_inc(&lock->readers); > >>>>>> + smp_mb__after_atomic(); > >>>>>> + > >>>>>> + wait_event(lock->pending_readers, > >>>>>> + percpu_counter_sum(&lock->writers) == 0); > >>>>>> +} > >>>>>> + > >>>>>> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock) > >>>>>> +{ > >>>>>> + /* > >>>>>> + * Atomic RMW operations imply full barrier, so woken up writers > >>>>>> + * are guaranteed to see the decrement > >>>>>> + */ > >>>>>> + if (atomic_dec_and_test(&lock->readers)) > >>>>>> + wake_up(&lock->pending_writers); > >>>>>> +} > >>>>>> + > >>>>>> + > >>>>>> diff --git a/fs/btrfs/drw_lock.h b/fs/btrfs/drw_lock.h > >>>>>> new file mode 100644 > >>>>>> index 000000000000..baff59561c06 > >>>>>> --- /dev/null > >>>>>> +++ b/fs/btrfs/drw_lock.h > >>>>>> @@ -0,0 +1,23 @@ > >>>>>> +#ifndef BTRFS_DRW_LOCK_H > >>>>>> +#define BTRFS_DRW_LOCK_H > >>>>>> + > >>>>>> +#include <linux/atomic.h> > >>>>>> +#include <linux/wait.h> > >>>>>> +#include <linux/percpu_counter.h> > 
>>>>>> + > >>>>>> +struct btrfs_drw_lock { > >>>>>> + atomic_t readers; > >>>>>> + struct percpu_counter writers; > >>>>>> + wait_queue_head_t pending_writers; > >>>>>> + wait_queue_head_t pending_readers; > >>>>>> +}; > >>>>>> + > >>>>>> +void btrfs_drw_lock_init(struct btrfs_drw_lock *lock); > >>>>>> +void btrfs_drw_lock_destroy(struct btrfs_drw_lock *lock); > >>>>>> +void btrfs_drw_write_lock(struct btrfs_drw_lock *lock); > >>>>>> +bool btrfs_drw_try_write_lock(struct btrfs_drw_lock *lock); > >>>>>> +void btrfs_drw_write_unlock(struct btrfs_drw_lock *lock); > >>>>>> +void btrfs_drw_read_lock(struct btrfs_drw_lock *lock); > >>>>>> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock); > >>>>>> + > >>>>>> +#endif > >>>>>> -- > >>>>>> 2.17.1 > >>>>>> > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock
  2019-06-06 13:52 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov
  2019-06-06 15:15   ` Filipe Manana
  2019-06-07 10:52   ` Paul E. McKenney
@ 2019-06-08 16:33   ` Andrea Parri
  2019-06-12 14:05   ` David Sterba
  3 siblings, 0 replies; 26+ messages in thread
From: Andrea Parri @ 2019-06-08 16:33 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: linux-btrfs, linux-kernel, peterz, paulmck

On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote:
> A (D)ouble (R)eader (W)riter lock is a locking primitive that allows
> to have multiple readers or multiple writers but not multiple readers
> and writers holding it concurrently. The code is factored out from
> the existing open-coded locking scheme used to exclude pending
> snapshots from nocow writers and vice-versa. Current implementation
> actually favors Readers (that is snapshot creaters) to writers (nocow
> writers of the filesystem).
>
> Signed-off-by: Nikolay Borisov <nborisov@suse.com>

Interesting! Thank you for sending this over, Nikolay. I only have a
couple of nits (below) to add:

[...]

> +
> +void btrfs_drw_read_unlock(struct btrfs_drw_lock *lock)
> +{
> +	/*
> +	 * Atomic RMW operations imply full barrier, so woken up writers
> +	 * are guaranteed to see the decrement
> +	 */

Not every atomic RMW operation implies a full barrier (as exemplified,
e.g., by the atomic_inc() in btrfs_drw_read_lock()); maybe simply

  s/Atomic RMW operations imply/atomic_dec_and_test() implies/

FYI, checkpatch.pl issues a few warnings on this patch (you may want to
address some of them in the next version).

Thanks,
  Andrea

> +	if (atomic_dec_and_test(&lock->readers))
> +		wake_up(&lock->pending_writers);
> +}
> +
> +

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH 1/2] btrfs: Implement DRW lock 2019-06-06 13:52 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov ` (2 preceding siblings ...) 2019-06-08 16:33 ` Andrea Parri @ 2019-06-12 14:05 ` David Sterba 3 siblings, 0 replies; 26+ messages in thread From: David Sterba @ 2019-06-12 14:05 UTC (permalink / raw) To: Nikolay Borisov; +Cc: linux-btrfs, linux-kernel, andrea.parri, peterz, paulmck On Thu, Jun 06, 2019 at 04:52:18PM +0300, Nikolay Borisov wrote: > A (D)ouble (R)eader (W)riter lock is a locking primitive that allows > to have multiple readers or multiple writers but not multiple readers > and writers holding it concurrently. The code is factored out from > the existing open-coded locking scheme used to exclude pending > snapshots from nocow writers and vice-versa. Current implementation > actually favors Readers (that is snapshot creaters) to writers (nocow > writers of the filesystem). > > Signed-off-by: Nikolay Borisov <nborisov@suse.com> > --- > fs/btrfs/Makefile | 2 +- > fs/btrfs/drw_lock.c | 71 +++++++++++++++++++++++++++++++++++++++++++++ > fs/btrfs/drw_lock.h | 23 +++++++++++++++ In v2, please use locking.[ch] instead of new files. ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads: [~2020-02-24 15:32 UTC | newest]

Thread overview: 26+ messages
2020-01-30 12:59 [PATCH 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov
2020-01-30 12:59 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov
2020-01-30 13:41   ` Christoph Hellwig
2020-01-30 13:42     ` Nikolay Borisov
2020-01-30 13:43       ` Christoph Hellwig
2020-01-30 12:59 ` [PATCH 2/2] btrfs: convert snapshot/nocow exlcusion to drw lock Nikolay Borisov
2020-01-30 12:59 ` [RESEND PATCH v2 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov
2020-01-30 14:21   ` David Sterba
2020-02-21 14:51   ` David Sterba
2020-01-30 12:59 ` [PATCH v2 1/2] btrfs: Implement DRW lock Nikolay Borisov
2020-01-30 13:37   ` Nikolay Borisov
2020-01-30 14:06     ` David Sterba
2020-01-30 14:11       ` Nikolay Borisov
2020-01-30 12:59 ` [PATCH v2 2/2] btrfs: convert snapshot/nocow exlcusion to drw lock Nikolay Borisov
-- strict thread matches above, loose matches on Subject: below --
2020-02-24 15:26 [PATCH v3 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov
2020-02-24 15:32 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov
2019-06-06 13:52 [PATCH 0/2] Refactor snapshot vs nocow writers locking Nikolay Borisov
2019-06-06 13:52 ` [PATCH 1/2] btrfs: Implement DRW lock Nikolay Borisov
2019-06-06 15:15   ` Filipe Manana
2019-06-07 10:52   ` Paul E. McKenney
2019-06-07 11:59     ` Nikolay Borisov
2019-06-08 15:13       ` Paul E. McKenney
2019-06-08 15:44         ` Nikolay Borisov
2019-06-08 16:06           ` Paul E. McKenney
2019-06-08 16:21             ` Nikolay Borisov
2019-06-08 16:39               ` Paul E. McKenney
2019-06-08 16:33   ` Andrea Parri
2019-06-12 14:05   ` David Sterba