* [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Jia-Ju Bai @ 2017-10-06  8:59 UTC
To: viro, torbjorn.lindh, rgooch
Cc: linux-fsdevel, linux-kernel

According to fs/super.c, the kernel may sleep under a spinlock.
The function call path is:

  put_super (acquire the spinlock)
    __put_super
      destroy_super
        list_lru_destroy
          list_lru_unregister
            mutex_lock --> may sleep
          memcg_get_cache_ids
            down_read --> may sleep

This bug is found by my static analysis tool and my code review.

Thanks,
Jia-Ju Bai
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Michal Hocko @ 2017-10-06  9:06 UTC
To: Jia-Ju Bai
Cc: viro, torbjorn.lindh, rgooch, linux-fsdevel, linux-kernel, Vladimir Davydov

[CC Vladimir]

On Fri 06-10-17 16:59:18, Jia-Ju Bai wrote:
> According to fs/super.c, the kernel may sleep under a spinlock.
> The function call path is:
>   put_super (acquire the spinlock)
>     __put_super
>       destroy_super
>         list_lru_destroy
>           list_lru_unregister
>             mutex_lock --> may sleep
>           memcg_get_cache_ids
>             down_read --> may sleep
>
> This bug is found by my static analysis tool and my code review.

-- 
Michal Hocko
SUSE Labs
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Vladimir Davydov @ 2017-10-07 11:56 UTC
To: Michal Hocko
Cc: Jia-Ju Bai, viro, torbjorn.lindh, rgooch, linux-fsdevel, linux-kernel

Hello,

On Fri, Oct 06, 2017 at 11:06:04AM +0200, Michal Hocko wrote:
> On Fri 06-10-17 16:59:18, Jia-Ju Bai wrote:
> > According to fs/super.c, the kernel may sleep under a spinlock.
> > The function call path is:
> >   put_super (acquire the spinlock)
> >     __put_super
> >       destroy_super
> >         list_lru_destroy
> >           list_lru_unregister
> >             mutex_lock --> may sleep
> >           memcg_get_cache_ids
> >             down_read --> may sleep
> >
> > This bug is found by my static analysis tool and my code review.

This is a false positive: by the time we get to destroy_super(), the lru
lists have already been destroyed - see deactivate_locked_super() - so
list_lru_destroy() will return right away without attempting to take any
locks. That's why there are no lockdep warnings regarding this issue.

I think we could move list_lru_destroy() to destroy_super_work() to
suppress this warning. Not sure it's really worth the trouble, though.

Thanks,
Vladimir
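[Editorial note: the no-op-on-second-call behaviour Vladimir describes can be sketched like this. This is a toy model with hypothetical types, not the kernel's list_lru API; in the real code the guard is the state cleared by the first, sleep-capable destroy.]

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sketch of the false positive: deactivate_locked_super() destroys
 * the lru lists first, in a context that may sleep; by the time
 * destroy_super() runs under the spinlock, a second destroy sees the
 * cleared state and returns before touching any mutex. */
struct toy_lru {
	int *node;	/* non-NULL only between init and the first destroy */
};

static int toy_lru_init(struct toy_lru *lru)
{
	lru->node = calloc(1, sizeof(*lru->node));
	return lru->node ? 0 : -1;
}

static void toy_lru_destroy(struct toy_lru *lru)
{
	if (!lru->node)
		return;	/* already destroyed: bail out before any lock */
	/* the real function would take sleeping locks around here */
	free(lru->node);
	lru->node = NULL;
}
```

The second call never reaches the (conceptual) locking, which is why no lockdep splat is ever seen on this path.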
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Al Viro @ 2017-10-07 17:06 UTC
To: Vladimir Davydov
Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, rgooch, linux-fsdevel, linux-kernel

On Sat, Oct 07, 2017 at 02:56:40PM +0300, Vladimir Davydov wrote:
> [...]
> This is a false positive: by the time we get to destroy_super(), the lru
> lists have already been destroyed - see deactivate_locked_super() - so
> list_lru_destroy() will return right away without attempting to take any
> locks. That's why there are no lockdep warnings regarding this issue.
>
> I think we could move list_lru_destroy() to destroy_super_work() to
> suppress this warning. Not sure it's really worth the trouble, though.

It's a bit trickier than that: for the callers of destroy_super() that run
before the superblock becomes reachable via shared data structures, that
list_lru_destroy() is *not* a no-op - but those callers are not called
under spinlocks.

Locking in mm/list_lru.c looks excessive, but then I'm not well familiar
with that code.
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Al Viro @ 2017-10-07 21:14 UTC
To: Vladimir Davydov
Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sat, Oct 07, 2017 at 06:06:51PM +0100, Al Viro wrote:
> [...]
> It's a bit trickier than that: for the callers of destroy_super() that
> run before the superblock becomes reachable via shared data structures,
> that list_lru_destroy() is *not* a no-op - but those callers are not
> called under spinlocks.
>
> Locking in mm/list_lru.c looks excessive, but then I'm not well familiar
> with that code.

It *is* excessive.

1) Co-allocate struct list_lru and the array of struct list_lru_node
hanging off it. Turn all existing variables and struct members of that
type into pointers. init would allocate and return a pointer, destroy
would free (and leave it to the callers to clear their pointers, of
course).

2) Store the value of memcg_nr_cache_ids as of the last creation or
resize in list_lru. Pass that through to memcg_destroy_list_lru_node().
Result: no need for memcg_get_cache_ids() in list_lru_destroy().

3) Add a static list_lru *currently_resized, protected by
list_lrus_mutex. NULL when memcg_update_all_list_lrus() is not running,
points to the currently resized list_lru when it is.

4) Have list_lru_destroy() check (under list_lrus_mutex) whether it's
being asked to kill the currently resized one. If it is, do

	victim->list.prev->next = victim->list.next;
	victim->list.next->prev = victim->list.prev;
	victim->list.prev = NULL;

and bugger off; otherwise act as now. Turn the loop in
memcg_update_all_list_lrus() into

	mutex_lock(&list_lrus_mutex);
	lru = list_lrus.next;
	while (lru != &list_lrus) {
		currently_resized = list_entry(lru, struct list_lru, list);
		mutex_unlock(&list_lrus_mutex);
		ret = memcg_update_list_lru(lru, old_size, new_size);
		mutex_lock(&list_lrus_mutex);
		if (unlikely(!lru->prev)) {
			lru = lru->next;
			/* free currently_resized as list_lru_destroy() would have */
			continue;
		}
		if (ret)
			goto fail;
		lru = lru->next;
	}
	currently_resized = NULL;
	mutex_unlock(&list_lrus_mutex);

5) Replace list_lrus_mutex with a spinlock. At that point
list_lru_destroy() becomes non-blocking.

I think it should work, but that's from several hours of looking through
that code, so I might be missing something subtle here...
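[Editorial note: the unlink trick in step 4 can be sketched with toy types, not the kernel's struct list_head API. Note that the follow-up message in this thread points out a caveat with freeing the unlinked object; the sketch only shows the pointer manipulation itself.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy doubly linked list mirroring the proposal: list_lru_destroy()
 * drops the victim from the global list but leaves ->next intact and
 * NULLs ->prev as a "freeing deferred to the resizer" marker, so an
 * iterator currently parked on the victim can still advance. */
struct toy_node {
	struct toy_node *prev, *next;
};

static void toy_list_add(struct toy_node *head, struct toy_node *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void toy_destroy_mark(struct toy_node *victim)
{
	victim->prev->next = victim->next;
	victim->next->prev = victim->prev;
	victim->prev = NULL;	/* marker: the resizer must free this one */
}
```

After `toy_destroy_mark()`, a traversal that was paused on the victim can still read `victim->next` to step forward, while new traversals from the head never see it.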
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Al Viro @ 2017-10-08  0:56 UTC
To: Vladimir Davydov
Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sat, Oct 07, 2017 at 10:14:44PM +0100, Al Viro wrote:
> 1) Co-allocate struct list_lru and the array of struct list_lru_node
> hanging off it. Turn all existing variables and struct members of that
> type into pointers. init would allocate and return a pointer, destroy
> would free (and leave it to the callers to clear their pointers, of
> course).

Better yet, keep struct list_lru containing just the pointer to the
list_lru_node array, and put that array into the tail of the same
allocation. That way normal accesses are kept exactly as-is and we don't
need to update the users of that thing at all.

> 4) Have list_lru_destroy() check (under list_lrus_mutex) whether it's
> being asked to kill the currently resized one. If it is, do
>
>	victim->list.prev->next = victim->list.next;
>	victim->list.next->prev = victim->list.prev;
>	victim->list.prev = NULL;

Doesn't work, unfortunately - it needs to stay on the list and be marked
in some other way.

> and bugger off; otherwise act as now. Turn the loop in
> memcg_update_all_list_lrus() into
>
>	mutex_lock(&list_lrus_mutex);
>	lru = list_lrus.next;
>	while (lru != &list_lrus) {
>		currently_resized = list_entry(lru, struct list_lru, list);
>		mutex_unlock(&list_lrus_mutex);
>		ret = memcg_update_list_lru(lru, old_size, new_size);
>		mutex_lock(&list_lrus_mutex);
>		if (unlikely(!lru->prev)) {
>			lru = lru->next;

... because this might very well be pointing to an already freed object.

>			/* free currently_resized as list_lru_destroy() would have */
>			continue;

What's more, we need to be careful about resize vs. drain. Right now it's
on list_lrus_mutex, but if we drop that around the actual resize of an
individual list_lru, we'll need something else. Would there be any
problem if we took memcg_cache_ids_sem shared in memcg_offline_kmem()?

The first problem is not fatal - we can e.g. use the sign of the field
used to store the number of ->memcg_lrus elements (i.e. the stashed value
of memcg_nr_cache_ids at allocation or last resize) to indicate that the
actual freeing is left to the resizer...
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Al Viro @ 2017-10-08  2:03 UTC
To: Vladimir Davydov
Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sun, Oct 08, 2017 at 01:56:08AM +0100, Al Viro wrote:
> What's more, we need to be careful about resize vs. drain. Right now
> it's on list_lrus_mutex, but if we drop that around the actual resize
> of an individual list_lru, we'll need something else. Would there be
> any problem if we took memcg_cache_ids_sem shared in
> memcg_offline_kmem()?
>
> The first problem is not fatal - we can e.g. use the sign of the field
> used to store the number of ->memcg_lrus elements (i.e. the stashed
> value of memcg_nr_cache_ids at allocation or last resize) to indicate
> that the actual freeing is left to the resizer...

Ugh. That spinlock would have to be held over too much work, or bounced
back and forth a lot on memcg shutdowns ;-/ Gets especially nasty if we
want list_lru_destroy() callable from rcu callbacks. Oh, well...

I still suspect that the locking there is too heavy, but it looks like I
don't have a better replacement.

What are the realistic numbers of memcg on a big system?
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Vladimir Davydov @ 2017-10-08 15:47 UTC
To: Al Viro
Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sun, Oct 08, 2017 at 03:03:32AM +0100, Al Viro wrote:
> [...]
> I still suspect that the locking there is too heavy, but it looks like
> I don't have a better replacement.
>
> What are the realistic numbers of memcg on a big system?

Several thousand. I guess we could turn list_lrus_mutex into a spinlock
by making the resize/drain procedures handle list_lru destruction as you
suggested above, but list_lru_destroy() would still have to iterate over
all elements of the list_lru_node->memcg_lrus array to free the
per-memcg objects, which is too heavy to be performed under sb_lock
IMHO.

Thanks,
Vladimir
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Al Viro @ 2017-10-08 21:13 UTC
To: Vladimir Davydov
Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sun, Oct 08, 2017 at 06:47:46PM +0300, Vladimir Davydov wrote:
> On Sun, Oct 08, 2017 at 03:03:32AM +0100, Al Viro wrote:
> > [...]
> > What are the realistic numbers of memcg on a big system?
>
> Several thousand. I guess we could turn list_lrus_mutex into a spinlock
> by making the resize/drain procedures handle list_lru destruction as
> you suggested above, but list_lru_destroy() would still have to iterate
> over all elements of the list_lru_node->memcg_lrus array to free the
> per-memcg objects, which is too heavy to be performed under sb_lock
> IMHO.

Hmm... Some observations:

* struct list_lru_one is fairly small and it doesn't pack well - you get
25% overhead on it. A dedicated kmem_cache might've been worth doing.

* in addition to list_lru_one (all allocated one by one), you have an
array of pointers to them. And that array never shrinks. What's more,
the objects pointed to are only 3 times bigger than the pointers...

* how hot is the memcg destruction codepath? The only real argument in
favour of list instead of hlist is list_splice(); if that's not that
critical, hlist would do just fine. And that drives the size of struct
list_lru_one down to two words, at which point getting rid of that array
of pointers becomes rather attractive. The amount of cacheline bouncing
might be an issue, but then it might be an overall win; can't tell
without experiments. It certainly would've simplified things a whole
lot, including the rollback on allocation failure during resize, etc.
And list_lru_destroy() would get several orders of magnitude cheaper...

* co-allocating ->list with ->node[] is almost certainly worth doing -
AFAICS, it's a clear optimization, no matter whether we do anything else
or not. Loops over list_lrus would be better off without fetching
lru->node for every entry. _And_ the objects containing list_lru
wouldn't be reachable via list_lrus.

As for the fs/super.c side... IMO destroy_super() ought to WARN_ON when
it sees a non-NULL ->node. Let alloc_super()/sget_userns() do those
list_lru_destroy() calls directly for instances that get killed before
becoming reachable via shared data structures; everything else must go
through deactivate_locked_super(). The only reason for
list_lru_destroy() in destroy_super() is the use of the latter for
disposal of never-seen-by-anyone struct super_block instances. Come to
think of that, something like this might be a good approach:

diff --git a/fs/super.c b/fs/super.c
index 166c4ee0d0ed..8ca15415351a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -154,21 +154,19 @@ static void destroy_super_rcu(struct rcu_head *head)
 	schedule_work(&s->destroy_work);
 }
 
-/**
- * destroy_super - frees a superblock
- * @s: superblock to free
- *
- * Frees a superblock.
- */
-static void destroy_super(struct super_block *s)
+/* Free a superblock that has never been seen by anyone */
+static void destroy_unused_super(struct super_block *s)
 {
+	if (!s)
+		return;
+	up_write(&s->s_umount);
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
 	security_sb_free(s);
-	WARN_ON(!list_empty(&s->s_mounts));
 	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
-	call_rcu(&s->rcu, destroy_super_rcu);
+	/* no delays needed */
+	destroy_super_work(&s->destroy_work);
 }
 
 /**
@@ -256,7 +254,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 	return s;
 
 fail:
-	destroy_super(s);
+	destroy_unused_super(s);
 	return NULL;
 }
 
@@ -265,11 +263,17 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
 /*
  * Drop a superblock's refcount. The caller must hold sb_lock.
  */
-static void __put_super(struct super_block *sb)
+static void __put_super(struct super_block *s)
 {
-	if (!--sb->s_count) {
-		list_del_init(&sb->s_list);
-		destroy_super(sb);
+	if (!--s->s_count) {
+		list_del_init(&s->s_list);
+		WARN_ON(s->s_dentry_lru.node);
+		WARN_ON(s->s_inode_lru.node);
+		WARN_ON(!list_empty(&s->s_mounts));
+		security_sb_free(s);
+		put_user_ns(s->s_user_ns);
+		kfree(s->s_subtype);
+		call_rcu(&s->rcu, destroy_super_rcu);
 	}
 }
 
@@ -484,19 +488,12 @@ struct super_block *sget_userns(struct file_system_type *type,
 			continue;
 		if (user_ns != old->s_user_ns) {
 			spin_unlock(&sb_lock);
-			if (s) {
-				up_write(&s->s_umount);
-				destroy_super(s);
-			}
+			destroy_unused_super(s);
 			return ERR_PTR(-EBUSY);
 		}
 		if (!grab_super(old))
 			goto retry;
-		if (s) {
-			up_write(&s->s_umount);
-			destroy_super(s);
-			s = NULL;
-		}
+		destroy_unused_super(s);
 		return old;
 	}
 }
@@ -511,8 +508,7 @@ struct super_block *sget_userns(struct file_system_type *type,
 	err = set(s, data);
 	if (err) {
 		spin_unlock(&sb_lock);
-		up_write(&s->s_umount);
-		destroy_super(s);
+		destroy_unused_super(s);
 		return ERR_PTR(err);
 	}
 	s->s_type = type;
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Vladimir Davydov @ 2017-10-09  8:43 UTC
To: Al Viro
Cc: Michal Hocko, Jia-Ju Bai, torbjorn.lindh, linux-fsdevel, linux-kernel

On Sun, Oct 08, 2017 at 10:13:57PM +0100, Al Viro wrote:
> [...]
> Hmm... Some observations:
>
> * struct list_lru_one is fairly small and it doesn't pack well - you
> get 25% overhead on it. A dedicated kmem_cache might've been worth
> doing.
>
> * in addition to list_lru_one (all allocated one by one), you have
> an array of pointers to them. And that array never shrinks. What's
> more, the objects pointed to are only 3 times bigger than the
> pointers...
>
> * how hot is the memcg destruction codepath? The only real argument
> in favour of list instead of hlist is list_splice(); if that's not that
> critical, hlist would do just fine.

This is an LRU list, so we add objects to the tail and remove them from
the head (or vice versa). AFAICS hlist wouldn't be enough here, as it
doesn't store a pointer to the list tail.

Anyway, no matter which list implementation we use, list or hlist, we
have to synchronize resizing of memcg_lrus against concurrent list
modifications. AFAIU we can achieve that either by holding
list_lru_node->lock while copying the array, or by using an array of
pointers, in which case we can copy the array without holding the lock
and only take the lock to swap the pointers. Since the array may grow
quite big, I decided to use an array of pointers to avoid holding the
lock for too long.

> And that drives the size of struct list_lru_one down to two words, at
> which point getting rid of that array of pointers becomes rather
> attractive. The amount of cacheline bouncing might be an issue, but
> then it might be an overall win; can't tell without experiments. It
> certainly would've simplified things a whole lot, including the
> rollback on allocation failure during resize, etc. And
> list_lru_destroy() would get several orders of magnitude cheaper...

IMHO at the same time it would complicate list_lru_walk(), because the
isolate() callback may release the list_lru_node->lock, which opens a
time window for the list_lru_one to be reallocated. Currently, we don't
have to care about this, because once we have dereferenced a
list_lru_one, we can be sure it won't go away.

> * co-allocating ->list with ->node[] is almost certainly worth doing -
> AFAICS, it's a clear optimization, no matter whether we do anything
> else or not. Loops over list_lrus would be better off without fetching
> lru->node for every entry. _And_ the objects containing list_lru
> wouldn't be reachable via list_lrus.

By co-allocating ->node[] with list_lru we would gain nothing, because
then we wouldn't be able to embed list_lru in super_block and would have
to use a pointer there.

BTW, I think we need to remove the alignment from the
super_block->s_dentry_lru and s_inode_lru definitions. It seems to be
left over from the time when list_lru wasn't NUMA-aware. Now the
alignment is guaranteed by the definition of struct list_lru_node.

> As for the fs/super.c side... IMO destroy_super() ought to WARN_ON
> when it sees a non-NULL ->node. [...] Come to think of that, something
> like this might be a good approach:

Makes sense to me.

> [diff snipped]
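[Editorial note: the pointer-array trade-off Vladimir describes - copy the array with no lock held, take the lock only for the pointer swap - can be sketched like this. Hypothetical toy names, not the kernel's memcg_lrus code; the lock/unlock points are marked with comments.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model: memcg_lrus is an array of pointers, so resizing can build
 * and fill the bigger copy without the per-node lock held, and the
 * critical section shrinks to a single pointer swap. */
struct toy_lru_node {
	void **memcg_lrus;
	int nr;
};

static int toy_resize(struct toy_lru_node *n, int new_nr)
{
	void **bigger = calloc(new_nr, sizeof(void *));
	void **old;

	if (!bigger)
		return -1;
	/* the potentially long copy runs with no lock held */
	memcpy(bigger, n->memcg_lrus, n->nr * sizeof(void *));
	/* lock(n->lock) -- only the swap needs the list_lru_node lock */
	old = n->memcg_lrus;
	n->memcg_lrus = bigger;
	n->nr = new_nr;
	/* unlock(n->lock) */
	free(old);
	return 0;
}
```

Had the per-memcg lists been embedded in the array instead of pointed to, the whole copy would have to run under the lock, which is the scenario Vladimir is avoiding.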
* Re: [BUG] fs/super: a possible sleep-in-atomic bug in put_super

From: Al Viro @ 2017-10-06 12:19 UTC
To: Jia-Ju Bai
Cc: torbjorn.lindh, rgooch, linux-fsdevel, linux-kernel

On Fri, Oct 06, 2017 at 04:59:18PM +0800, Jia-Ju Bai wrote:
> According to fs/super.c, the kernel may sleep under a spinlock.
> The function call path is:
>   put_super (acquire the spinlock)
>     __put_super
>       destroy_super
>         list_lru_destroy
>           list_lru_unregister
>             mutex_lock --> may sleep
>           memcg_get_cache_ids
>             down_read --> may sleep
>
> This bug is found by my static analysis tool and my code review.

The invariant to watch is this: s->s_active > 0 => s->s_count > 0. In
other words, anything that passes from __put_super() to destroy_super()
has already been through deactivate_locked_super() and its
list_lru_destroy(). And list_lru_destroy() called twice, without a
list_lru_init() between the calls, will quietly do nothing on the second
call.

All other callers of destroy_super() are in alloc_super()/sget_userns().
The former is the only place where instances are created; the latter is
the only caller of the former. Direct calls of destroy_super() in there
a) happen only to instances that had not yet been visible in any shared
data structures and b) are done with no spinlocks held.

struct super_block has probably the most complex lifecycle in the entire
VFS. These days the nastiness is fortunately limited to the guts of
fs/super.c, but it's very definitely still there. I've posted sketches
of a description on fsdevel several times, but never managed to turn
that into coherent text ;-/
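[Editorial note: the invariant Al states can be modelled with a toy refcount pair. Hypothetical fields, not the real struct super_block: the point is that because each active reference pins an s_count reference, the final __put_super() can only run after deactivate_locked_super() has already done the sleep-capable lru teardown.]

```c
#include <assert.h>

/* Toy model of the lifecycle: s_active > 0 implies s_count > 0, so the
 * spinlocked final put never has lru teardown work left to do. */
struct toy_sb {
	int s_count;	/* passive references */
	int s_active;	/* active references; each pins one s_count */
	int lru_alive;	/* 1 until the (sleeping) lru teardown has run */
};

static void toy_put_super(struct toy_sb *s)	/* runs under "sb_lock" */
{
	if (!--s->s_count) {
		assert(s->s_active == 0);	/* the invariant at work */
		assert(!s->lru_alive);		/* teardown already happened */
	}
}

static void toy_deactivate(struct toy_sb *s)	/* no spinlock held here */
{
	if (!--s->s_active) {
		s->lru_alive = 0;	/* list_lru_destroy(): may sleep, safely */
		toy_put_super(s);	/* drop the s_count the active ref pinned */
	}
}
```

Any ordering of passive puts and the final deactivate satisfies the asserts, mirroring why the reported path cannot actually sleep under the spinlock.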
Thread overview: 11+ messages (newest: 2017-10-09  8:43 UTC)

2017-10-06  8:59 [BUG] fs/super: a possible sleep-in-atomic bug in put_super  Jia-Ju Bai
2017-10-06  9:06 ` Michal Hocko
2017-10-07 11:56   ` Vladimir Davydov
2017-10-07 17:06     ` Al Viro
2017-10-07 21:14       ` Al Viro
2017-10-08  0:56         ` Al Viro
2017-10-08  2:03           ` Al Viro
2017-10-08 15:47             ` Vladimir Davydov
2017-10-08 21:13               ` Al Viro
2017-10-09  8:43                 ` Vladimir Davydov
2017-10-06 12:19 ` Al Viro