* Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations [not found] ` <20200221120618.GA194360@google.com> @ 2020-02-21 13:28 ` Joel Fernandes 2020-02-21 19:21 ` Uladzislau Rezki 0 siblings, 1 reply; 7+ messages in thread From: Joel Fernandes @ 2020-02-21 13:28 UTC (permalink / raw) To: Uladzislau Rezki Cc: Theodore Y. Ts'o, Paul E. McKenney, Ext4 Developers List, Suraj Jitindar Singh, LKML, rcu On Fri, Feb 21, 2020 at 07:06:18AM -0500, Joel Fernandes wrote: > On Tue, Feb 18, 2020 at 06:08:57PM +0100, Uladzislau Rezki wrote: > > On Mon, Feb 17, 2020 at 02:33:14PM -0500, Theodore Y. Ts'o wrote: > > > On Mon, Feb 17, 2020 at 05:08:27PM +0100, Uladzislau Rezki wrote: > > > > Hello, Joel, Paul, Ted. > > > > > > > > > > > > > > Good point! > > > > > > > > > > Now that kfree_rcu() is on its way to being less of a hack deeply > > > > > entangled into the bowels of RCU, this might be fairly easy to implement. > > > > > It might well be simply a matter of a function pointer and a kvfree_rcu() > > > > > API. Adding Uladzislau Rezki and Joel Fernandes on CC for their thoughts. > > > > > > > > > I think it makes sense. For example, I see there is a similar demand in > > > > mm/list_lru.c too. As for the implementation, it will not be hard, I think. > > > > > > > > The easiest way is to inject kvfree() support directly into the existing kfree_call_rcu() > > > > logic (we will probably need to rename that function), i.e. to free vmalloc() allocations > > > > only in the "emergency path" by just calling kvfree(). That function in its turn will > > > > figure out whether it is a _vmalloc_ address or not and trigger the corresponding "free" path. > > > > > > The other difference between ext4_kvfree_array_rcu() and kfree_rcu() > > > is that kfree_rcu() is a magic macro which frees a structure, which > > > has to contain a struct rcu_head.
In this case, I'm freeing a pointer > > > to a set of structures, or in another case, a set of buffer_heads, which > > > do *not* have an rcu_head structure. > > > > > We can implement a kvfree_rcu() that deals with a pointer only; that is not > > an issue. I mean without embedding an rcu_head into the structure or whatever > > else. > > > > I tried to implement it with a small number of changes to make it clearer > > and more readable. Please have a look: > > > > <snip> > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > Overall this implementation is nice. You are basically avoiding allocating an > rcu_head like Ted did, by using the array-of-pointers technique we used for > the previous kfree_rcu() work. > > One thing stands out: the path where we could not allocate a page for the new > block node: > > > @@ -3061,6 +3148,11 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > if (krcp->initialized) > > spin_unlock(&krcp->lock); > > local_irq_restore(flags); > > + > > + if (!skip_call_rcu) { > > + synchronize_rcu(); > > + kvfree(ptr_to_free); > > We can't block; it has to be async, otherwise everything else blocks, and I > think this can also be used from interrupt handlers, which would at least be > an SWA violation. So perhaps it needs to allocate an rcu_head wrapper object > itself for the 'emergency case' and use the regular techniques. > > Another thing that stands out is the code duplication; if we can make this > reuse as much of the previous code as possible, that'd be great. I'd like > to avoid bvcached and bvhead if possible. Maybe we can store information > about the fact that this is a 'special object' in some of the lower-order > bits of the pointer. Then we can detect that it is 'special' and free it > using kvfree() during the reclaim. I was thinking something like the following, only build-tested -- just to show the idea. Perhaps the synchronize_rcu() should be done from a workqueue handler to prevent IRQ crapping out?
Basically what I did differently is: 1. Use the existing kfree_rcu_bulk_data::records array to store the to-be-freed array. 2. In case of emergency, allocate a new wrapper and tag the pointer. Read the tag later to figure out it's an array wrapper and do an additional kvfree. The debug_objects bits obviously wouldn't work for the !emergency kvfree case; not sure what we can do there. ---8<----------------------- diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 2678a37c31696..19fd7c74ad532 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -805,7 +805,7 @@ static inline notrace void rcu_read_unlock_sched_notrace(void) #define __kfree_rcu(head, offset) \ do { \ BUILD_BUG_ON(!__is_kfree_rcu_offset(offset)); \ - kfree_call_rcu(head, (rcu_callback_t)(unsigned long)(offset)); \ + kfree_call_rcu(head, (rcu_callback_t)(unsigned long)(offset), NULL); \ } while (0) /** @@ -842,6 +842,14 @@ do { \ __kfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \ } while (0) +#define kvfree_rcu(ptr) \ +do { \ + typeof(ptr) ___p = (ptr); \ + \ + if (___p) \ + kfree_call_rcu(NULL, (rcu_callback_t)(unsigned long)(0), ___p); \ +} while (0) + /* * Place this after a lock-acquisition primitive to guarantee that * an UNLOCK+LOCK pair acts as a full barrier.
This guarantee applies diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 045c28b71f4f3..a12ecc1645fa9 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -34,7 +34,7 @@ static inline void synchronize_rcu_expedited(void) synchronize_rcu(); } -static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) +static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func, void *ptr) { call_rcu(head, func); } diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index 45f3f66bb04df..1e445b566c019 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -33,7 +33,7 @@ static inline void rcu_virt_note_context_switch(int cpu) } void synchronize_rcu_expedited(void); -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func); +void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func, void *ptr); void rcu_barrier(void); bool rcu_eqs_special_set(int cpu); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ec81139cc4c6a..7b6ab4160f080 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2814,6 +2814,15 @@ struct kfree_rcu_cpu { bool initialized; }; +/* + * Used in a situation where array of pointers could + * not be put onto a kfree_bulk_data::records array. 
+ */ +struct kfree_rcu_wrap_kvarray { + struct rcu_head head; + void *ptr; +}; + static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc); static __always_inline void @@ -2873,12 +2882,25 @@ static void kfree_rcu_work(struct work_struct *work) */ for (; head; head = next) { unsigned long offset = (unsigned long)head->func; + bool is_array_ptr = false; + + if (((unsigned long)head - offset) & BIT(0)) { + is_array_ptr = true; + offset = offset - 1; + } next = head->next; - debug_rcu_head_unqueue(head); + if (!is_array_ptr) + debug_rcu_head_unqueue(head); + rcu_lock_acquire(&rcu_callback_map); trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset); + if (is_array_ptr) { + struct kfree_rcu_wrap_kvarray *kv = (void *)head - offset; + kvfree((void *)kv->ptr); + } + if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset))) kfree((void *)head - offset); @@ -2975,7 +2997,7 @@ static void kfree_rcu_monitor(struct work_struct *work) static inline bool kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, - struct rcu_head *head, rcu_callback_t func) + struct rcu_head *head, rcu_callback_t func, void *ptr) { struct kfree_rcu_bulk_data *bnode; @@ -3009,14 +3031,20 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, } #ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD - head->func = func; - head->next = krcp->bhead->head_free_debug; - krcp->bhead->head_free_debug = head; + /* debug_objects doesn't work for kvfree */ + if (!ptr) { + head->func = func; + head->next = krcp->bhead->head_free_debug; + krcp->bhead->head_free_debug = head; + } #endif /* Finally insert. */ - krcp->bhead->records[krcp->bhead->nr_records++] = - (void *) head - (unsigned long) func; + if (ptr) + krcp->bhead->records[krcp->bhead->nr_records++] = ptr; + else + krcp->bhead->records[krcp->bhead->nr_records++] = + (void *) head - (unsigned long) func; return true; } @@ -3033,10 +3061,11 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, * be free'd in workqueue context. 
This allows us to: batch requests together to * reduce the number of grace periods during heavy kfree_rcu() load. */ -void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) +void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func, void *ptr) { unsigned long flags; struct kfree_rcu_cpu *krcp; + bool ret; local_irq_save(flags); // For safely calling this_cpu_ptr(). krcp = this_cpu_ptr(&krc); @@ -3044,7 +3073,8 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) spin_lock(&krcp->lock); // Queue the object but don't yet schedule the batch. - if (debug_rcu_head_queue(head)) { + // NOTE: debug objects doesn't work for kvfree. + if (!ptr && debug_rcu_head_queue(head)) { // Probable double kfree_rcu(), just leak. WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n", __func__, head); @@ -3055,7 +3085,29 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) * Under high memory pressure GFP_NOWAIT can fail, * in that case the emergency path is maintained. */ - if (unlikely(!kfree_call_rcu_add_ptr_to_bulk(krcp, head, func))) { + ret = !kfree_call_rcu_add_ptr_to_bulk(krcp, head, func, ptr); + if (unlikely(!ret)) { + if (ptr) { + struct kfree_rcu_wrap_kvarray *kvwrap; + + kvwrap = kzalloc(sizeof(*kvwrap), GFP_KERNEL); + + // If memory is really low, just try inline-freeing. + if (!kvwrap) { + // NOT SURE if permitted due to IRQ. Maybe we + // should try doing this from WQ? + synchronize_rcu(); + kvfree(ptr); + } + + kvwrap->ptr = ptr; + ptr = NULL; + head = &(kvwrap->head); + func = offsetof(typeof(*kvwrap), head); + // Tag the array as wrapper + func = (rcu_callback_t)((unsigned long)func + 1); + } + head->func = func; head->next = krcp->head; krcp->head = head; ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations 2020-02-21 13:28 ` [PATCH RFC] ext4: fix potential race between online resizing and write operations Joel Fernandes @ 2020-02-21 19:21 ` Uladzislau Rezki 2020-02-21 19:25 ` Uladzislau Rezki 2020-02-22 22:12 ` Joel Fernandes 0 siblings, 2 replies; 7+ messages in thread From: Uladzislau Rezki @ 2020-02-21 19:21 UTC (permalink / raw) To: Joel Fernandes Cc: Uladzislau Rezki, Theodore Y. Ts'o, Paul E. McKenney, Ext4 Developers List, Suraj Jitindar Singh, LKML, rcu > > > > Overall this implementation is nice. You are basically avoiding allocating > > rcu_head like Ted did by using the array-of-pointers technique we used for > > the previous kfree_rcu() work. > > > > One thing stands out, the path where we could not allocate a page for the new > > block node: > > > > > @@ -3061,6 +3148,11 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > > if (krcp->initialized) > > > spin_unlock(&krcp->lock); > > > local_irq_restore(flags); > > > + > > > + if (!skip_call_rcu) { > > > + synchronize_rcu(); > > > + kvfree(ptr_to_free); > > > > We can't block, it has to be async otherwise everything else blocks, and I > > think this can also be used from interrupt handlers which would at least be > > an SWA violation. So perhaps it needs to allocate an rcu_head wrapper object > > itself for the 'emergeny case' and use the regular techniques. > > > > Another thing that stands out is the code duplication, if we can make this > > reuse as much as of the previous code as possible, that'd be great. I'd like > > to avoid bvcached and bvhead if possible. Maybe we can store information > > about the fact that this is a 'special object' in some of the lower-order > > bits of the pointer. Then we can detect that it is 'special' and free it > > using kvfree() during the reclaim > > Basically what I did different is: > 1. 
Use the existing kfree_rcu_bulk_data::records array to store the > to-be-freed array. > 2. In case of emergency, allocate a new wrapper and tag the pointer. > Read the tag later to figure its an array wrapper and do additional kvfree. > I see your point and agree that the duplication is odd and we should avoid it as much as possible. Also, I like the idea of using the wrapper as one more chance to build a "head" for a headless object. I did not mix pointers because then you would need to understand what is what. It is OK for the "emergency" path, because we can simply serialize it through a kvfree() call, which checks internally which allocator the ptr address belongs to: <snip> void kvfree(const void *addr) { if (is_vmalloc_addr(addr)) vfree(addr); else kfree(addr); } <snip> whereas the normal path, I mean the "bulk" one where we store pointers into an array, would be broken. We cannot call kfree_bulk(array, nr_entries) if the passed array contains "vmalloc" pointers, because that is a different allocator. Therefore, I deliberately made it a special case. > > Perhaps the synchronize_rcu() should be done from a workqueue handler > to prevent IRQ crapping out? > I think so. For example, one approach would be: <snip> struct free_deferred { struct llist_head list; struct work_struct wq; }; static DEFINE_PER_CPU(struct free_deferred, free_deferred); static void free_work(struct work_struct *w) { struct free_deferred *p = container_of(w, struct free_deferred, wq); struct llist_node *t, *llnode; synchronize_rcu(); llist_for_each_safe(llnode, t, llist_del_all(&p->list)) vfree((void *)llnode, 1); } static inline void free_deferred_common(void *ptr_to_free) { struct free_deferred *p = raw_cpu_ptr(&free_deferred); if (llist_add((struct llist_node *)ptr_to_free, &p->list)) schedule_work(&p->wq); } <snip> and it seems it should work, because we know that KMALLOC_MIN_SIZE cannot be less than a machine word: /* * Kmalloc subsystem.
*/ #ifndef KMALLOC_MIN_SIZE #define KMALLOC_MIN_SIZE (1 << KMALLOC_SHIFT_LOW) #endif and when it comes to a vmalloc pointer, it cannot be less than one PAGE_SIZE :) Another thing: we are talking about a "headless" variant, which is special and therefore implies some restrictions, since we need dynamic memory to drive it. For example, a "headless" object can be freed from preemptible context only, because the freeing can be inlined: <snip> + // NOT SURE if permitted due to IRQ. Maybe we + // should try doing this from WQ? + synchronize_rcu(); + kvfree(ptr); <snip> Calling synchronize_rcu() from IRQ context will screw the system up :) because the current CPU will never pass a quiescent state, if I am not missing something. Also, kvfree() itself can be called from preemptible context only, excluding IRQs; there is a special path for that, since otherwise vfree() can sleep. > > debug_objects bits wouldn't work obviously for the !emergency kvfree case, > not sure what we can do there. > Agree. Thank you, Joel, for your comments! -- Vlad Rezki ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations 2020-02-21 19:21 ` Uladzislau Rezki @ 2020-02-21 19:25 ` Uladzislau Rezki 2020-02-22 22:12 ` Joel Fernandes 1 sibling, 0 replies; 7+ messages in thread From: Uladzislau Rezki @ 2020-02-21 19:25 UTC (permalink / raw) To: Uladzislau Rezki Cc: Joel Fernandes, Theodore Y. Ts'o, Paul E. McKenney, Ext4 Developers List, Suraj Jitindar Singh, LKML, rcu > > llist_for_each_safe(llnode, t, llist_del_all(&p->list)) > vfree((void *)llnode, 1); > Should be kvfree((void *)llnode); -- Vlad Rezki ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations 2020-02-21 19:21 ` Uladzislau Rezki 2020-02-21 19:25 ` Uladzislau Rezki @ 2020-02-22 22:12 ` Joel Fernandes 2020-02-24 17:02 ` Uladzislau Rezki 1 sibling, 1 reply; 7+ messages in thread From: Joel Fernandes @ 2020-02-22 22:12 UTC (permalink / raw) To: Uladzislau Rezki Cc: Theodore Y. Ts'o, Paul E. McKenney, Ext4 Developers List, Suraj Jitindar Singh, LKML, rcu On Fri, Feb 21, 2020 at 08:21:52PM +0100, Uladzislau Rezki wrote: > > > > > > Overall this implementation is nice. You are basically avoiding allocating > > > rcu_head like Ted did by using the array-of-pointers technique we used for > > > the previous kfree_rcu() work. > > > > > > One thing stands out, the path where we could not allocate a page for the new > > > block node: > > > > > > > @@ -3061,6 +3148,11 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > > > if (krcp->initialized) > > > > spin_unlock(&krcp->lock); > > > > local_irq_restore(flags); > > > > + > > > > + if (!skip_call_rcu) { > > > > + synchronize_rcu(); > > > > + kvfree(ptr_to_free); > > > > > > We can't block, it has to be async otherwise everything else blocks, and I > > > think this can also be used from interrupt handlers which would at least be > > > an SWA violation. So perhaps it needs to allocate an rcu_head wrapper object > > > itself for the 'emergeny case' and use the regular techniques. > > > > > > Another thing that stands out is the code duplication, if we can make this > > > reuse as much as of the previous code as possible, that'd be great. I'd like > > > to avoid bvcached and bvhead if possible. Maybe we can store information > > > about the fact that this is a 'special object' in some of the lower-order > > > bits of the pointer. Then we can detect that it is 'special' and free it > > > using kvfree() during the reclaim > > > > Basically what I did different is: > > 1. 
Use the existing kfree_rcu_bulk_data::records array to store the > > to-be-freed array. > > 2. In case of emergency, allocate a new wrapper and tag the pointer. > > Read the tag later to figure its an array wrapper and do additional kvfree. > > > I see your point and agree that duplication is odd and we should avoid > it as much as possible. Also, i like the idea of using the wrapper as > one more chance to build a "head" for headless object. > > I did not mix pointers because then you will need to understand what is what. Well, that's why I brought up the whole tagging idea. Then you don't need separate pointers to manage either (edit: but maybe you do, as you mentioned about vfree below..). > It is OK for "emergency" path, because we simply can just serialize it by kvfree() > call, it checks inside what the ptr address belong to: > > <snip> > void kvfree(const void *addr) > { > if (is_vmalloc_addr(addr)) > vfree(addr); > else > kfree(addr); > } > <snip> > > whereas normal path, i mean "bulk one" where we store pointers into array > would be broken. We can not call kfree_bulk(array, nr_entries) if the passed > array contains "vmalloc" pointers, because it is different allocator. Therefore, > i deliberately have made it as a special case. Ok, it would be nice if you could verify that the ptr_to_free passed to kfree_call_rcu() is in fact a vmalloc pointer.
For example one approach would be: > > <snip> > struct free_deferred { > struct llist_head list; > struct work_struct wq; > }; > static DEFINE_PER_CPU(struct free_deferred, free_deferred); > > static void free_work(struct work_struct *w) > { > struct free_deferred *p = container_of(w, struct free_deferred, wq); > struct llist_node *t, *llnode; > > synchronize_rcu(); > > llist_for_each_safe(llnode, t, llist_del_all(&p->list)) > vfree((void *)llnode, 1); > } > > static inline void free_deferred_common(void *ptr_to_free) > { > struct free_deferred *p = raw_cpu_ptr(&free_deferred); > > if (llist_add((struct llist_node *)ptr_to_free, &p->list)) Would this not corrupt the ptr_to_free pointer which readers might still be accessing since grace period has not yet ended? We cannot touch the ptr_to_free pointer until after the grace period has ended. > schedule_work(&p->wq); > } > <snip> > > and it seems it should work. Because we know that KMALLOC_MIN_SIZE > can not be less then machine word: > > /* > * Kmalloc subsystem. > */ > #ifndef KMALLOC_MIN_SIZE > #define KMALLOC_MIN_SIZE (1 << KMALLOC_SHIFT_LOW) > #endif > > when it comes to vmalloc pointer it can not be less then one PAGE_SIZE :) > > Another thing: > > we are talking about "headless" variant that is special, therefore it > implies to have some restrictions, since we need a dynamic memory to > drive it. For example "headless" object can be freed from preemptible > context only, because freeing can be inlined: > > <snip> > + // NOT SURE if permitted due to IRQ. Maybe we > + // should try doing this from WQ? > + synchronize_rcu(); > + kvfree(ptr); > <snip> > > Calling synchronize_rcu() from the IRQ context will screw the system up :) > Because the current CPU will never pass the QS state if i do not miss something. Yes are you right, calling synchronize_rcu() from IRQ context is a strict no-no. I believe we could tap into the GFP_ATOMIC emergency memory pool for this emergency situation. 
This pool is used for emergency cases. I think in emergency we can grow an rcu_head on this pool. > Also kvfree() itself can be called from the preemptible context only, excluding IRQ, > there is a special path for it, otherwise vfree() can sleep. Ok that's good to know. > > debug_objects bits wouldn't work obviously for the !emergency kvfree case, > > not sure what we can do there. > > > Agree. > > Thank you, Joel, for your comments! No problem, I think we have a couple of ideas here. What I also wanted to do was (may be after all this), see if we can create an API for head-less kfree based on the same ideas. Not just for arrays for for any object. Calling it, say, kfree_rcu_headless() and then use the bulk array as we have been doing. That would save any users from having an rcu_head -- of course with all needed warnings about memory allocation failure. Vlad, What do you think? Paul, any thoughts on this? thanks, - Joel ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations 2020-02-22 22:12 ` Joel Fernandes @ 2020-02-24 17:02 ` Uladzislau Rezki 2020-02-24 23:14 ` Paul E. McKenney 2020-02-25 1:48 ` Joel Fernandes 0 siblings, 2 replies; 7+ messages in thread From: Uladzislau Rezki @ 2020-02-24 17:02 UTC (permalink / raw) To: Joel Fernandes Cc: Uladzislau Rezki, Theodore Y. Ts'o, Paul E. McKenney, Ext4 Developers List, Suraj Jitindar Singh, LKML, rcu > > > > > > > > Overall this implementation is nice. You are basically avoiding allocating > > > > rcu_head like Ted did by using the array-of-pointers technique we used for > > > > the previous kfree_rcu() work. > > > > > > > > One thing stands out, the path where we could not allocate a page for the new > > > > block node: > > > > > > > > > @@ -3061,6 +3148,11 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > > > > if (krcp->initialized) > > > > > spin_unlock(&krcp->lock); > > > > > local_irq_restore(flags); > > > > > + > > > > > + if (!skip_call_rcu) { > > > > > + synchronize_rcu(); > > > > > + kvfree(ptr_to_free); > > > > > > > > We can't block, it has to be async otherwise everything else blocks, and I > > > > think this can also be used from interrupt handlers which would at least be > > > > an SWA violation. So perhaps it needs to allocate an rcu_head wrapper object > > > > itself for the 'emergeny case' and use the regular techniques. > > > > > > > > Another thing that stands out is the code duplication, if we can make this > > > > reuse as much as of the previous code as possible, that'd be great. I'd like > > > > to avoid bvcached and bvhead if possible. Maybe we can store information > > > > about the fact that this is a 'special object' in some of the lower-order > > > > bits of the pointer. Then we can detect that it is 'special' and free it > > > > using kvfree() during the reclaim > > > > > > Basically what I did different is: > > > 1. 
Use the existing kfree_rcu_bulk_data::records array to store the > > > to-be-freed array. > > > 2. In case of emergency, allocate a new wrapper and tag the pointer. > > > Read the tag later to figure its an array wrapper and do additional kvfree. > > > > > I see your point and agree that duplication is odd and we should avoid > > it as much as possible. Also, i like the idea of using the wrapper as > > one more chance to build a "head" for headless object. > > > > I did not mix pointers because then you will need to understand what is what. > > Well that's why I brought up the whole tagging idea. Then you don't need > separate pointers to manage either (edit: but maybe you do as you mentioned > vfree below..). > Right. We can use tagging idea to separate kmalloc/vmalloc pointers to place them into different arrays. Because kvmalloc() can return either SLAB pointer or vmalloc one. > > It is OK for "emergency" path, because we simply can just serialize it by kvfree() > > call, it checks inside what the ptr address belong to: > > > > <snip> > > void kvfree(const void *addr) > > { > > if (is_vmalloc_addr(addr)) > > vfree(addr); > > else > > kfree(addr); > > } > > <snip> > > > > whereas normal path, i mean "bulk one" where we store pointers into array > > would be broken. We can not call kfree_bulk(array, nr_entries) if the passed > > array contains "vmalloc" pointers, because it is different allocator. Therefore, > > i deliberately have made it as a special case. > > Ok, it would be nice if you can verify that ptr_to_free passed to > kfree_call_rcu() is infact a vmalloc pointer. > We can do that. We can check it by calling is_vmalloc_addr() on ptr. So it is possible to differentiate. > > > Perhaps the synchronize_rcu() should be done from a workqueue handler > > > to prevent IRQ crapping out? > > > > > I think so. 
For example one approach would be: > > > > <snip> > > struct free_deferred { > > struct llist_head list; > > struct work_struct wq; > > }; > > static DEFINE_PER_CPU(struct free_deferred, free_deferred); > > > > static void free_work(struct work_struct *w) > > { > > struct free_deferred *p = container_of(w, struct free_deferred, wq); > > struct llist_node *t, *llnode; > > > > synchronize_rcu(); > > > > llist_for_each_safe(llnode, t, llist_del_all(&p->list)) > > vfree((void *)llnode, 1); > > } > > > > static inline void free_deferred_common(void *ptr_to_free) > > { > > struct free_deferred *p = raw_cpu_ptr(&free_deferred); > > > > if (llist_add((struct llist_node *)ptr_to_free, &p->list)) > > Would this not corrupt the ptr_to_free pointer which readers might still be > accessing since grace period has not yet ended? > > We cannot touch the ptr_to_free pointer until after the grace period has > ended. > Right you are. We can do that only after grace period is passed, after synchronize_rcu(). Good point :) > > } > > <snip> > > > > and it seems it should work. Because we know that KMALLOC_MIN_SIZE > > can not be less then machine word: > > > > /* > > * Kmalloc subsystem. > > */ > > #ifndef KMALLOC_MIN_SIZE > > #define KMALLOC_MIN_SIZE (1 << KMALLOC_SHIFT_LOW) > > #endif > > > > when it comes to vmalloc pointer it can not be less then one PAGE_SIZE :) > > > > Another thing: > > > > we are talking about "headless" variant that is special, therefore it > > implies to have some restrictions, since we need a dynamic memory to > > drive it. For example "headless" object can be freed from preemptible > > context only, because freeing can be inlined: > > > > <snip> > > + // NOT SURE if permitted due to IRQ. Maybe we > > + // should try doing this from WQ? > > + synchronize_rcu(); > > + kvfree(ptr); > > <snip> > > > > Calling synchronize_rcu() from the IRQ context will screw the system up :) > > Because the current CPU will never pass the QS state if i do not miss something. 
> > Yes are you right, calling synchronize_rcu() from IRQ context is a strict no-no. > > I believe we could tap into the GFP_ATOMIC emergency memory pool for this > emergency situation. This pool is used for emergency cases. I think in > emergency we can grow an rcu_head on this pool. > > > Also kvfree() itself can be called from the preemptible context only, excluding IRQ, > > there is a special path for it, otherwise vfree() can sleep. > > Ok that's good to know. > > > > debug_objects bits wouldn't work obviously for the !emergency kvfree case, > > > not sure what we can do there. > > > > > Agree. > > > > Thank you, Joel, for your comments! > > No problem, I think we have a couple of ideas here. > > What I also wanted to do was (may be after all this), see if we can create an > API for head-less kfree based on the same ideas. Not just for arrays for for > any object. Calling it, say, kfree_rcu_headless() and then use the bulk array > as we have been doing. That would save any users from having an rcu_head -- > of course with all needed warnings about memory allocation failure. Vlad, > What do you think? Paul, any thoughts on this? > I like it. It would be more clean interface. 
Also there are places where people do not embed the rcu_head into their stuctures for some reason and do like: <snip> synchronize_rcu(); kfree(p); <snip> <snip> urezki@pc636:~/data/ssd/coding/linux-rcu$ find ./ -name "*.c" | xargs grep -C 1 -rn "synchronize_rcu" | grep kfree ./arch/x86/mm/mmio-mod.c-314- kfree(found_trace); ./kernel/module.c-3910- kfree(mod->args); ./kernel/trace/ftrace.c-5078- kfree(direct); ./kernel/trace/ftrace.c-5155- kfree(direct); ./kernel/trace/trace_probe.c-1087- kfree(link); ./fs/nfs/sysfs.c-113- kfree(old); ./fs/ext4/super.c-1701- kfree(old_qname); ./net/ipv4/gre.mod.c-36- { 0xfc3fcca2, "kfree_skb" }, ./net/core/sysctl_net_core.c-143- kfree(cur); ./drivers/crypto/nx/nx-842-pseries.c-1010- kfree(old_devdata); ./drivers/misc/vmw_vmci/vmci_context.c-692- kfree(notifier); ./drivers/misc/vmw_vmci/vmci_event.c-213- kfree(s); ./drivers/infiniband/core/device.c:2162: * synchronize_rcu before the netdev is kfreed, so we ./drivers/infiniband/hw/hfi1/sdma.c-1337- kfree(dd->per_sdma); ./drivers/net/ethernet/myricom/myri10ge/myri10ge.c-3582- kfree(mgp->ss); ./drivers/net/ethernet/myricom/myri10ge/myri10ge.mod.c-156- { 0x37a0cba, "kfree" }, ./drivers/net/ethernet/mellanox/mlx5/core/fpga/tls.c:286: synchronize_rcu(); /* before kfree(flow) */ ./drivers/net/ethernet/mellanox/mlxsw/core.c-1504- kfree(rxl_item); ./drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c-6648- kfree(adapter->mbox_log); ./drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c-6650- kfree(adapter); ./drivers/block/drbd/drbd_receiver.c-3804- kfree(old_net_conf); ./drivers/block/drbd/drbd_receiver.c-4176- kfree(old_disk_conf); ./drivers/block/drbd/drbd_state.c-2074- kfree(old_conf); ./drivers/block/drbd/drbd_nl.c-1689- kfree(old_disk_conf); ./drivers/block/drbd/drbd_nl.c-2522- kfree(old_net_conf); ./drivers/block/drbd/drbd_nl.c-2935- kfree(old_disk_conf); ./drivers/mfd/dln2.c-178- kfree(i); ./drivers/staging/fwserial/fwserial.c-2122- kfree(peer); <snip> -- Vlad Rezki ^ permalink raw reply 
[flat|nested] 7+ messages in thread
* Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations 2020-02-24 17:02 ` Uladzislau Rezki @ 2020-02-24 23:14 ` Paul E. McKenney 2020-02-25 1:48 ` Joel Fernandes 1 sibling, 0 replies; 7+ messages in thread From: Paul E. McKenney @ 2020-02-24 23:14 UTC (permalink / raw) To: Uladzislau Rezki Cc: Joel Fernandes, Theodore Y. Ts'o, Ext4 Developers List, Suraj Jitindar Singh, LKML, rcu On Mon, Feb 24, 2020 at 06:02:19PM +0100, Uladzislau Rezki wrote: > > > > > > > > > > Overall this implementation is nice. You are basically avoiding allocating > > > > > rcu_head like Ted did by using the array-of-pointers technique we used for > > > > > the previous kfree_rcu() work. > > > > > > > > > > One thing stands out, the path where we could not allocate a page for the new > > > > > block node: > > > > > > > > > > > @@ -3061,6 +3148,11 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func) > > > > > > if (krcp->initialized) > > > > > > spin_unlock(&krcp->lock); > > > > > > local_irq_restore(flags); > > > > > > + > > > > > > + if (!skip_call_rcu) { > > > > > > + synchronize_rcu(); > > > > > > + kvfree(ptr_to_free); > > > > > > > > > > We can't block, it has to be async otherwise everything else blocks, and I > > > > > think this can also be used from interrupt handlers which would at least be > > > > > an SWA violation. So perhaps it needs to allocate an rcu_head wrapper object > > > > > itself for the 'emergeny case' and use the regular techniques. > > > > > > > > > > Another thing that stands out is the code duplication, if we can make this > > > > > reuse as much as of the previous code as possible, that'd be great. I'd like > > > > > to avoid bvcached and bvhead if possible. Maybe we can store information > > > > > about the fact that this is a 'special object' in some of the lower-order > > > > > bits of the pointer. 
> > > > > Then we can detect that it is 'special' and free it
> > > > > using kvfree() during the reclaim.
> > > >
> > > > Basically, what I did differently is:
> > > > 1. Use the existing kfree_rcu_bulk_data::records array to store the
> > > >    to-be-freed array.
> > > > 2. In case of emergency, allocate a new wrapper and tag the pointer.
> > > >    Read the tag later to figure out it is an array wrapper and do an
> > > >    additional kvfree.
> > >
> > > I see your point and agree that duplication is odd and we should avoid
> > > it as much as possible. Also, I like the idea of using the wrapper as
> > > one more chance to build a "head" for a headless object.
> > >
> > > I did not mix pointers, because then you would need to understand what is what.
> >
> > Well, that's why I brought up the whole tagging idea. Then you don't need
> > separate pointers to manage either (edit: but maybe you do, as you mentioned
> > vfree below..).
> >
> Right. We can use the tagging idea to separate kmalloc/vmalloc pointers and
> place them into different arrays, because kvmalloc() can return either a
> SLAB pointer or a vmalloc one.
>
> > > It is OK for the "emergency" path, because we can simply serialize it with a
> > > kvfree() call, which checks inside what the ptr address belongs to:
> > >
> > > <snip>
> > > void kvfree(const void *addr)
> > > {
> > > 	if (is_vmalloc_addr(addr))
> > > 		vfree(addr);
> > > 	else
> > > 		kfree(addr);
> > > }
> > > <snip>
> > >
> > > whereas the normal path, I mean the "bulk" one where we store pointers into an
> > > array, would be broken. We can not call kfree_bulk(array, nr_entries) if the
> > > passed array contains "vmalloc" pointers, because that is a different allocator.
> > > Therefore, I deliberately made it a special case.
> >
> > Ok, it would be nice if you can verify that the ptr_to_free passed to
> > kfree_call_rcu() is in fact a vmalloc pointer.
> >
> We can do that. We can check it by calling is_vmalloc_addr() on the ptr,
> so it is possible to differentiate.
> > > > Perhaps the synchronize_rcu() should be done from a workqueue handler
> > > > to prevent IRQ crapping out?
> > > >
> > > I think so. For example, one approach would be:
> > >
> > > <snip>
> > > struct free_deferred {
> > > 	struct llist_head list;
> > > 	struct work_struct wq;
> > > };
> > > static DEFINE_PER_CPU(struct free_deferred, free_deferred);
> > >
> > > static void free_work(struct work_struct *w)
> > > {
> > > 	struct free_deferred *p = container_of(w, struct free_deferred, wq);
> > > 	struct llist_node *t, *llnode;
> > >
> > > 	synchronize_rcu();
> > >
> > > 	llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> > > 		vfree((void *) llnode);
> > > }
> > >
> > > static inline void free_deferred_common(void *ptr_to_free)
> > > {
> > > 	struct free_deferred *p = raw_cpu_ptr(&free_deferred);
> > >
> > > 	if (llist_add((struct llist_node *) ptr_to_free, &p->list))
> >
> > Would this not corrupt the ptr_to_free pointer, which readers might still be
> > accessing since the grace period has not yet ended?
> >
> > We cannot touch the ptr_to_free pointer until after the grace period has
> > ended.
> >
> Right you are. We can do that only after the grace period has passed,
> i.e. after synchronize_rcu(). Good point :)
>
> > > }
> > > <snip>
> > >
> > > and it seems it should work, because we know that KMALLOC_MIN_SIZE
> > > can not be less than a machine word:
> > >
> > > /*
> > >  * Kmalloc subsystem.
> > >  */
> > > #ifndef KMALLOC_MIN_SIZE
> > > #define KMALLOC_MIN_SIZE (1 << KMALLOC_SHIFT_LOW)
> > > #endif
> > >
> > > and when it comes to a vmalloc pointer, it can not be less than one PAGE_SIZE :)
> > >
> > > Another thing:
> > >
> > > we are talking about the "headless" variant that is special, therefore it
> > > implies some restrictions, since we need dynamic memory to drive it. For
> > > example, a "headless" object can be freed from preemptible context only,
> > > because the freeing can be inlined:
> > >
> > > <snip>
> > > +	// NOT SURE if permitted due to IRQ. Maybe we
> > > +	// should try doing this from WQ?
> > > +	synchronize_rcu();
> > > +	kvfree(ptr);
> > > <snip>
> > >
> > > Calling synchronize_rcu() from IRQ context will screw the system up :)
> > > because the current CPU will never pass the QS state, if I do not miss something.
> >
> > Yes, you are right, calling synchronize_rcu() from IRQ context is a strict no-no.
> >
> > I believe we could tap into the GFP_ATOMIC emergency memory pool for this
> > emergency situation. This pool is used for emergency cases. I think in an
> > emergency we can grow an rcu_head on this pool.
> >
> > > Also, kvfree() itself can be called from preemptible context only, excluding IRQ;
> > > there is a special path for it, otherwise vfree() can sleep.
> >
> > Ok, that's good to know.
> >
> > > > debug_objects bits wouldn't work obviously for the !emergency kvfree case,
> > > > not sure what we can do there.
> > > >
> > > Agree.
> > >
> > > Thank you, Joel, for your comments!
> >
> > No problem, I think we have a couple of ideas here.
> >
> > What I also wanted to do was (maybe after all this) see if we can create an
> > API for head-less kfree based on the same ideas. Not just for arrays, but for
> > any object. Calling it, say, kfree_rcu_headless() and then using the bulk array
> > as we have been doing. That would save any users from having an rcu_head --
> > of course with all needed warnings about memory allocation failure. Vlad,
> > what do you think? Paul, any thoughts on this?
> >
> I like it. It would be a cleaner interface.
> Also there are places where
> people do not embed the rcu_head into their structures for some reason
> and do like:
>
> <snip>
> synchronize_rcu();
> kfree(p);
> <snip>
>
> <snip>
> urezki@pc636:~/data/ssd/coding/linux-rcu$ find ./ -name "*.c" | xargs grep -C 1 -rn "synchronize_rcu" | grep kfree
> ./arch/x86/mm/mmio-mod.c-314-        kfree(found_trace);
> ./kernel/module.c-3910-              kfree(mod->args);
> [... same list of ~30 call sites as above ...]
> ./drivers/staging/fwserial/fwserial.c-2122-  kfree(peer);
> <snip>

That certainly is enough use cases to justify a new API. And given that
they are already calling synchronize_rcu(), having a function that only
sometimes calls synchronize_rcu() should be OK from a latency viewpoint.

The trick will be verifying that the existing calls to synchronize_rcu()
aren't providing some other guarantee... But submitting the patch and
seeing if the maintainers are willing to ack it is not without merit.
Plus, the cases where both the synchronize_rcu() and the kfree() are
under some "if" should be easier to verify as safe.

							Thanx, Paul
* Re: [PATCH RFC] ext4: fix potential race between online resizing and write operations
  2020-02-24 17:02         ` Uladzislau Rezki
  2020-02-24 23:14         ` Paul E. McKenney
@ 2020-02-25  1:48           ` Joel Fernandes
  1 sibling, 0 replies; 7+ messages in thread
From: Joel Fernandes @ 2020-02-25 1:48 UTC (permalink / raw)
To: Uladzislau Rezki
Cc: Theodore Y. Ts'o, Paul E. McKenney, Ext4 Developers List,
	Suraj Jitindar Singh, LKML, rcu

[..]
> > > > debug_objects bits wouldn't work obviously for the !emergency kvfree case,
> > > > not sure what we can do there.
> > > >
> > > Agree.
> > >
> > > Thank you, Joel, for your comments!
> >
> > No problem, I think we have a couple of ideas here.
> >
> > What I also wanted to do was (maybe after all this) see if we can create an
> > API for head-less kfree based on the same ideas. Not just for arrays, but for
> > any object. Calling it, say, kfree_rcu_headless() and then using the bulk array
> > as we have been doing. That would save any users from having an rcu_head --
> > of course with all needed warnings about memory allocation failure. Vlad,
> > what do you think? Paul, any thoughts on this?
> >
> I like it. It would be a cleaner interface.
> Also there are places where
> people do not embed the rcu_head into their structures for some reason
> and do like:
>
> <snip>
> synchronize_rcu();
> kfree(p);
> <snip>
>
> <snip>
> urezki@pc636:~/data/ssd/coding/linux-rcu$ find ./ -name "*.c" | xargs grep -C 1 -rn "synchronize_rcu" | grep kfree
> ./arch/x86/mm/mmio-mod.c-314-        kfree(found_trace);
> ./kernel/module.c-3910-              kfree(mod->args);
> [... same list of ~30 call sites as above ...]
> ./drivers/staging/fwserial/fwserial.c-2122-  kfree(peer);
> <snip>

Wow, that's pretty amazing; looks like it could be very useful. Do you want to
continue with your patch and then post it, so we can discuss it? Or I am happy
to take it on as well.

We could split the work into 3 parts:
1. Make changes for 2 separate per-CPU arrays, one for vfree and one for kfree.
2. Add headless support for both as an alternative API.
3. Handle the low-memory case somehow (I'll reply to the other thread).

Maybe we can start with 1, where you clean up your patch and post it, and then
I/we can work based on that?

thanks,

- Joel
end of thread, other threads:[~2020-02-25 1:48 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20200215233817.GA670792@mit.edu>
[not found] ` <20200216121246.GG2935@paulmck-ThinkPad-P72>
[not found] ` <20200217160827.GA5685@pc636>
[not found] ` <20200217193314.GA12604@mit.edu>
[not found] ` <20200218170857.GA28774@pc636>
[not found] ` <20200221120618.GA194360@google.com>
2020-02-21 13:28 ` [PATCH RFC] ext4: fix potential race between online resizing and write operations Joel Fernandes
2020-02-21 19:21 ` Uladzislau Rezki
2020-02-21 19:25 ` Uladzislau Rezki
2020-02-22 22:12 ` Joel Fernandes
2020-02-24 17:02 ` Uladzislau Rezki
2020-02-24 23:14 ` Paul E. McKenney
2020-02-25 1:48 ` Joel Fernandes