* [PATCH 0/2] Objpool performance improvements
@ 2024-04-24 21:52 Andrii Nakryiko
From: Andrii Nakryiko @ 2024-04-24 21:52 UTC
To: linux-trace-kernel, rostedt, mhiramat; +Cc: bpf, Andrii Nakryiko
Improve the performance of objpool (used heavily in the kretprobe hot path) with
two changes:
- inlining the performance-critical objpool_push()/objpool_pop() operations;
- avoiding recalculation of the relatively expensive nr_possible_cpus().
These opportunities were found when benchmarking and profiling kprobes and
kretprobes with BPF-based benchmarks. See individual patches for details and
results.
Andrii Nakryiko (2):
objpool: enable inlining objpool_push() and objpool_pop() operations
objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
include/linux/objpool.h | 105 +++++++++++++++++++++++++++++++++++--
lib/objpool.c | 112 +++-------------------------------------
2 files changed, 107 insertions(+), 110 deletions(-)
--
2.43.0
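For readers who want context on the API being optimized, below is a minimal sketch of how a consumer drives objpool. It follows the objpool_init()/objpool_pop()/objpool_push()/objpool_free() prototypes quoted in patch 1; the my_node type, the objinit callback shape, and the NULL context/release arguments are illustrative assumptions rather than code from an in-tree user.

#include <linux/gfp.h>
#include <linux/objpool.h>

struct my_node {
        unsigned long data;     /* caller-defined payload carried by each pooled object */
};

/* assumed objinit callback shape: invoked once per pre-allocated object */
static int my_node_init(void *obj, void *context)
{
        struct my_node *node = obj;

        node->data = 0;
        return 0;
}

static struct objpool_head my_pool;

static int my_pool_setup(int nr_objs)
{
        /* all objects are pre-allocated here; nothing is allocated in the hot path */
        return objpool_init(&my_pool, nr_objs, sizeof(struct my_node),
                            GFP_KERNEL, NULL, my_node_init, NULL);
}

/* hot path: the pop/push pair below is what these patches speed up */
static void my_hot_path(void)
{
        struct my_node *node = objpool_pop(&my_pool);

        if (!node)
                return;                 /* every object is currently in use */

        node->data++;                   /* ... per-event work ... */

        objpool_push(node, &my_pool);   /* return the object for reuse */
}

static void my_pool_teardown(void)
{
        objpool_free(&my_pool);
}

Because the pop/push pair sits directly on the per-event path, inlining it (patch 1) removes the call overhead, and caching the possible-CPU count (patch 2) removes a repeated bitmap recount inside objpool_pop().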
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations 2024-04-24 21:52 [PATCH 0/2] Objpool performance improvements Andrii Nakryiko @ 2024-04-24 21:52 ` Andrii Nakryiko 2024-05-07 13:55 ` Vlastimil Babka 2024-04-24 21:52 ` [PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids Andrii Nakryiko 2024-04-26 14:25 ` [PATCH 0/2] Objpool performance improvements Masami Hiramatsu 2 siblings, 1 reply; 11+ messages in thread From: Andrii Nakryiko @ 2024-04-24 21:52 UTC (permalink / raw) To: linux-trace-kernel, rostedt, mhiramat; +Cc: bpf, Andrii Nakryiko, Matt Wu objpool_push() and objpool_pop() are very performance-critical functions and can be called very frequently in kretprobe triggering path. As such, it makes sense to allow compiler to inline them completely to eliminate function calls overhead. Luckily, their logic is quite well isolated and doesn't have any sprawling dependencies. This patch moves both objpool_push() and objpool_pop() into include/linux/objpool.h and marks them as static inline functions, enabling inlining. To avoid anyone using internal helpers (objpool_try_get_slot, objpool_try_add_slot), rename them to use leading underscores. We used kretprobe microbenchmark from BPF selftests (bench trig-kprobe and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi programs in a tight loop to evaluate the effect. BPF own overhead in this case is minimal and it mostly stresses the rest of in-kernel kretprobe infrastructure overhead. Results are in millions of calls per second. This is not super scientific, but shows the trend nevertheless. BEFORE ====== kretprobe : 9.794 ± 0.086M/s kretprobe-multi: 10.219 ± 0.032M/s AFTER ===== kretprobe : 9.937 ± 0.174M/s (+1.5%) kretprobe-multi: 10.440 ± 0.108M/s (+2.2%) Cc: Matt (Qiang) Wu <wuqiang.matt@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- include/linux/objpool.h | 101 +++++++++++++++++++++++++++++++++++++++- lib/objpool.c | 100 --------------------------------------- 2 files changed, 99 insertions(+), 102 deletions(-) diff --git a/include/linux/objpool.h b/include/linux/objpool.h index 15aff4a17f0c..d8b1f7b91128 100644 --- a/include/linux/objpool.h +++ b/include/linux/objpool.h @@ -5,6 +5,10 @@ #include <linux/types.h> #include <linux/refcount.h> +#include <linux/atomic.h> +#include <linux/cpumask.h> +#include <linux/irqflags.h> +#include <linux/smp.h> /* * objpool: ring-array based lockless MPMC queue @@ -118,13 +122,94 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, gfp_t gfp, void *context, objpool_init_obj_cb objinit, objpool_fini_cb release); +/* try to retrieve object from slot */ +static inline void *__objpool_try_get_slot(struct objpool_head *pool, int cpu) +{ + struct objpool_slot *slot = pool->cpu_slots[cpu]; + /* load head snapshot, other cpus may change it */ + uint32_t head = smp_load_acquire(&slot->head); + + while (head != READ_ONCE(slot->last)) { + void *obj; + + /* + * data visibility of 'last' and 'head' could be out of + * order since memory updating of 'last' and 'head' are + * performed in push() and pop() independently + * + * before any retrieving attempts, pop() must guarantee + * 'last' is behind 'head', that is to say, there must + * be available objects in slot, which could be ensured + * by condition 'last != head && last - head <= nr_objs' + * that is equivalent to 'last - head - 1 < nr_objs' as + * 'last' and 'head' are both 
unsigned int32 + */ + if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) { + head = READ_ONCE(slot->head); + continue; + } + + /* obj must be retrieved before moving forward head */ + obj = READ_ONCE(slot->entries[head & slot->mask]); + + /* move head forward to mark it's consumption */ + if (try_cmpxchg_release(&slot->head, &head, head + 1)) + return obj; + } + + return NULL; +} + /** * objpool_pop() - allocate an object from objpool * @pool: object pool * * return value: object ptr or NULL if failed */ -void *objpool_pop(struct objpool_head *pool); +static inline void *objpool_pop(struct objpool_head *pool) +{ + void *obj = NULL; + unsigned long flags; + int i, cpu; + + /* disable local irq to avoid preemption & interruption */ + raw_local_irq_save(flags); + + cpu = raw_smp_processor_id(); + for (i = 0; i < num_possible_cpus(); i++) { + obj = __objpool_try_get_slot(pool, cpu); + if (obj) + break; + cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1); + } + raw_local_irq_restore(flags); + + return obj; +} + +/* adding object to slot, abort if the slot was already full */ +static inline int +__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) +{ + struct objpool_slot *slot = pool->cpu_slots[cpu]; + uint32_t head, tail; + + /* loading tail and head as a local snapshot, tail first */ + tail = READ_ONCE(slot->tail); + + do { + head = READ_ONCE(slot->head); + /* fault caught: something must be wrong */ + WARN_ON_ONCE(tail - head > pool->nr_objs); + } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); + + /* now the tail position is reserved for the given obj */ + WRITE_ONCE(slot->entries[tail & slot->mask], obj); + /* update sequence to make this obj available for pop() */ + smp_store_release(&slot->last, tail + 1); + + return 0; +} /** * objpool_push() - reclaim the object and return back to objpool @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool); * return: 0 or error code (it fails only when user tries to push * the same object multiple times or wrong "objects" into objpool) */ -int objpool_push(void *obj, struct objpool_head *pool); +static inline int objpool_push(void *obj, struct objpool_head *pool) +{ + unsigned long flags; + int rc; + + /* disable local irq to avoid preemption & interruption */ + raw_local_irq_save(flags); + rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id()); + raw_local_irq_restore(flags); + + return rc; +} + /** * objpool_drop() - discard the object and deref objpool diff --git a/lib/objpool.c b/lib/objpool.c index cfdc02420884..f696308fc026 100644 --- a/lib/objpool.c +++ b/lib/objpool.c @@ -152,106 +152,6 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, } EXPORT_SYMBOL_GPL(objpool_init); -/* adding object to slot, abort if the slot was already full */ -static inline int -objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) -{ - struct objpool_slot *slot = pool->cpu_slots[cpu]; - uint32_t head, tail; - - /* loading tail and head as a local snapshot, tail first */ - tail = READ_ONCE(slot->tail); - - do { - head = READ_ONCE(slot->head); - /* fault caught: something must be wrong */ - WARN_ON_ONCE(tail - head > pool->nr_objs); - } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); - - /* now the tail position is reserved for the given obj */ - WRITE_ONCE(slot->entries[tail & slot->mask], obj); - /* update sequence to make this obj available for pop() */ - smp_store_release(&slot->last, tail + 1); - - return 0; -} - -/* reclaim an object to object pool */ 
-int objpool_push(void *obj, struct objpool_head *pool) -{ - unsigned long flags; - int rc; - - /* disable local irq to avoid preemption & interruption */ - raw_local_irq_save(flags); - rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id()); - raw_local_irq_restore(flags); - - return rc; -} -EXPORT_SYMBOL_GPL(objpool_push); - -/* try to retrieve object from slot */ -static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu) -{ - struct objpool_slot *slot = pool->cpu_slots[cpu]; - /* load head snapshot, other cpus may change it */ - uint32_t head = smp_load_acquire(&slot->head); - - while (head != READ_ONCE(slot->last)) { - void *obj; - - /* - * data visibility of 'last' and 'head' could be out of - * order since memory updating of 'last' and 'head' are - * performed in push() and pop() independently - * - * before any retrieving attempts, pop() must guarantee - * 'last' is behind 'head', that is to say, there must - * be available objects in slot, which could be ensured - * by condition 'last != head && last - head <= nr_objs' - * that is equivalent to 'last - head - 1 < nr_objs' as - * 'last' and 'head' are both unsigned int32 - */ - if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) { - head = READ_ONCE(slot->head); - continue; - } - - /* obj must be retrieved before moving forward head */ - obj = READ_ONCE(slot->entries[head & slot->mask]); - - /* move head forward to mark it's consumption */ - if (try_cmpxchg_release(&slot->head, &head, head + 1)) - return obj; - } - - return NULL; -} - -/* allocate an object from object pool */ -void *objpool_pop(struct objpool_head *pool) -{ - void *obj = NULL; - unsigned long flags; - int i, cpu; - - /* disable local irq to avoid preemption & interruption */ - raw_local_irq_save(flags); - - cpu = raw_smp_processor_id(); - for (i = 0; i < num_possible_cpus(); i++) { - obj = objpool_try_get_slot(pool, cpu); - if (obj) - break; - cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1); - } - raw_local_irq_restore(flags); - - return obj; -} -EXPORT_SYMBOL_GPL(objpool_pop); - /* release whole objpool forcely */ void objpool_free(struct objpool_head *pool) { -- 2.43.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
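A side note on the 'unsigned int32' reasoning in the comment above: the single comparison used in __objpool_try_get_slot() really is equivalent to the two-part condition described there, including when the 32-bit head/last counters wrap around. The small self-contained userspace sketch below (plain C, not kernel code; the case values are made up for illustration) checks both forms against a few snapshots:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* the two-part condition from the comment: "last != head && last - head <= nr_objs" */
static int avail_two_checks(uint32_t last, uint32_t head, uint32_t nr_objs)
{
        return last != head && (uint32_t)(last - head) <= nr_objs;
}

/* the single check whose negation is used in __objpool_try_get_slot() */
static int avail_single_check(uint32_t last, uint32_t head, uint32_t nr_objs)
{
        return (uint32_t)(last - head - 1) < nr_objs;
}

int main(void)
{
        const uint32_t nr_objs = 16;
        const struct { uint32_t last, head; } cases[] = {
                { 5, 5 },                       /* empty slot: last == head */
                { 8, 5 },                       /* 3 objects available */
                { 2, UINT32_MAX - 1 },          /* counters wrapped: 4 available */
                { 5, 8 },                       /* stale snapshot: head ran past last */
        };

        for (unsigned i = 0; i < sizeof(cases) / sizeof(cases[0]); i++) {
                uint32_t l = cases[i].last, h = cases[i].head;

                printf("last=%" PRIu32 " head=%" PRIu32 ": two-part=%d single=%d\n",
                       l, h, avail_two_checks(l, h, nr_objs),
                       avail_single_check(l, h, nr_objs));
        }
        return 0;
}

Both forms agree on every case, which is why the kernel code can get away with one unsigned comparison before re-reading head and retrying.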
* Re: [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations 2024-04-24 21:52 ` [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations Andrii Nakryiko @ 2024-05-07 13:55 ` Vlastimil Babka 2024-05-10 7:59 ` wuqiang.matt 0 siblings, 1 reply; 11+ messages in thread From: Vlastimil Babka @ 2024-05-07 13:55 UTC (permalink / raw) To: Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat Cc: bpf, Matt Wu, linux-mm@kvack.org On 4/24/24 11:52 PM, Andrii Nakryiko wrote: > objpool_push() and objpool_pop() are very performance-critical functions > and can be called very frequently in kretprobe triggering path. > > As such, it makes sense to allow compiler to inline them completely to > eliminate function calls overhead. Luckily, their logic is quite well > isolated and doesn't have any sprawling dependencies. > > This patch moves both objpool_push() and objpool_pop() into > include/linux/objpool.h and marks them as static inline functions, > enabling inlining. To avoid anyone using internal helpers > (objpool_try_get_slot, objpool_try_add_slot), rename them to use leading > underscores. > > We used kretprobe microbenchmark from BPF selftests (bench trig-kprobe > and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi > programs in a tight loop to evaluate the effect. BPF own overhead in > this case is minimal and it mostly stresses the rest of in-kernel > kretprobe infrastructure overhead. Results are in millions of calls per > second. This is not super scientific, but shows the trend nevertheless. > > BEFORE > ====== > kretprobe : 9.794 ± 0.086M/s > kretprobe-multi: 10.219 ± 0.032M/s > > AFTER > ===== > kretprobe : 9.937 ± 0.174M/s (+1.5%) > kretprobe-multi: 10.440 ± 0.108M/s (+2.2%) > > Cc: Matt (Qiang) Wu <wuqiang.matt@bytedance.com> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Hello, this question is not specific to your patch, but since it's a recent thread, I'll ask it here instead of digging up the original objpool patches. I'm trying to understand how objpool works and if it could be integrated into SLUB, for the LSF/MM discussion next week: https://lore.kernel.org/all/b929d5fb-8e88-4f23-8ec7-6bdaf61f84f9@suse.cz/ > +/* adding object to slot, abort if the slot was already full */ I don't see any actual abort in the code (not in this code nor in the deleted code - it's the same code, just moved for inlining purposes). > +static inline int > +__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) > +{ > + struct objpool_slot *slot = pool->cpu_slots[cpu]; > + uint32_t head, tail; > + > + /* loading tail and head as a local snapshot, tail first */ > + tail = READ_ONCE(slot->tail); > + > + do { > + head = READ_ONCE(slot->head); > + /* fault caught: something must be wrong */ > + WARN_ON_ONCE(tail - head > pool->nr_objs); So this will only WARN if we go over the capacity, but continue and overwrite a pointer that was already there, effectively leaking said object, no? 
> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); > + > + /* now the tail position is reserved for the given obj */ > + WRITE_ONCE(slot->entries[tail & slot->mask], obj); > + /* update sequence to make this obj available for pop() */ > + smp_store_release(&slot->last, tail + 1); > + > + return 0; > +} > > /** > * objpool_push() - reclaim the object and return back to objpool > @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool); > * return: 0 or error code (it fails only when user tries to push > * the same object multiple times or wrong "objects" into objpool) > */ > -int objpool_push(void *obj, struct objpool_head *pool); > +static inline int objpool_push(void *obj, struct objpool_head *pool) > +{ > + unsigned long flags; > + int rc; > + > + /* disable local irq to avoid preemption & interruption */ > + raw_local_irq_save(flags); > + rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id()); And IIUC, we could in theory objpool_pop() on one cpu, then later another cpu might do objpool_push() and cause the latter cpu's pool to go over capacity? Is there some implicit requirements of objpool users to take care of having matched cpu for pop and push? Are the current objpool users obeying this requirement? (I can see the selftests do, not sure about the actual users). Or am I missing something? Thanks. > + raw_local_irq_restore(flags); > + > + return rc; > +} > + > > /** > * objpool_drop() - discard the object and deref objpool > diff --git a/lib/objpool.c b/lib/objpool.c > index cfdc02420884..f696308fc026 100644 > --- a/lib/objpool.c > +++ b/lib/objpool.c > @@ -152,106 +152,6 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, > } > EXPORT_SYMBOL_GPL(objpool_init); > > -/* adding object to slot, abort if the slot was already full */ > -static inline int > -objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) > -{ > - struct objpool_slot *slot = pool->cpu_slots[cpu]; > - uint32_t head, tail; > - > - /* loading tail and head as a local snapshot, tail first */ > - tail = READ_ONCE(slot->tail); > - > - do { > - head = READ_ONCE(slot->head); > - /* fault caught: something must be wrong */ > - WARN_ON_ONCE(tail - head > pool->nr_objs); > - } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); > - > - /* now the tail position is reserved for the given obj */ > - WRITE_ONCE(slot->entries[tail & slot->mask], obj); > - /* update sequence to make this obj available for pop() */ > - smp_store_release(&slot->last, tail + 1); > - > - return 0; > -} > - > -/* reclaim an object to object pool */ > -int objpool_push(void *obj, struct objpool_head *pool) > -{ > - unsigned long flags; > - int rc; > - > - /* disable local irq to avoid preemption & interruption */ > - raw_local_irq_save(flags); > - rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id()); > - raw_local_irq_restore(flags); > - > - return rc; > -} > -EXPORT_SYMBOL_GPL(objpool_push); > - > -/* try to retrieve object from slot */ > -static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu) > -{ > - struct objpool_slot *slot = pool->cpu_slots[cpu]; > - /* load head snapshot, other cpus may change it */ > - uint32_t head = smp_load_acquire(&slot->head); > - > - while (head != READ_ONCE(slot->last)) { > - void *obj; > - > - /* > - * data visibility of 'last' and 'head' could be out of > - * order since memory updating of 'last' and 'head' are > - * performed in push() and pop() independently > - * > - * before any retrieving attempts, pop() must 
guarantee > - * 'last' is behind 'head', that is to say, there must > - * be available objects in slot, which could be ensured > - * by condition 'last != head && last - head <= nr_objs' > - * that is equivalent to 'last - head - 1 < nr_objs' as > - * 'last' and 'head' are both unsigned int32 > - */ > - if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) { > - head = READ_ONCE(slot->head); > - continue; > - } > - > - /* obj must be retrieved before moving forward head */ > - obj = READ_ONCE(slot->entries[head & slot->mask]); > - > - /* move head forward to mark it's consumption */ > - if (try_cmpxchg_release(&slot->head, &head, head + 1)) > - return obj; > - } > - > - return NULL; > -} > - > -/* allocate an object from object pool */ > -void *objpool_pop(struct objpool_head *pool) > -{ > - void *obj = NULL; > - unsigned long flags; > - int i, cpu; > - > - /* disable local irq to avoid preemption & interruption */ > - raw_local_irq_save(flags); > - > - cpu = raw_smp_processor_id(); > - for (i = 0; i < num_possible_cpus(); i++) { > - obj = objpool_try_get_slot(pool, cpu); > - if (obj) > - break; > - cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1); > - } > - raw_local_irq_restore(flags); > - > - return obj; > -} > -EXPORT_SYMBOL_GPL(objpool_pop); > - > /* release whole objpool forcely */ > void objpool_free(struct objpool_head *pool) > { ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations 2024-05-07 13:55 ` Vlastimil Babka @ 2024-05-10 7:59 ` wuqiang.matt 2024-05-10 8:20 ` Vlastimil Babka 0 siblings, 1 reply; 11+ messages in thread From: wuqiang.matt @ 2024-05-10 7:59 UTC (permalink / raw) To: Vlastimil Babka, Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat Cc: bpf, linux-mm@kvack.org On 2024/5/7 21:55, Vlastimil Babka wrote: > On 4/24/24 11:52 PM, Andrii Nakryiko wrote: >> objpool_push() and objpool_pop() are very performance-critical functions >> and can be called very frequently in kretprobe triggering path. >> >> As such, it makes sense to allow compiler to inline them completely to >> eliminate function calls overhead. Luckily, their logic is quite well >> isolated and doesn't have any sprawling dependencies. >> >> This patch moves both objpool_push() and objpool_pop() into >> include/linux/objpool.h and marks them as static inline functions, >> enabling inlining. To avoid anyone using internal helpers >> (objpool_try_get_slot, objpool_try_add_slot), rename them to use leading >> underscores. >> >> We used kretprobe microbenchmark from BPF selftests (bench trig-kprobe >> and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi >> programs in a tight loop to evaluate the effect. BPF own overhead in >> this case is minimal and it mostly stresses the rest of in-kernel >> kretprobe infrastructure overhead. Results are in millions of calls per >> second. This is not super scientific, but shows the trend nevertheless. >> >> BEFORE >> ====== >> kretprobe : 9.794 ± 0.086M/s >> kretprobe-multi: 10.219 ± 0.032M/s >> >> AFTER >> ===== >> kretprobe : 9.937 ± 0.174M/s (+1.5%) >> kretprobe-multi: 10.440 ± 0.108M/s (+2.2%) >> >> Cc: Matt (Qiang) Wu <wuqiang.matt@bytedance.com> >> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > Hello, > Hi Vlastimil, > this question is not specific to your patch, but since it's a recent thread, > I'll ask it here instead of digging up the original objpool patches. > > I'm trying to understand how objpool works and if it could be integrated > into SLUB, for the LSF/MM discussion next week: > > https://lore.kernel.org/all/b929d5fb-8e88-4f23-8ec7-6bdaf61f84f9@suse.cz/ > >> +/* adding object to slot, abort if the slot was already full */ > > I don't see any actual abort in the code (not in this code nor in the > deleted code - it's the same code, just moved for inlining purposes). > >> +static inline int >> +__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) >> +{ >> + struct objpool_slot *slot = pool->cpu_slots[cpu]; >> + uint32_t head, tail; >> + >> + /* loading tail and head as a local snapshot, tail first */ >> + tail = READ_ONCE(slot->tail); >> + >> + do { >> + head = READ_ONCE(slot->head); >> + /* fault caught: something must be wrong */ >> + WARN_ON_ONCE(tail - head > pool->nr_objs); > > So this will only WARN if we go over the capacity, but continue and > overwrite a pointer that was already there, effectively leaking said object, no? Yes, the WARN is only for error-catch in caller side (try to push one same object multiple times for example). 
> >> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); >> + >> + /* now the tail position is reserved for the given obj */ >> + WRITE_ONCE(slot->entries[tail & slot->mask], obj); >> + /* update sequence to make this obj available for pop() */ >> + smp_store_release(&slot->last, tail + 1); >> + >> + return 0; >> +} >> >> /** >> * objpool_push() - reclaim the object and return back to objpool >> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool); >> * return: 0 or error code (it fails only when user tries to push >> * the same object multiple times or wrong "objects" into objpool) >> */ >> -int objpool_push(void *obj, struct objpool_head *pool); >> +static inline int objpool_push(void *obj, struct objpool_head *pool) >> +{ >> + unsigned long flags; >> + int rc; >> + >> + /* disable local irq to avoid preemption & interruption */ >> + raw_local_irq_save(flags); >> + rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id()); > > And IIUC, we could in theory objpool_pop() on one cpu, then later another > cpu might do objpool_push() and cause the latter cpu's pool to go over > capacity? Is there some implicit requirements of objpool users to take care > of having matched cpu for pop and push? Are the current objpool users > obeying this requirement? (I can see the selftests do, not sure about the > actual users). > Or am I missing something? Thanks. The objects are all pre-allocated along with creation of the new objpool and the total number of objects never exceeds the capacity on local node. So objpool_push() would always find an available slot from the ring-array for the given object to insert back. objpool_pop() would try looping all the percpu slots until an object is found or whole objpool is empty. Currently kretprobe is the only actual usecase of objpool. I'm testing an updated objpool in our HIDS project for critical pathes, which is widely deployed on servers inside my company. The new version eliminates the raw_local_irq_save and raw_local_irq_restore pair of objpool_push and gains up to 5% of performance boost. 
> >> + raw_local_irq_restore(flags); >> + >> + return rc; >> +} >> + >> >> /** >> * objpool_drop() - discard the object and deref objpool >> diff --git a/lib/objpool.c b/lib/objpool.c >> index cfdc02420884..f696308fc026 100644 >> --- a/lib/objpool.c >> +++ b/lib/objpool.c >> @@ -152,106 +152,6 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, >> } >> EXPORT_SYMBOL_GPL(objpool_init); >> >> -/* adding object to slot, abort if the slot was already full */ >> -static inline int >> -objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) >> -{ >> - struct objpool_slot *slot = pool->cpu_slots[cpu]; >> - uint32_t head, tail; >> - >> - /* loading tail and head as a local snapshot, tail first */ >> - tail = READ_ONCE(slot->tail); >> - >> - do { >> - head = READ_ONCE(slot->head); >> - /* fault caught: something must be wrong */ >> - WARN_ON_ONCE(tail - head > pool->nr_objs); >> - } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); >> - >> - /* now the tail position is reserved for the given obj */ >> - WRITE_ONCE(slot->entries[tail & slot->mask], obj); >> - /* update sequence to make this obj available for pop() */ >> - smp_store_release(&slot->last, tail + 1); >> - >> - return 0; >> -} >> - >> -/* reclaim an object to object pool */ >> -int objpool_push(void *obj, struct objpool_head *pool) >> -{ >> - unsigned long flags; >> - int rc; >> - >> - /* disable local irq to avoid preemption & interruption */ >> - raw_local_irq_save(flags); >> - rc = objpool_try_add_slot(obj, pool, raw_smp_processor_id()); >> - raw_local_irq_restore(flags); >> - >> - return rc; >> -} >> -EXPORT_SYMBOL_GPL(objpool_push); >> - >> -/* try to retrieve object from slot */ >> -static inline void *objpool_try_get_slot(struct objpool_head *pool, int cpu) >> -{ >> - struct objpool_slot *slot = pool->cpu_slots[cpu]; >> - /* load head snapshot, other cpus may change it */ >> - uint32_t head = smp_load_acquire(&slot->head); >> - >> - while (head != READ_ONCE(slot->last)) { >> - void *obj; >> - >> - /* >> - * data visibility of 'last' and 'head' could be out of >> - * order since memory updating of 'last' and 'head' are >> - * performed in push() and pop() independently >> - * >> - * before any retrieving attempts, pop() must guarantee >> - * 'last' is behind 'head', that is to say, there must >> - * be available objects in slot, which could be ensured >> - * by condition 'last != head && last - head <= nr_objs' >> - * that is equivalent to 'last - head - 1 < nr_objs' as >> - * 'last' and 'head' are both unsigned int32 >> - */ >> - if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) { >> - head = READ_ONCE(slot->head); >> - continue; >> - } >> - >> - /* obj must be retrieved before moving forward head */ >> - obj = READ_ONCE(slot->entries[head & slot->mask]); >> - >> - /* move head forward to mark it's consumption */ >> - if (try_cmpxchg_release(&slot->head, &head, head + 1)) >> - return obj; >> - } >> - >> - return NULL; >> -} >> - >> -/* allocate an object from object pool */ >> -void *objpool_pop(struct objpool_head *pool) >> -{ >> - void *obj = NULL; >> - unsigned long flags; >> - int i, cpu; >> - >> - /* disable local irq to avoid preemption & interruption */ >> - raw_local_irq_save(flags); >> - >> - cpu = raw_smp_processor_id(); >> - for (i = 0; i < num_possible_cpus(); i++) { >> - obj = objpool_try_get_slot(pool, cpu); >> - if (obj) >> - break; >> - cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1); >> - } >> - raw_local_irq_restore(flags); >> - >> - return obj; >> -} >> 
-EXPORT_SYMBOL_GPL(objpool_pop); >> - >> /* release whole objpool forcely */ >> void objpool_free(struct objpool_head *pool) >> { > Regards, Matt Wu ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations 2024-05-10 7:59 ` wuqiang.matt @ 2024-05-10 8:20 ` Vlastimil Babka 2024-05-10 9:15 ` wuqiang.matt 2024-05-28 16:41 ` Masami Hiramatsu 0 siblings, 2 replies; 11+ messages in thread From: Vlastimil Babka @ 2024-05-10 8:20 UTC (permalink / raw) To: wuqiang.matt, Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat Cc: bpf, linux-mm@kvack.org On 5/10/24 9:59 AM, wuqiang.matt wrote: > On 2024/5/7 21:55, Vlastimil Babka wrote: >> >>> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); >>> + >>> + /* now the tail position is reserved for the given obj */ >>> + WRITE_ONCE(slot->entries[tail & slot->mask], obj); >>> + /* update sequence to make this obj available for pop() */ >>> + smp_store_release(&slot->last, tail + 1); >>> + >>> + return 0; >>> +} >>> >>> /** >>> * objpool_push() - reclaim the object and return back to objpool >>> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool); >>> * return: 0 or error code (it fails only when user tries to push >>> * the same object multiple times or wrong "objects" into objpool) >>> */ >>> -int objpool_push(void *obj, struct objpool_head *pool); >>> +static inline int objpool_push(void *obj, struct objpool_head *pool) >>> +{ >>> + unsigned long flags; >>> + int rc; >>> + >>> + /* disable local irq to avoid preemption & interruption */ >>> + raw_local_irq_save(flags); >>> + rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id()); >> >> And IIUC, we could in theory objpool_pop() on one cpu, then later another >> cpu might do objpool_push() and cause the latter cpu's pool to go over >> capacity? Is there some implicit requirements of objpool users to take care >> of having matched cpu for pop and push? Are the current objpool users >> obeying this requirement? (I can see the selftests do, not sure about the >> actual users). >> Or am I missing something? Thanks. > > The objects are all pre-allocated along with creation of the new objpool > and the total number of objects never exceeds the capacity on local node. Aha, I see, the capacity of entries is enough to hold objects from all nodes in the most unfortunate case they all end up freed from a single cpu. > So objpool_push() would always find an available slot from the ring-array > for the given object to insert back. objpool_pop() would try looping all > the percpu slots until an object is found or whole objpool is empty. So it's correct, but seems rather wasteful to have the whole capacity for entries replicated on every cpu? It does make objpool_push() simple and fast, but as you say, objpool_pop() still has to search potentially all non-local percpu slots, with disabled irqs, which is far from ideal. And the "abort if the slot was already full" comment for objpool_try_add_slot() seems still misleading? Maybe that was your initial idea but changed later? > Currently kretprobe is the only actual usecase of objpool. > > I'm testing an updated objpool in our HIDS project for critical pathes, > which is widely deployed on servers inside my company. The new version > eliminates the raw_local_irq_save and raw_local_irq_restore pair of > objpool_push and gains up to 5% of performance boost. Mind Ccing me and linux-mm once you are posting that? Thanks, Vlastimil ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations 2024-05-10 8:20 ` Vlastimil Babka @ 2024-05-10 9:15 ` wuqiang.matt 2024-05-28 16:41 ` Masami Hiramatsu 1 sibling, 0 replies; 11+ messages in thread From: wuqiang.matt @ 2024-05-10 9:15 UTC (permalink / raw) To: Vlastimil Babka, Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat Cc: bpf, linux-mm@kvack.org On 2024/5/10 16:20, Vlastimil Babka wrote: > On 5/10/24 9:59 AM, wuqiang.matt wrote: >> On 2024/5/7 21:55, Vlastimil Babka wrote: > >> >>>> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); >>>> + >>>> + /* now the tail position is reserved for the given obj */ >>>> + WRITE_ONCE(slot->entries[tail & slot->mask], obj); >>>> + /* update sequence to make this obj available for pop() */ >>>> + smp_store_release(&slot->last, tail + 1); >>>> + >>>> + return 0; >>>> +} >>>> >>>> /** >>>> * objpool_push() - reclaim the object and return back to objpool >>>> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool); >>>> * return: 0 or error code (it fails only when user tries to push >>>> * the same object multiple times or wrong "objects" into objpool) >>>> */ >>>> -int objpool_push(void *obj, struct objpool_head *pool); >>>> +static inline int objpool_push(void *obj, struct objpool_head *pool) >>>> +{ >>>> + unsigned long flags; >>>> + int rc; >>>> + >>>> + /* disable local irq to avoid preemption & interruption */ >>>> + raw_local_irq_save(flags); >>>> + rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id()); >>> >>> And IIUC, we could in theory objpool_pop() on one cpu, then later another >>> cpu might do objpool_push() and cause the latter cpu's pool to go over >>> capacity? Is there some implicit requirements of objpool users to take care >>> of having matched cpu for pop and push? Are the current objpool users >>> obeying this requirement? (I can see the selftests do, not sure about the >>> actual users). >>> Or am I missing something? Thanks. >> >> The objects are all pre-allocated along with creation of the new objpool >> and the total number of objects never exceeds the capacity on local node. > > Aha, I see, the capacity of entries is enough to hold objects from all nodes > in the most unfortunate case they all end up freed from a single cpu. > >> So objpool_push() would always find an available slot from the ring-array >> for the given object to insert back. objpool_pop() would try looping all >> the percpu slots until an object is found or whole objpool is empty. > > So it's correct, but seems rather wasteful to have the whole capacity for > entries replicated on every cpu? It does make objpool_push() simple and > fast, but as you say, objpool_pop() still has to search potentially all > non-local percpu slots, with disabled irqs, which is far from ideal. Yes, it's a trade-off between performance and memory usage, with a slight increase of memory consumption for a significant improvement of performance. The reason of disabling local irqs is objpool uses a 32bit sequence number as the state description of each element. It could likely overflow and go back with the same value for extreme cases. 64bit value could eliminate the collision but seems too heavy. > And the "abort if the slot was already full" comment for > objpool_try_add_slot() seems still misleading? Maybe that was your initial > idea but changed later? Right, the comments are just left unchanged during iterations. 
The original implementation kept each percpu ring-array very compact and objpool_push will try looping all cpu nodes to return the given object to objpool. Actually my new update would remove objpool_try_add_slot and integrate it's functionality into objpool_push. I'll submit the new patch when I finish the verification. > >> Currently kretprobe is the only actual usecase of objpool. >> >> I'm testing an updated objpool in our HIDS project for critical pathes, >> which is widely deployed on servers inside my company. The new version >> eliminates the raw_local_irq_save and raw_local_irq_restore pair of >> objpool_push and gains up to 5% of performance boost. > > Mind Ccing me and linux-mm once you are posting that? Sure, I'll make sure to let you know. > Thanks, > Vlastimil > Regards, Matt Wu ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations 2024-05-10 8:20 ` Vlastimil Babka 2024-05-10 9:15 ` wuqiang.matt @ 2024-05-28 16:41 ` Masami Hiramatsu 1 sibling, 0 replies; 11+ messages in thread From: Masami Hiramatsu @ 2024-05-28 16:41 UTC (permalink / raw) To: Vlastimil Babka Cc: wuqiang.matt, Andrii Nakryiko, linux-trace-kernel, rostedt, mhiramat, bpf, linux-mm@kvack.org Hi, Sorry for late reply. On Fri, 10 May 2024 10:20:56 +0200 Vlastimil Babka <vbabka@suse.cz> wrote: > On 5/10/24 9:59 AM, wuqiang.matt wrote: > > On 2024/5/7 21:55, Vlastimil Babka wrote: > >> > >>> + } while (!try_cmpxchg_acquire(&slot->tail, &tail, tail + 1)); > >>> + > >>> + /* now the tail position is reserved for the given obj */ > >>> + WRITE_ONCE(slot->entries[tail & slot->mask], obj); > >>> + /* update sequence to make this obj available for pop() */ > >>> + smp_store_release(&slot->last, tail + 1); > >>> + > >>> + return 0; > >>> +} > >>> > >>> /** > >>> * objpool_push() - reclaim the object and return back to objpool > >>> @@ -134,7 +219,19 @@ void *objpool_pop(struct objpool_head *pool); > >>> * return: 0 or error code (it fails only when user tries to push > >>> * the same object multiple times or wrong "objects" into objpool) > >>> */ > >>> -int objpool_push(void *obj, struct objpool_head *pool); > >>> +static inline int objpool_push(void *obj, struct objpool_head *pool) > >>> +{ > >>> + unsigned long flags; > >>> + int rc; > >>> + > >>> + /* disable local irq to avoid preemption & interruption */ > >>> + raw_local_irq_save(flags); > >>> + rc = __objpool_try_add_slot(obj, pool, raw_smp_processor_id()); > >> > >> And IIUC, we could in theory objpool_pop() on one cpu, then later another > >> cpu might do objpool_push() and cause the latter cpu's pool to go over > >> capacity? Is there some implicit requirements of objpool users to take care > >> of having matched cpu for pop and push? Are the current objpool users > >> obeying this requirement? (I can see the selftests do, not sure about the > >> actual users). > >> Or am I missing something? Thanks. > > > > The objects are all pre-allocated along with creation of the new objpool > > and the total number of objects never exceeds the capacity on local node. > > Aha, I see, the capacity of entries is enough to hold objects from all nodes > in the most unfortunate case they all end up freed from a single cpu. > > > So objpool_push() would always find an available slot from the ring-array > > for the given object to insert back. objpool_pop() would try looping all > > the percpu slots until an object is found or whole objpool is empty. > > So it's correct, but seems rather wasteful to have the whole capacity for > entries replicated on every cpu? It does make objpool_push() simple and > fast, but as you say, objpool_pop() still has to search potentially all > non-local percpu slots, with disabled irqs, which is far from ideal. For the kretprobe/fprobe use-case, it is important to push (return) object fast. We can reservce enough number of objects when registering but push operation will happen always on random CPU. > > And the "abort if the slot was already full" comment for > objpool_try_add_slot() seems still misleading? Maybe that was your initial > idea but changed later? Ah, it should not happen... > > > Currently kretprobe is the only actual usecase of objpool. 
Note that fprobe is also using this objpool, but currently I'm working on integrating fprobe on function-graph tracer[1] which will make fprobe not using objpool. And also I'm planning to replace kretprobe with the new fprobe eventually. So if SLUB will use objpool for frontend caching, it sounds good to me. (Maybe it can speed up the object allocation/free) > > > > I'm testing an updated objpool in our HIDS project for critical pathes, > > which is widely deployed on servers inside my company. The new version > > eliminates the raw_local_irq_save and raw_local_irq_restore pair of > > objpool_push and gains up to 5% of performance boost. > > Mind Ccing me and linux-mm once you are posting that? Can you add me too? Thank you, > > Thanks, > Vlastimil > -- Masami Hiramatsu (Google) <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids 2024-04-24 21:52 [PATCH 0/2] Objpool performance improvements Andrii Nakryiko 2024-04-24 21:52 ` [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations Andrii Nakryiko @ 2024-04-24 21:52 ` Andrii Nakryiko 2024-05-28 20:30 ` Jiri Olsa 2024-04-26 14:25 ` [PATCH 0/2] Objpool performance improvements Masami Hiramatsu 2 siblings, 1 reply; 11+ messages in thread From: Andrii Nakryiko @ 2024-04-24 21:52 UTC (permalink / raw) To: linux-trace-kernel, rostedt, mhiramat; +Cc: bpf, Andrii Nakryiko, Matt Wu Profiling shows that calling nr_possible_cpus() in objpool_pop() takes a noticeable amount of CPU (when profiled on 80-core machine), as we need to recalculate number of set bits in a CPU bit mask. This number can't change, so there is no point in paying the price for recalculating it. As such, cache this value in struct objpool_head and use it in objpool_pop(). On the other hand, cached pool->nr_cpus isn't necessary, as it's not used in hot path and is also a pretty trivial value to retrieve. So drop pool->nr_cpus in favor of using nr_cpu_ids everywhere. This way the size of struct objpool_head remains the same, which is a nice bonus. Same BPF selftests benchmarks were used to evaluate the effect. Using changes in previous patch (inlining of objpool_pop/objpool_push) as baseline, here are the differences: BASELINE ======== kretprobe : 9.937 ± 0.174M/s kretprobe-multi: 10.440 ± 0.108M/s AFTER ===== kretprobe : 10.106 ± 0.120M/s (+1.7%) kretprobe-multi: 10.515 ± 0.180M/s (+0.7%) Cc: Matt (Qiang) Wu <wuqiang.matt@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- include/linux/objpool.h | 6 +++--- lib/objpool.c | 12 ++++++------ 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/linux/objpool.h b/include/linux/objpool.h index d8b1f7b91128..cb1758eaa2d3 100644 --- a/include/linux/objpool.h +++ b/include/linux/objpool.h @@ -73,7 +73,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); * struct objpool_head - object pooling metadata * @obj_size: object size, aligned to sizeof(void *) * @nr_objs: total objs (to be pre-allocated with objpool) - * @nr_cpus: local copy of nr_cpu_ids + * @nr_possible_cpus: cached value of num_possible_cpus() * @capacity: max objs can be managed by one objpool_slot * @gfp: gfp flags for kmalloc & vmalloc * @ref: refcount of objpool @@ -85,7 +85,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); struct objpool_head { int obj_size; int nr_objs; - int nr_cpus; + int nr_possible_cpus; int capacity; gfp_t gfp; refcount_t ref; @@ -176,7 +176,7 @@ static inline void *objpool_pop(struct objpool_head *pool) raw_local_irq_save(flags); cpu = raw_smp_processor_id(); - for (i = 0; i < num_possible_cpus(); i++) { + for (i = 0; i < pool->nr_possible_cpus; i++) { obj = __objpool_try_get_slot(pool, cpu); if (obj) break; diff --git a/lib/objpool.c b/lib/objpool.c index f696308fc026..234f9d0bd081 100644 --- a/lib/objpool.c +++ b/lib/objpool.c @@ -50,7 +50,7 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, { int i, cpu_count = 0; - for (i = 0; i < pool->nr_cpus; i++) { + for (i = 0; i < nr_cpu_ids; i++) { struct objpool_slot *slot; int nodes, size, rc; @@ -60,8 +60,8 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, continue; /* compute how many objects to be allocated with this slot */ - nodes = nr_objs / num_possible_cpus(); - if (cpu_count < (nr_objs % 
num_possible_cpus())) + nodes = nr_objs / pool->nr_possible_cpus; + if (cpu_count < (nr_objs % pool->nr_possible_cpus)) nodes++; cpu_count++; @@ -103,7 +103,7 @@ static void objpool_fini_percpu_slots(struct objpool_head *pool) if (!pool->cpu_slots) return; - for (i = 0; i < pool->nr_cpus; i++) + for (i = 0; i < nr_cpu_ids; i++) kvfree(pool->cpu_slots[i]); kfree(pool->cpu_slots); } @@ -130,13 +130,13 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, /* initialize objpool pool */ memset(pool, 0, sizeof(struct objpool_head)); - pool->nr_cpus = nr_cpu_ids; + pool->nr_possible_cpus = num_possible_cpus(); pool->obj_size = object_size; pool->capacity = capacity; pool->gfp = gfp & ~__GFP_ZERO; pool->context = context; pool->release = release; - slot_size = pool->nr_cpus * sizeof(struct objpool_slot); + slot_size = nr_cpu_ids * sizeof(struct objpool_slot); pool->cpu_slots = kzalloc(slot_size, pool->gfp); if (!pool->cpu_slots) return -ENOMEM; -- 2.43.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
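For intuition about the cost being removed here: per the commit message, num_possible_cpus() recounts the set bits of the possible-CPU mask on every call. The userspace model below (the bitmap size and popcount loop are assumptions; this is not the kernel implementation) contrasts the pre-patch shape, where the loop bound is recomputed per call, with the post-patch shape, where the value is computed once and cached. Build with something like cc -O2 and compare the two timings:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NR_CPU_WORDS 16                 /* models a mask covering up to 1024 CPUs */

/* volatile so the recount cannot be hoisted out of the timed loop below */
static volatile uint64_t possible_mask[NR_CPU_WORDS] = {
        UINT64_MAX, UINT64_MAX          /* 128 possible CPUs */
};

/* models num_possible_cpus(): popcount over the whole mask on every call */
static int recount_possible_cpus(void)
{
        int i, w = 0;

        for (i = 0; i < NR_CPU_WORDS; i++)
                w += __builtin_popcountll(possible_mask[i]);
        return w;
}

static double seconds(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
        const long iters = 50 * 1000 * 1000;
        volatile long sink = 0;
        double t;
        long i;

        /* pre-patch shape: the bound is recomputed on every iteration */
        t = seconds();
        for (i = 0; i < iters; i++)
                sink += recount_possible_cpus();
        printf("recount every call: %.2fs\n", seconds() - t);

        /* post-patch shape: computed once, then read from a cached field */
        const int nr_possible_cpus = recount_possible_cpus();

        t = seconds();
        for (i = 0; i < iters; i++)
                sink += nr_possible_cpus;
        printf("cached value      : %.2fs\n", seconds() - t);

        return (int)(sink & 1);
}

nr_cpu_ids, by contrast, is a plain variable read, which is why the patch is happy to use it directly on the init/free slow paths instead of caching it.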
* Re: [PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids 2024-04-24 21:52 ` [PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids Andrii Nakryiko @ 2024-05-28 20:30 ` Jiri Olsa 0 siblings, 0 replies; 11+ messages in thread From: Jiri Olsa @ 2024-05-28 20:30 UTC (permalink / raw) To: Andrii Nakryiko; +Cc: linux-trace-kernel, rostedt, mhiramat, bpf, Matt Wu On Wed, Apr 24, 2024 at 02:52:14PM -0700, Andrii Nakryiko wrote: > Profiling shows that calling nr_possible_cpus() in objpool_pop() takes > a noticeable amount of CPU (when profiled on 80-core machine), as we > need to recalculate number of set bits in a CPU bit mask. This number > can't change, so there is no point in paying the price for recalculating > it. As such, cache this value in struct objpool_head and use it in > objpool_pop(). > > On the other hand, cached pool->nr_cpus isn't necessary, as it's not > used in hot path and is also a pretty trivial value to retrieve. So drop > pool->nr_cpus in favor of using nr_cpu_ids everywhere. This way the size > of struct objpool_head remains the same, which is a nice bonus. > > Same BPF selftests benchmarks were used to evaluate the effect. Using > changes in previous patch (inlining of objpool_pop/objpool_push) as > baseline, here are the differences: > > BASELINE > ======== > kretprobe : 9.937 ± 0.174M/s > kretprobe-multi: 10.440 ± 0.108M/s > > AFTER > ===== > kretprobe : 10.106 ± 0.120M/s (+1.7%) > kretprobe-multi: 10.515 ± 0.180M/s (+0.7%) nice, overall lgtm jirka > > Cc: Matt (Qiang) Wu <wuqiang.matt@bytedance.com> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > include/linux/objpool.h | 6 +++--- > lib/objpool.c | 12 ++++++------ > 2 files changed, 9 insertions(+), 9 deletions(-) > > diff --git a/include/linux/objpool.h b/include/linux/objpool.h > index d8b1f7b91128..cb1758eaa2d3 100644 > --- a/include/linux/objpool.h > +++ b/include/linux/objpool.h > @@ -73,7 +73,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); > * struct objpool_head - object pooling metadata > * @obj_size: object size, aligned to sizeof(void *) > * @nr_objs: total objs (to be pre-allocated with objpool) > - * @nr_cpus: local copy of nr_cpu_ids > + * @nr_possible_cpus: cached value of num_possible_cpus() > * @capacity: max objs can be managed by one objpool_slot > * @gfp: gfp flags for kmalloc & vmalloc > * @ref: refcount of objpool > @@ -85,7 +85,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); > struct objpool_head { > int obj_size; > int nr_objs; > - int nr_cpus; > + int nr_possible_cpus; > int capacity; > gfp_t gfp; > refcount_t ref; > @@ -176,7 +176,7 @@ static inline void *objpool_pop(struct objpool_head *pool) > raw_local_irq_save(flags); > > cpu = raw_smp_processor_id(); > - for (i = 0; i < num_possible_cpus(); i++) { > + for (i = 0; i < pool->nr_possible_cpus; i++) { > obj = __objpool_try_get_slot(pool, cpu); > if (obj) > break; > diff --git a/lib/objpool.c b/lib/objpool.c > index f696308fc026..234f9d0bd081 100644 > --- a/lib/objpool.c > +++ b/lib/objpool.c > @@ -50,7 +50,7 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, > { > int i, cpu_count = 0; > > - for (i = 0; i < pool->nr_cpus; i++) { > + for (i = 0; i < nr_cpu_ids; i++) { > > struct objpool_slot *slot; > int nodes, size, rc; > @@ -60,8 +60,8 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, > continue; > > /* compute how many objects to be allocated with this slot */ > - nodes = nr_objs / 
num_possible_cpus(); > - if (cpu_count < (nr_objs % num_possible_cpus())) > + nodes = nr_objs / pool->nr_possible_cpus; > + if (cpu_count < (nr_objs % pool->nr_possible_cpus)) > nodes++; > cpu_count++; > > @@ -103,7 +103,7 @@ static void objpool_fini_percpu_slots(struct objpool_head *pool) > if (!pool->cpu_slots) > return; > > - for (i = 0; i < pool->nr_cpus; i++) > + for (i = 0; i < nr_cpu_ids; i++) > kvfree(pool->cpu_slots[i]); > kfree(pool->cpu_slots); > } > @@ -130,13 +130,13 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, > > /* initialize objpool pool */ > memset(pool, 0, sizeof(struct objpool_head)); > - pool->nr_cpus = nr_cpu_ids; > + pool->nr_possible_cpus = num_possible_cpus(); > pool->obj_size = object_size; > pool->capacity = capacity; > pool->gfp = gfp & ~__GFP_ZERO; > pool->context = context; > pool->release = release; > - slot_size = pool->nr_cpus * sizeof(struct objpool_slot); > + slot_size = nr_cpu_ids * sizeof(struct objpool_slot); > pool->cpu_slots = kzalloc(slot_size, pool->gfp); > if (!pool->cpu_slots) > return -ENOMEM; > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 0/2] Objpool performance improvements 2024-04-24 21:52 [PATCH 0/2] Objpool performance improvements Andrii Nakryiko 2024-04-24 21:52 ` [PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations Andrii Nakryiko 2024-04-24 21:52 ` [PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids Andrii Nakryiko @ 2024-04-26 14:25 ` Masami Hiramatsu 2024-04-26 16:05 ` Andrii Nakryiko 2 siblings, 1 reply; 11+ messages in thread From: Masami Hiramatsu @ 2024-04-26 14:25 UTC (permalink / raw) To: Andrii Nakryiko; +Cc: linux-trace-kernel, rostedt, bpf, wuqiang.matt Hi Andrii, On Wed, 24 Apr 2024 14:52:12 -0700 Andrii Nakryiko <andrii@kernel.org> wrote: > Improve objpool (used heavily in kretprobe hot path) performance with two > improvements: > - inlining performance critical objpool_push()/objpool_pop() operations; > - avoiding re-calculating relatively expensive nr_possible_cpus(). Thanks for optimizing objpool. Both looks good to me. BTW, I don't intend to stop this short-term optimization attempts, but I would like to ask you check the new fgraph based fprobe (kretprobe-multi)[1] instead of objpool/rethook. [1] https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/ I'm considering to obsolete the kretprobe (and rethook) with fprobe and eventually remove it. Those have similar feature and we should choose safer one. Thank you, > > These opportunities were found when benchmarking and profiling kprobes and > kretprobes with BPF-based benchmarks. See individual patches for details and > results. > > Andrii Nakryiko (2): > objpool: enable inlining objpool_push() and objpool_pop() operations > objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids > > include/linux/objpool.h | 105 +++++++++++++++++++++++++++++++++++-- > lib/objpool.c | 112 +++------------------------------------- > 2 files changed, 107 insertions(+), 110 deletions(-) > > -- > 2.43.0 > -- Masami Hiramatsu (Google) <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 0/2] Objpool performance improvements 2024-04-26 14:25 ` [PATCH 0/2] Objpool performance improvements Masami Hiramatsu @ 2024-04-26 16:05 ` Andrii Nakryiko 0 siblings, 0 replies; 11+ messages in thread From: Andrii Nakryiko @ 2024-04-26 16:05 UTC (permalink / raw) To: Masami Hiramatsu Cc: Andrii Nakryiko, linux-trace-kernel, rostedt, bpf, wuqiang.matt On Fri, Apr 26, 2024 at 7:25 AM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > Hi Andrii, > > On Wed, 24 Apr 2024 14:52:12 -0700 > Andrii Nakryiko <andrii@kernel.org> wrote: > > > Improve objpool (used heavily in kretprobe hot path) performance with two > > improvements: > > - inlining performance critical objpool_push()/objpool_pop() operations; > > - avoiding re-calculating relatively expensive nr_possible_cpus(). > > Thanks for optimizing objpool. Both looks good to me. Great, thanks for applying. > > BTW, I don't intend to stop this short-term optimization attempts, > but I would like to ask you check the new fgraph based fprobe > (kretprobe-multi)[1] instead of objpool/rethook. You can see that I did :) There is tons of code and I'm not familiar with internals of function_graph infra, but you can see I did run benchmarks, so I'm paying attention. But as for the objpool itself, I think it's a performant and useful internal building block, and we might use it outside of rethook as well, so I think making it as fast as possible is good regardless. > > [1] https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/ > > I'm considering to obsolete the kretprobe (and rethook) with fprobe > and eventually remove it. Those have similar feature and we should > choose safer one. > Yep, I had a few more semi-ready patches, but I'll hold off for now given this move to function graph, plus some of the changes that Jiri is making in multi-kprobe code. I'll wait for things to settle down a bit before looking again. But just to give you some context, I'm an author of retsnoop tool, and one of the killer features of it is LBR capture in kretprobes, which is a tremendous help in investigating kernel failures, especially in unfamiliar code (LBR allows to "look back" and figure out "how did we get to this condition" after the fact). And so it's important to minimize the amount of wasted LBR records between some kernel function returns error (and thus is "an interesting event" and BPF program that captures LBR is triggered). Big part of that is ftrace/fprobe/rethook infra, so I was looking into making that part as "minimal" as possible, in the sense of eliminating as many function calls and jump as possible. This has an added benefit of making this hot path faster, but my main motivation is LBR. Anyways, just a bit of context for some of the other patches (like inlining arch_rethook_trampoline_callback). > Thank you, > > > > > These opportunities were found when benchmarking and profiling kprobes and > > kretprobes with BPF-based benchmarks. See individual patches for details and > > results. > > > > Andrii Nakryiko (2): > > objpool: enable inlining objpool_push() and objpool_pop() operations > > objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids > > > > include/linux/objpool.h | 105 +++++++++++++++++++++++++++++++++++-- > > lib/objpool.c | 112 +++------------------------------------- > > 2 files changed, 107 insertions(+), 110 deletions(-) > > > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org> ^ permalink raw reply [flat|nested] 11+ messages in thread