From: Waiman Long <longman@redhat.com>
To: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
cgroups@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
Date: Thu, 23 Dec 2021 16:38:35 -0500
Message-ID: <4fe30c89-df34-bbdb-a9a1-5519e0363cc5@redhat.com>
In-Reply-To: <20211222114111.2206248-3-bigeasy@linutronix.de>
On 12/22/21 06:41, Sebastian Andrzej Siewior wrote:
> The members of the per-CPU structure memcg_stock_pcp are protected
> either by disabling interrupts or by disabling preemption if the
> invocation occurred in process context.
> Disabling interrupts protects most of the structure excluding task_obj,
> while disabling preemption protects only task_obj.
> This scheme is incompatible with PREEMPT_RT because it creates an atomic
> context in which actions that require a preemptible context are
> performed. One example is obj_cgroup_release().
>
> The IRQ-disable and preempt-disable sections can be replaced with a
> local_lock_t, which preserves the explicit disabling of interrupts while
> keeping the code preemptible on PREEMPT_RT.
>
> The task_obj has been added for performance reasons on non-preemptible
> kernels where preempt_disable() is a NOP. On the PREEMPT_RT preemption
> model preempt_disable() is always implemented. Also, there are no memory
> allocations in in_irq() context and softirqs are processed in (preemptible)
> process context. Therefore it makes sense to avoid using task_obj.
>
> Don't use task_obj on PREEMPT_RT and replace manual disabling of
> interrupts with a local_lock_t. This change requires some factoring:
>
> - drain_obj_stock() drops a reference on obj_cgroup which leads to an
> invocation of obj_cgroup_release() if it is the last object. This in
> turn leads to recursive locking of the local_lock_t. To avoid this,
> obj_cgroup_release() is invoked outside of the locked section.
>
> - drain_obj_stock() gets a memcg_stock_pcp passed if the stock_lock has been
> acquired (instead of the task_obj_lock) to avoid recursive locking later
> in refill_stock().
>
> - drain_all_stock() disables preemption via get_cpu() and then invokes
> drain_local_stock() if it is the local CPU to avoid scheduling a worker
> (which invokes the same function). Disabling preemption here is
> problematic due to the sleeping locks in drain_local_stock().
> This can be avoided by always scheduling a worker, even for the local
> CPU. Using cpus_read_lock() stabilizes cpu_online_mask which ensures
> that no worker is scheduled for an offline CPU. Since there is no
> flush_work(), it is still possible that a worker is invoked on the wrong
> CPU, but that is okay since it always operates on the local-CPU data.
>
> - drain_local_stock() is always invoked as a worker, so it can be optimized
> by removing in_task() (it is always true) and avoiding the "irq_save"
> variant because interrupts are always enabled here. Operating on
> task_obj first allows acquiring the local_lock_t without lockdep
> complaints.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> mm/memcontrol.c | 171 +++++++++++++++++++++++++++++++-----------------
> 1 file changed, 112 insertions(+), 59 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d2687d5ed544b..1e76f26be2c15 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
> return cgroup_memory_nokmem;
> }
>
> +struct memcg_stock_pcp;
> static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
> - unsigned int nr_pages);
> + unsigned int nr_pages,
> + struct memcg_stock_pcp *stock_pcp);
AFAICS, stock_pcp is set to indicate that the stock_lock has been
acquired. Since stock_pcp, if set, should be the same as
this_cpu_ptr(&memcg_stock), it is a bit confusing to pass it to a
function that also does a per-cpu access to memcg_stock. Why not just
pass a boolean, say stock_locked, to indicate this instead? That would
be clearer and less confusing.
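For example, something along these lines (completely untested, just to
illustrate the idea; the stock_locked name is my own):

static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
				      unsigned int nr_pages,
				      bool stock_locked);

static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
			 bool stock_locked)
{
	unsigned long flags;

	if (stock_locked) {
		/* Caller already holds memcg_stock.stock_lock. */
		__refill_stock(memcg, nr_pages, this_cpu_ptr(&memcg_stock));
		return;
	}

	local_lock_irqsave(&memcg_stock.stock_lock, flags);
	__refill_stock(memcg, nr_pages, this_cpu_ptr(&memcg_stock));
	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}

The callers would then pass true/false instead of a memcg_stock_pcp
pointer or NULL.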
>
> static void obj_cgroup_release(struct percpu_ref *ref)
> {
> @@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> nr_pages = nr_bytes >> PAGE_SHIFT;
>
> if (nr_pages)
> - obj_cgroup_uncharge_pages(objcg, nr_pages);
> + obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
>
> spin_lock_irqsave(&css_set_lock, flags);
> list_del(&objcg->list);
> @@ -2120,26 +2122,40 @@ struct obj_stock {
> };
>
> struct memcg_stock_pcp {
> + /* Protects memcg_stock_pcp */
> + local_lock_t stock_lock;
> struct mem_cgroup *cached; /* this never be root cgroup */
> unsigned int nr_pages;
> +#ifndef CONFIG_PREEMPT_RT
> + /* Protects only task_obj */
> + local_lock_t task_obj_lock;
> struct obj_stock task_obj;
> +#endif
> struct obj_stock irq_obj;
>
> struct work_struct work;
> unsigned long flags;
> #define FLUSHING_CACHED_CHARGE 0
> };
> -static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> +static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
> + .stock_lock = INIT_LOCAL_LOCK(stock_lock),
> +#ifndef CONFIG_PREEMPT_RT
> + .task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
> +#endif
> +};
> static DEFINE_MUTEX(percpu_charge_mutex);
>
> #ifdef CONFIG_MEMCG_KMEM
> -static void drain_obj_stock(struct obj_stock *stock);
> +static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> + struct memcg_stock_pcp *stock_pcp);
> static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> struct mem_cgroup *root_memcg);
>
> #else
> -static inline void drain_obj_stock(struct obj_stock *stock)
> +static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
> + struct memcg_stock_pcp *stock_pcp)
> {
> + return NULL;
> }
> static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> struct mem_cgroup *root_memcg)
> @@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> if (nr_pages > MEMCG_CHARGE_BATCH)
> return ret;
>
> - local_irq_save(flags);
> + local_lock_irqsave(&memcg_stock.stock_lock, flags);
>
> stock = this_cpu_ptr(&memcg_stock);
> if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
> @@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> ret = true;
> }
>
> - local_irq_restore(flags);
> + local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
>
> return ret;
> }
> @@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>
> static void drain_local_stock(struct work_struct *dummy)
> {
> - struct memcg_stock_pcp *stock;
> - unsigned long flags;
> + struct memcg_stock_pcp *stock_pcp;
> + struct obj_cgroup *old;
>
> /*
> * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
> * drain_stock races is that we always operate on local CPU stock
> * here with IRQ disabled
> */
> - local_irq_save(flags);
> +#ifndef CONFIG_PREEMPT_RT
> + local_lock(&memcg_stock.task_obj_lock);
> + old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
> + local_unlock(&memcg_stock.task_obj_lock);
> + if (old)
> + obj_cgroup_put(old);
> +#endif
>
> - stock = this_cpu_ptr(&memcg_stock);
> - drain_obj_stock(&stock->irq_obj);
> - if (in_task())
> - drain_obj_stock(&stock->task_obj);
> - drain_stock(stock);
> - clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> + local_lock_irq(&memcg_stock.stock_lock);
> + stock_pcp = this_cpu_ptr(&memcg_stock);
> + old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
>
> - local_irq_restore(flags);
> + drain_stock(stock_pcp);
> + clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
> +
> + local_unlock_irq(&memcg_stock.stock_lock);
> + if (old)
> + obj_cgroup_put(old);
> }
>
> /*
> * Cache charges(val) to local per_cpu area.
> * This will be consumed by consume_stock() function, later.
> */
> -static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> +static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> + struct memcg_stock_pcp *stock)
> {
> - struct memcg_stock_pcp *stock;
> - unsigned long flags;
> + lockdep_assert_held(&stock->stock_lock);
>
> - local_irq_save(flags);
> -
> - stock = this_cpu_ptr(&memcg_stock);
> if (stock->cached != memcg) { /* reset if necessary */
> drain_stock(stock);
> css_get(&memcg->css);
> @@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>
> if (stock->nr_pages > MEMCG_CHARGE_BATCH)
> drain_stock(stock);
> +}
>
> - local_irq_restore(flags);
> +static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
> + struct memcg_stock_pcp *stock_pcp)
> +{
> + unsigned long flags;
> +
> + if (stock_pcp) {
> + __refill_stock(memcg, nr_pages, stock_pcp);
> + return;
> + }
> + local_lock_irqsave(&memcg_stock.stock_lock, flags);
> + __refill_stock(memcg, nr_pages, this_cpu_ptr(&memcg_stock));
> + local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
> }
>
> /*
> @@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
> */
> static void drain_all_stock(struct mem_cgroup *root_memcg)
> {
> - int cpu, curcpu;
> + int cpu;
>
> /* If someone's already draining, avoid adding running more workers. */
> if (!mutex_trylock(&percpu_charge_mutex))
> @@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> * as well as workers from this path always operate on the local
> * per-cpu data. CPU up doesn't touch memcg_stock at all.
> */
> - curcpu = get_cpu();
> + cpus_read_lock();
> for_each_online_cpu(cpu) {
> struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> struct mem_cgroup *memcg;
> @@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> rcu_read_unlock();
>
> if (flush &&
> - !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> - if (cpu == curcpu)
> - drain_local_stock(&stock->work);
> - else
> - schedule_work_on(cpu, &stock->work);
> - }
> + !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> + schedule_work_on(cpu, &stock->work);
> }
> - put_cpu();
> + cpus_read_unlock();
> mutex_unlock(&percpu_charge_mutex);
> }
>
> @@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
> done_restock:
> if (batch > nr_pages)
> - refill_stock(memcg, batch - nr_pages);
> + refill_stock(memcg, batch - nr_pages, NULL);
>
> /*
> * If the hierarchy is above the normal consumption range, schedule
> @@ -2803,28 +2832,35 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
> * can only be accessed after disabling interrupt. User context code can
> * access interrupt object stock, but not vice versa.
> */
> -static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
> +static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
> + struct memcg_stock_pcp **stock_pcp)
> {
> struct memcg_stock_pcp *stock;
>
> +#ifndef CONFIG_PREEMPT_RT
> if (likely(in_task())) {
> *pflags = 0UL;
> - preempt_disable();
> + *stock_pcp = NULL;
> + local_lock(&memcg_stock.task_obj_lock);
> stock = this_cpu_ptr(&memcg_stock);
> return &stock->task_obj;
> }
> -
> - local_irq_save(*pflags);
> +#endif
> + local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
> stock = this_cpu_ptr(&memcg_stock);
> + *stock_pcp = stock;
> return &stock->irq_obj;
> }
>
> -static inline void put_obj_stock(unsigned long flags)
> +static inline void put_obj_stock(unsigned long flags,
> + struct memcg_stock_pcp *stock_pcp)
> {
> - if (likely(in_task()))
> - preempt_enable();
> +#ifndef CONFIG_PREEMPT_RT
> + if (likely(!stock_pcp))
> + local_unlock(&memcg_stock.task_obj_lock);
> else
> - local_irq_restore(flags);
If you skip the "else" and add a "return", the following
local_unlock_irqrestore() will have a more natural indentation (see the
untested sketch after the quoted hunk below). Also, we may not really
need the additional stock_pcp argument here.
> +#endif
> + local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
> }
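IOW, something like this (untested, just to show the shape I mean):

static inline void put_obj_stock(unsigned long flags,
				 struct memcg_stock_pcp *stock_pcp)
{
#ifndef CONFIG_PREEMPT_RT
	if (likely(!stock_pcp)) {
		/* Only task_obj_lock was taken. */
		local_unlock(&memcg_stock.task_obj_lock);
		return;
	}
#endif
	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}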
Cheers,
Longman
Thread overview: 24+ messages
2021-12-22 11:41 [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT Sebastian Andrzej Siewior
2021-12-23 2:31 ` Waiman Long
2021-12-23 7:34 ` Sebastian Andrzej Siewior
2021-12-23 16:01 ` Waiman Long
2022-01-05 14:16 ` Michal Koutný
2022-01-13 13:08 ` Sebastian Andrzej Siewior
2022-01-13 14:48 ` Michal Koutný
2022-01-14 9:09 ` Sebastian Andrzej Siewior
2022-01-18 18:26 ` [PATCH] mm/memcg: Do not check v1 event counter when not needed Michal Koutný
2022-01-18 19:57 ` Sebastian Andrzej Siewior
2021-12-22 11:41 ` [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object Sebastian Andrzej Siewior
2021-12-23 21:38 ` Waiman Long [this message]
2022-01-03 16:34 ` Sebastian Andrzej Siewior
2022-01-03 17:09 ` Waiman Long
2021-12-22 11:41 ` [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels Sebastian Andrzej Siewior
2021-12-23 21:48 ` Waiman Long
2022-01-03 14:44 ` Sebastian Andrzej Siewior
2022-01-03 15:04 ` Waiman Long
2022-01-05 20:22 ` Sebastian Andrzej Siewior
2022-01-06 3:28 ` Waiman Long
2022-01-13 15:26 ` Sebastian Andrzej Siewior
2022-01-05 14:59 ` [RFC PATCH 0/3] mm/memcg: Address PREEMPT_RT problems instead of disabling it Michal Koutný
2022-01-05 15:06 ` Sebastian Andrzej Siewior