[PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.
@ 2023-03-30  9:27 Ziwei Dai
  2023-03-30  9:45 ` 答复: " 代子为 (Ziwei Dai)
  0 siblings, 1 reply; 3+ messages in thread
From: Ziwei Dai @ 2023-03-30  9:27 UTC (permalink / raw)
  To: paulmck, frederic, quic_neeraju, josh, rostedt, mathieu.desnoyers,
	jiangshanlai, joel, rcu
  Cc: linux-kernel, shuang.wang, yifan.xin, ke.wang, xuewen.yan,
	zhiguo.niu, ziwei.dai, zhaoyang.huang

From: 代子为 (Ziwei Dai) <ziwei.dai@ziwei-lenovo.spreadtrum.com>

In kfree_rcu_monitor(), new free business at krcp is attached to any free
channel at krwp. kfree_rcu_monitor() is responsible to make sure new free
business is handled after the rcu grace period. But if there is any none-free
channel at krwp already, that means there is an on-going rcu work,
which will cause the kvfree_call_rcu()-triggered free business is done
before the wanted rcu grace period ends.

This commit ignores krwp which has non-free channel at kfree_rcu_monitor(),
to fix the issue that kvfree_call_rcu() loses effectiveness.

Below is the css_set obj "from_cset" use-after-free issue caused by
kvfree_call_rcu() losing effectiveness.
Core 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes.
Core 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp.
Core 2 frees "from_cset" after current gp end. "from_cset" is reallocated.
Core 0 references "from_cset"'s member, which causes crash.

Core 0					Core 1				       	Core 2
count_memcg_event_mm()
|rcu_read_lock()  <---
|mem_cgroup_from_task()
 |// <css_set ptr> is the "from_cset" mentioned on core 1
 |<css_set ptr> = rcu_dereference((task)->cgroups)
 |// Hard irq comes, current task is scheduled out.

			Core 1:
			cgroup_attach_task()
			|cgroup_migrate()
			 |cgroup_migrate_execute()
			  |css_set_move_task(task, from_cset, to_cset, true)
			  |cgroup_move_task(task, to_cset)
			   |rcu_assign_pointer(.., to_cset)
			   |...
			|cgroup_migrate_finish()
			 |put_css_set_locked(from_cset)
			  |from_cset->refcount return 0
			  |kfree_rcu(cset, rcu_head) <--- means to free from_cset after new gp
			   |add_ptr_to_bulk_krc_lock()
			   |schedule_delayed_work(&krcp->monitor_work, ..)

			kfree_rcu_monitor()
			|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
			|queue_rcu_work(system_wq, &krwp->rcu_work)
			 |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
			 |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp

								// There is a perious call_rcu(.., rcu_work_rcufn)
								// gp end, rcu_work_rcufn() is called.
								rcu_work_rcufn()
								|__queue_work(.., rwork->wq, &rwork->work);
								Core 2：
								// or there is a pending kfree_rcu_work() work called.
								|kfree_rcu_work()
								|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
								|The "from_cset" mentioned on core 1 is freed before new gp end.
Core 0:
// the task is schedule in after many ms.
 |<css_set ptr>->subsys[(subsys_id) <--- caused kernel crash, because <css_set ptr>="from_cset" is freed.

Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>

:#	modified:   tree.c
---
 kernel/rcu/tree.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8e880c0..f6451a8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct work_struct *work)
 	for (i = 0; i < KFREE_N_BATCHES; i++) {
 		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
 
-		// Try to detach bulk_head or head and attach it over any
-		// available corresponding free channel. It can be that
-		// a previous RCU batch is in progress, it means that
-		// immediately to queue another one is not possible so
-		// in that case the monitor work is rearmed.
-		if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
-			(!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
-				(READ_ONCE(krcp->head) && !krwp->head_free)) {
-
+		// Try to detach bulk_head or head and attach it, only when
+		// all channels are free.  Any channel is not free means at krwp
+		// there is on-going rcu work to handle krwp's free business.
+		if (!list_empty(&krwp->bulk_head_free[0]) ||
+			!list_empty(&krwp->bulk_head_free[1]) ||
+				krwp->head_free)
+			continue;
+		if (!list_empty(&krcp->bulk_head[0]) ||
+			!list_empty(&krcp->bulk_head[1]) ||
+			READ_ONCE(krcp->head)) {
 			// Channel 1 corresponds to the SLAB-pointer bulk path.
 			// Channel 2 corresponds to vmalloc-pointer bulk path.
 			for (j = 0; j < FREE_N_CHANNELS; j++) {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* 答复: [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.
  2023-03-30  9:27 [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period Ziwei Dai
@ 2023-03-30  9:45 ` 代子为 (Ziwei Dai)
  2023-03-30 12:07   ` Uladzislau Rezki
  0 siblings, 1 reply; 3+ messages in thread
From: 代子为 (Ziwei Dai) @ 2023-03-30  9:45 UTC (permalink / raw)
  To: paulmck@kernel.org, frederic@kernel.org, quic_neeraju@quicinc.com,
	josh@joshtriplett.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, jiangshanlai@gmail.com,
	joel@joelfernandes.org, rcu@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, 王双 (Shuang Wang),
	辛依凡 (Yifan Xin), 王科 (Ke Wang),
	闫学文 (Xuewen Yan),
	牛志国 (Zhiguo Niu),
	黄朝阳 (Zhaoyang Huang)

Hi Uladzislau and all,

Sorry for the disclaimer in the original mail.
Please help comment in this new thread.

We found this issue at K5.15. We try to fix this issue on K5.15.
It seems mainline also has this issue.

Below is the first debug patch on k5.15 device, which is under stress test, issue not reproduce so far.
============================================================
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 66951e130c2fc..44759641f7234 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3342,15 +3342,21 @@ static void kfree_rcu_monitor(struct work_struct *work)
        // Attempt to start a new batch.
        for (i = 0; i < KFREE_N_BATCHES; i++) {
                struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
+               bool rcu_work_pending;

                // Try to detach bkvhead or head and attach it over any
                // available corresponding free channel. It can be that
                // a previous RCU batch is in progress, it means that
                // immediately to queue another one is not possible so
                // in that case the monitor work is rearmed.
-               if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
-                       (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
-                               (krcp->head && !krwp->head_free)) {
+               rcu_work_pending = test_bit(
+                       WORK_STRUCT_PENDING_BIT,
+                       work_data_bits(&krwp->rcu_work.work));
+               // If there is on-going rcu work, continue.
+               if (rcu_work_pending || krwp->bkvhead_free[0] ||
+                       krwp->bkvhead_free[1] || krwp->head_free)
+                       continue;
+               if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
                        // Channel 1 corresponds to the SLAB-pointer bulk path.
                        // Channel 2 corresponds to vmalloc-pointer bulk path.
                        for (j = 0; j < FREE_N_CHANNELS; j++) {

As " rcu_work_pending" judgement seems redundant, I made the second patch below on k5.15. We will make stress test.
============================================================
Below is the first debug patch on k5.15 device, which is under stress test, issue not reproduce so far.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 66951e130c2fc..f219c60a8ec30 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3343,14 +3343,13 @@ static void kfree_rcu_monitor(struct work_struct *work)
        for (i = 0; i < KFREE_N_BATCHES; i++) {
                struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);

-               // Try to detach bkvhead or head and attach it over any
-               // available corresponding free channel. It can be that
-               // a previous RCU batch is in progress, it means that
-               // immediately to queue another one is not possible so
-               // in that case the monitor work is rearmed.
-               if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
-                       (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
-                               (krcp->head && !krwp->head_free)) {
+               // Try to detach bulk_head or head and attach it, only when
+               // all channels are free.  Any channel is not free means at krwp
+               // there is on-going rcu work to handle krwp's free business.
+               if (krwp->bkvhead_free[0] || krwp->bkvhead_free[1] ||
+                       krwp->head_free)
+                       continue;
+               if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
                        // Channel 1 corresponds to the SLAB-pointer bulk path.
                        // Channel 2 corresponds to vmalloc-pointer bulk path.
                        for (j = 0; j < FREE_N_CHANNELS; j++) {


> -----邮件原件-----
> 发件人: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>
> 发送时间: 2023年3月30日 17:27
> 收件人: paulmck@kernel.org; frederic@kernel.org;
> quic_neeraju@quicinc.com; josh@joshtriplett.org; rostedt@goodmis.org;
> mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com;
> joel@joelfernandes.org; rcu@vger.kernel.org
> 抄送: linux-kernel@vger.kernel.org; 王双 (Shuang Wang)
> <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) <Yifan.Xin@unisoc.com>;
> 王科 (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 (Xuewen Yan)
> <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) <Zhiguo.Niu@unisoc.com>;
> 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>; 黄朝阳 (Zhaoyang Huang)
> <zhaoyang.huang@unisoc.com>
> 主题: [PATCH] rcu: Make sure new krcp free business is handled after the
> wanted rcu grace period.
> 
> From: 代子为 (Ziwei Dai) <ziwei.dai@ziwei-lenovo.spreadtrum.com>
> 
> In kfree_rcu_monitor(), new free business at krcp is attached to any free
> channel at krwp. kfree_rcu_monitor() is responsible to make sure new free
> business is handled after the rcu grace period. But if there is any none-free
> channel at krwp already, that means there is an on-going rcu work, which will
> cause the kvfree_call_rcu()-triggered free business is done before the wanted
> rcu grace period ends.
> 
> This commit ignores krwp which has non-free channel at kfree_rcu_monitor(),
> to fix the issue that kvfree_call_rcu() loses effectiveness.
> 
> Below is the css_set obj "from_cset" use-after-free issue caused by
> kvfree_call_rcu() losing effectiveness.
> Core 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes.
> Core 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new
> gp.
> Core 2 frees "from_cset" after current gp end. "from_cset" is reallocated.
> Core 0 references "from_cset"'s member, which causes crash.
> 
> Core 0					Core 1				       	Core 2
> count_memcg_event_mm()
> |rcu_read_lock()  <---
> |mem_cgroup_from_task()
>  |// <css_set ptr> is the "from_cset" mentioned on core 1  |<css_set ptr> =
> rcu_dereference((task)->cgroups)  |// Hard irq comes, current task is
> scheduled out.
> 
> 			Core 1:
> 			cgroup_attach_task()
> 			|cgroup_migrate()
> 			 |cgroup_migrate_execute()
> 			  |css_set_move_task(task, from_cset, to_cset, true)
> 			  |cgroup_move_task(task, to_cset)
> 			   |rcu_assign_pointer(.., to_cset)
> 			   |...
> 			|cgroup_migrate_finish()
> 			 |put_css_set_locked(from_cset)
> 			  |from_cset->refcount return 0
> 			  |kfree_rcu(cset, rcu_head) <--- means to free from_cset
> after new gp
> 			   |add_ptr_to_bulk_krc_lock()
> 			   |schedule_delayed_work(&krcp->monitor_work, ..)
> 
> 			kfree_rcu_monitor()
> 			|krcp->bulk_head[0]'s work attached to
> krwp->bulk_head_free[]
> 			|queue_rcu_work(system_wq, &krwp->rcu_work)
> 			 |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT
> state,
> 			 |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp
> 
> 								// There is a perious call_rcu(..,
> rcu_work_rcufn)
> 								// gp end, rcu_work_rcufn() is called.
> 								rcu_work_rcufn()
> 								|__queue_work(.., rwork->wq,
> &rwork->work);
> 								Core 2：
> 								// or there is a pending
> kfree_rcu_work() work called.
> 								|kfree_rcu_work()
> 								|krwp->bulk_head_free[0] bulk is
> freed before new gp end!!!
> 								|The "from_cset" mentioned on core
> 1 is freed before new gp end.
> Core 0:
> // the task is schedule in after many ms.
>  |<css_set ptr>->subsys[(subsys_id) <--- caused kernel crash, because
> <css_set ptr>="from_cset" is freed.
> 
> Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
> 
> :#	modified:   tree.c
> ---
>  kernel/rcu/tree.c | 19 ++++++++++---------
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8e880c0..f6451a8
> 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct
> work_struct *work)
>  	for (i = 0; i < KFREE_N_BATCHES; i++) {
>  		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> 
> -		// Try to detach bulk_head or head and attach it over any
> -		// available corresponding free channel. It can be that
> -		// a previous RCU batch is in progress, it means that
> -		// immediately to queue another one is not possible so
> -		// in that case the monitor work is rearmed.
> -		if ((!list_empty(&krcp->bulk_head[0]) &&
> list_empty(&krwp->bulk_head_free[0])) ||
> -			(!list_empty(&krcp->bulk_head[1]) &&
> list_empty(&krwp->bulk_head_free[1])) ||
> -				(READ_ONCE(krcp->head) && !krwp->head_free)) {
> -
> +		// Try to detach bulk_head or head and attach it, only when
> +		// all channels are free.  Any channel is not free means at krwp
> +		// there is on-going rcu work to handle krwp's free business.
> +		if (!list_empty(&krwp->bulk_head_free[0]) ||
> +			!list_empty(&krwp->bulk_head_free[1]) ||
> +				krwp->head_free)
> +			continue;
> +		if (!list_empty(&krcp->bulk_head[0]) ||
> +			!list_empty(&krcp->bulk_head[1]) ||
> +			READ_ONCE(krcp->head)) {
>  			// Channel 1 corresponds to the SLAB-pointer bulk path.
>  			// Channel 2 corresponds to vmalloc-pointer bulk path.
>  			for (j = 0; j < FREE_N_CHANNELS; j++) {
> --
> 1.9.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: 答复: [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.
  2023-03-30  9:45 ` 答复: " 代子为 (Ziwei Dai)
@ 2023-03-30 12:07   ` Uladzislau Rezki
  0 siblings, 0 replies; 3+ messages in thread
From: Uladzislau Rezki @ 2023-03-30 12:07 UTC (permalink / raw)
  To: 代子为 (Ziwei Dai)
  Cc: paulmck@kernel.org, frederic@kernel.org, quic_neeraju@quicinc.com,
	josh@joshtriplett.org, rostedt@goodmis.org,
	mathieu.desnoyers@efficios.com, jiangshanlai@gmail.com,
	joel@joelfernandes.org, rcu@vger.kernel.org,
	linux-kernel@vger.kernel.org, 王双 (Shuang Wang),
	辛依凡 (Yifan Xin), 王科 (Ke Wang),
	闫学文 (Xuewen Yan),
	牛志国 (Zhiguo Niu),
	黄朝阳 (Zhaoyang Huang)

Hello, Ziwei Dai!

> Hi Uladzislau and all,
> 
> Sorry for the disclaimer in the original mail.
> Please help comment in this new thread.
> 
> We found this issue at K5.15. We try to fix this issue on K5.15.
> It seems mainline also has this issue.
> 
> Below is the first debug patch on k5.15 device, which is under stress test, issue not reproduce so far.
> ============================================================
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 66951e130c2fc..44759641f7234 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3342,15 +3342,21 @@ static void kfree_rcu_monitor(struct work_struct *work)
>         // Attempt to start a new batch.
>         for (i = 0; i < KFREE_N_BATCHES; i++) {
>                 struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> +               bool rcu_work_pending;
> 
>                 // Try to detach bkvhead or head and attach it over any
>                 // available corresponding free channel. It can be that
>                 // a previous RCU batch is in progress, it means that
>                 // immediately to queue another one is not possible so
>                 // in that case the monitor work is rearmed.
> -               if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
> -                       (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
> -                               (krcp->head && !krwp->head_free)) {
> +               rcu_work_pending = test_bit(
> +                       WORK_STRUCT_PENDING_BIT,
> +                       work_data_bits(&krwp->rcu_work.work));
> +               // If there is on-going rcu work, continue.
> +               if (rcu_work_pending || krwp->bkvhead_free[0] ||
> +                       krwp->bkvhead_free[1] || krwp->head_free)
> +                       continue;
> +               if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
>                         // Channel 1 corresponds to the SLAB-pointer bulk path.
>                         // Channel 2 corresponds to vmalloc-pointer bulk path.
>                         for (j = 0; j < FREE_N_CHANNELS; j++) {
> 
> As " rcu_work_pending" judgement seems redundant, I made the second patch below on k5.15. We will make stress test.
> ============================================================
> Below is the first debug patch on k5.15 device, which is under stress test, issue not reproduce so far.
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 66951e130c2fc..f219c60a8ec30 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3343,14 +3343,13 @@ static void kfree_rcu_monitor(struct work_struct *work)
>         for (i = 0; i < KFREE_N_BATCHES; i++) {
>                 struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> 
> -               // Try to detach bkvhead or head and attach it over any
> -               // available corresponding free channel. It can be that
> -               // a previous RCU batch is in progress, it means that
> -               // immediately to queue another one is not possible so
> -               // in that case the monitor work is rearmed.
> -               if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
> -                       (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
> -                               (krcp->head && !krwp->head_free)) {
> +               // Try to detach bulk_head or head and attach it, only when
> +               // all channels are free.  Any channel is not free means at krwp
> +               // there is on-going rcu work to handle krwp's free business.
> +               if (krwp->bkvhead_free[0] || krwp->bkvhead_free[1] ||
> +                       krwp->head_free)
> +                       continue;
> +               if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
>                         // Channel 1 corresponds to the SLAB-pointer bulk path.
>                         // Channel 2 corresponds to vmalloc-pointer bulk path.
>                         for (j = 0; j < FREE_N_CHANNELS; j++) {
> 
> 
> > -----邮件原件-----
> > 发件人: 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>
> > 发送时间: 2023年3月30日 17:27
> > 收件人: paulmck@kernel.org; frederic@kernel.org;
> > quic_neeraju@quicinc.com; josh@joshtriplett.org; rostedt@goodmis.org;
> > mathieu.desnoyers@efficios.com; jiangshanlai@gmail.com;
> > joel@joelfernandes.org; rcu@vger.kernel.org
> > 抄送: linux-kernel@vger.kernel.org; 王双 (Shuang Wang)
> > <shuang.wang@unisoc.com>; 辛依凡 (Yifan Xin) <Yifan.Xin@unisoc.com>;
> > 王科 (Ke Wang) <Ke.Wang@unisoc.com>; 闫学文 (Xuewen Yan)
> > <Xuewen.Yan@unisoc.com>; 牛志国 (Zhiguo Niu) <Zhiguo.Niu@unisoc.com>;
> > 代子为 (Ziwei Dai) <Ziwei.Dai@unisoc.com>; 黄朝阳 (Zhaoyang Huang)
> > <zhaoyang.huang@unisoc.com>
> > 主题: [PATCH] rcu: Make sure new krcp free business is handled after the
> > wanted rcu grace period.
> > 
> > From: 代子为 (Ziwei Dai) <ziwei.dai@ziwei-lenovo.spreadtrum.com>
> > 
> > In kfree_rcu_monitor(), new free business at krcp is attached to any free
> > channel at krwp. kfree_rcu_monitor() is responsible to make sure new free
> > business is handled after the rcu grace period. But if there is any none-free
> > channel at krwp already, that means there is an on-going rcu work, which will
> > cause the kvfree_call_rcu()-triggered free business is done before the wanted
> > rcu grace period ends.
> > 
> > This commit ignores krwp which has non-free channel at kfree_rcu_monitor(),
> > to fix the issue that kvfree_call_rcu() loses effectiveness.
> > 
> > Below is the css_set obj "from_cset" use-after-free issue caused by
> > kvfree_call_rcu() losing effectiveness.
> > Core 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes.
> > Core 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new
> > gp.
> > Core 2 frees "from_cset" after current gp end. "from_cset" is reallocated.
> > Core 0 references "from_cset"'s member, which causes crash.
> > 
> > Core 0					Core 1				       	Core 2
> > count_memcg_event_mm()
> > |rcu_read_lock()  <---
> > |mem_cgroup_from_task()
> >  |// <css_set ptr> is the "from_cset" mentioned on core 1  |<css_set ptr> =
> > rcu_dereference((task)->cgroups)  |// Hard irq comes, current task is
> > scheduled out.
> > 
> > 			Core 1:
> > 			cgroup_attach_task()
> > 			|cgroup_migrate()
> > 			 |cgroup_migrate_execute()
> > 			  |css_set_move_task(task, from_cset, to_cset, true)
> > 			  |cgroup_move_task(task, to_cset)
> > 			   |rcu_assign_pointer(.., to_cset)
> > 			   |...
> > 			|cgroup_migrate_finish()
> > 			 |put_css_set_locked(from_cset)
> > 			  |from_cset->refcount return 0
> > 			  |kfree_rcu(cset, rcu_head) <--- means to free from_cset
> > after new gp
> > 			   |add_ptr_to_bulk_krc_lock()
> > 			   |schedule_delayed_work(&krcp->monitor_work, ..)
> > 
> > 			kfree_rcu_monitor()
> > 			|krcp->bulk_head[0]'s work attached to
> > krwp->bulk_head_free[]
> > 			|queue_rcu_work(system_wq, &krwp->rcu_work)
> > 			 |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT
> > state,
> > 			 |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp
> > 
> > 								// There is a perious call_rcu(..,
> > rcu_work_rcufn)
> > 								// gp end, rcu_work_rcufn() is called.
> > 								rcu_work_rcufn()
> > 								|__queue_work(.., rwork->wq,
> > &rwork->work);
> > 								Core 2：
> > 								// or there is a pending
> > kfree_rcu_work() work called.
> > 								|kfree_rcu_work()
> > 								|krwp->bulk_head_free[0] bulk is
> > freed before new gp end!!!
> > 								|The "from_cset" mentioned on core
> > 1 is freed before new gp end.
> > Core 0:
> > // the task is schedule in after many ms.
> >  |<css_set ptr>->subsys[(subsys_id) <--- caused kernel crash, because
> > <css_set ptr>="from_cset" is freed.
> > 
> > Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
> > 
> > :#	modified:   tree.c
> > ---
> >  kernel/rcu/tree.c | 19 ++++++++++---------
> >  1 file changed, 10 insertions(+), 9 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8e880c0..f6451a8
> > 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct
> > work_struct *work)
> >  	for (i = 0; i < KFREE_N_BATCHES; i++) {
> >  		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> > 
> > -		// Try to detach bulk_head or head and attach it over any
> > -		// available corresponding free channel. It can be that
> > -		// a previous RCU batch is in progress, it means that
> > -		// immediately to queue another one is not possible so
> > -		// in that case the monitor work is rearmed.
> > -		if ((!list_empty(&krcp->bulk_head[0]) &&
> > list_empty(&krwp->bulk_head_free[0])) ||
> > -			(!list_empty(&krcp->bulk_head[1]) &&
> > list_empty(&krwp->bulk_head_free[1])) ||
> > -				(READ_ONCE(krcp->head) && !krwp->head_free)) {
> > -
> > +		// Try to detach bulk_head or head and attach it, only when
> > +		// all channels are free.  Any channel is not free means at krwp
> > +		// there is on-going rcu work to handle krwp's free business.
> > +		if (!list_empty(&krwp->bulk_head_free[0]) ||
> > +			!list_empty(&krwp->bulk_head_free[1]) ||
> > +				krwp->head_free)
> > +			continue;
Can we replace it with a new helper, for example: 

+static bool
+can_krcw_offload(struct kfree_rcu_cpu_work *krwp)
+{
+       int i;
+
+       for (i = 0; i < FREE_N_CHANNELS; i++)
+               if (!list_empty(&krwp->bulk_head_free[i]))
+                       return false;
+
+       return (READ_ONCE(krwp->head_free) == NULL);
+}

and then

+               if (!can_krcw_offload(krwp))
                        continue;

>
> > +		if (!list_empty(&krcp->bulk_head[0]) ||
> > +			!list_empty(&krcp->bulk_head[1]) ||
> > +			READ_ONCE(krcp->head)) {
>
Can we replace it with the:

+               // kvfree_rcu_drain_ready() might handle this krcp, if so give up.
+               if (need_offload_krc(krcp)) {

?
because we have a helper that is need_offload_krcp() that does the same.

> >  			// Channel 1 corresponds to the SLAB-pointer bulk path.
> >  			// Channel 2 corresponds to vmalloc-pointer bulk path.
> >  			for (j = 0; j < FREE_N_CHANNELS; j++) {
> > --
> > 1.9.1
> 

Also we need to add Fixes tag, you say you are on 5.15 kernel. So
something was added there. I will try to double check.

And thank you for fixing it!!!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-03-30 12:08 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-03-30  9:27 [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period Ziwei Dai
2023-03-30  9:45 ` 答复: " 代子为 (Ziwei Dai)
2023-03-30 12:07   ` Uladzislau Rezki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox