From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
patches@lists.linux.dev, Mukesh Ojha <quic_mojha@quicinc.com>,
Ziwei Dai <ziwei.dai@unisoc.com>,
"Uladzislau Rezki (Sony)" <urezki@gmail.com>,
"Paul E. McKenney" <paulmck@kernel.org>
Subject: [PATCH 6.2 03/15] rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
Date: Fri, 28 Apr 2023 13:27:47 +0200 [thread overview]
Message-ID: <20230428112040.246245037@linuxfoundation.org> (raw)
In-Reply-To: <20230428112040.137898986@linuxfoundation.org>
From: Ziwei Dai <ziwei.dai@unisoc.com>
commit 5da7cb193db32da783a3f3e77d8b639989321d48 upstream.
Memory passed to kvfree_rcu() that is to be freed is tracked by a
per-CPU kfree_rcu_cpu structure, which in turn contains pointers
to kvfree_rcu_bulk_data structures that contain pointers to memory
that has not yet been handed to RCU, along with an kfree_rcu_cpu_work
structure that tracks the memory that has already been handed to RCU.
These structures track three categories of memory: (1) Memory for
kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
during an OOM episode. The first two categories are tracked in a
cache-friendly manner involving a dynamically allocated page of pointers
(the aforementioned kvfree_rcu_bulk_data structures), while the third
uses a simple (but decidedly cache-unfriendly) linked list through the
rcu_head structures in each block of memory.
On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And
the kfree_rcu_monitor() function does in fact check for this.
Except that the kfree_rcu_monitor() function checks these pointers one
at a time. This means that if the previous kfree_rcu() memory passed
to RCU had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon,
that is, out from under unsuspecting RCU readers.
To see this, consider the following sequence of events, in which:
o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
then is preempted.
o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
after a later grace period. Except that "from_cset" is freed
right after the previous grace period ended, so that "from_cset"
is immediately freed. Task A resumes and references "from_cset"'s
member, after which nothing good happens.
In full detail:
CPU 0 CPU 1
---------------------- ----------------------
count_memcg_event_mm()
|rcu_read_lock() <---
|mem_cgroup_from_task()
|// css_set_ptr is the "from_cset" mentioned on CPU 1
|css_set_ptr = rcu_dereference((task)->cgroups)
|// Hard irq comes, current task is scheduled out.
cgroup_attach_task()
|cgroup_migrate()
|cgroup_migrate_execute()
|css_set_move_task(task, from_cset, to_cset, true)
|cgroup_move_task(task, to_cset)
|rcu_assign_pointer(.., to_cset)
|...
|cgroup_migrate_finish()
|put_css_set_locked(from_cset)
|from_cset->refcount return 0
|kfree_rcu(cset, rcu_head) // free from_cset after new gp
|add_ptr_to_bulk_krc_lock()
|schedule_delayed_work(&krcp->monitor_work, ..)
kfree_rcu_monitor()
|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
|queue_rcu_work(system_wq, &krwp->rcu_work)
|if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
|call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp
// There is a perious call_rcu(.., rcu_work_rcufn)
// gp end, rcu_work_rcufn() is called.
rcu_work_rcufn()
|__queue_work(.., rwork->wq, &rwork->work);
|kfree_rcu_work()
|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
|The "from_cset" is freed before new gp end.
// the task resumes some time later.
|css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.
This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.
v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
Fixes: 34c881745549 ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Fixes: 5f3c8d620447 ("rcu/tree: Maintain separate array for vmalloc ptrs")
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
kernel/rcu/tree.c | 27 +++++++++++++++++++--------
1 file changed, 19 insertions(+), 8 deletions(-)
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3131,6 +3131,18 @@ need_offload_krc(struct kfree_rcu_cpu *k
return !!krcp->head;
}
+static bool
+need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
+{
+ int i;
+
+ for (i = 0; i < FREE_N_CHANNELS; i++)
+ if (krwp->bkvhead_free[i])
+ return true;
+
+ return !!krwp->head_free;
+}
+
static void
schedule_delayed_monitor_work(struct kfree_rcu_cpu *krcp)
{
@@ -3162,14 +3174,13 @@ static void kfree_rcu_monitor(struct wor
for (i = 0; i < KFREE_N_BATCHES; i++) {
struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
- // Try to detach bkvhead or head and attach it over any
- // available corresponding free channel. It can be that
- // a previous RCU batch is in progress, it means that
- // immediately to queue another one is not possible so
- // in that case the monitor work is rearmed.
- if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
- (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
- (krcp->head && !krwp->head_free)) {
+ // Try to detach bulk_head or head and attach it, only when
+ // all channels are free. Any channel is not free means at krwp
+ // there is on-going rcu work to handle krwp's free business.
+ if (need_wait_for_krwp_work(krwp))
+ continue;
+
+ if (need_offload_krc(krcp)) {
// Channel 1 corresponds to the SLAB-pointer bulk path.
// Channel 2 corresponds to vmalloc-pointer bulk path.
for (j = 0; j < FREE_N_CHANNELS; j++) {
next prev parent reply other threads:[~2023-04-28 11:28 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-04-28 11:27 [PATCH 6.2 00/15] 6.2.14-rc1 review Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 01/15] rust: arch/um: Disable FP/SIMD instruction to match x86 Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 02/15] um: Only disable SSE on clang to work around old GCC bugs Greg Kroah-Hartman
2023-04-28 11:27 ` Greg Kroah-Hartman [this message]
2023-04-28 11:27 ` [PATCH 6.2 04/15] mm/mempolicy: fix use-after-free of VMA iterator Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 05/15] wifi: brcmfmac: slab-out-of-bounds read in brcmf_get_assoc_ies() Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 06/15] drm/fb-helper: set x/yres_virtual in drm_fb_helper_check_var Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 07/15] gpiolib: acpi: Add a ignore wakeup quirk for Clevo NL5xNU Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 08/15] bluetooth: Perform careful capability checks in hci_sock_ioctl() Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 09/15] wifi: brcmfmac: add Cypress 43439 SDIO ids Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 10/15] btrfs: fix uninitialized variable warnings Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 11/15] USB: serial: option: add UNISOC vendor and TOZED LT70C product Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 12/15] driver core: Dont require dynamic_debug for initcall_debug probe timing Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 13/15] riscv: Move early dtb mapping into the fixmap region Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 14/15] riscv: Do not set initial_boot_params to the linear address of the dtb Greg Kroah-Hartman
2023-04-28 11:27 ` [PATCH 6.2 15/15] riscv: No need to relocate the dtb as it lies in the fixmap region Greg Kroah-Hartman
2023-04-28 22:02 ` [PATCH 6.2 00/15] 6.2.14-rc1 review Naresh Kamboju
2023-04-28 22:26 ` Shuah Khan
2023-04-29 4:10 ` Guenter Roeck
2023-04-29 4:43 ` Ron Economos
2023-04-29 7:49 ` Bagas Sanjaya
2023-04-29 9:55 ` Conor Dooley
2023-04-29 17:14 ` Florian Fainelli
2023-05-02 5:39 ` Chris Paterson
2023-05-02 16:18 ` Jon Hunter
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230428112040.246245037@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=patches@lists.linux.dev \
--cc=paulmck@kernel.org \
--cc=quic_mojha@quicinc.com \
--cc=stable@vger.kernel.org \
--cc=urezki@gmail.com \
--cc=ziwei.dai@unisoc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox