* [PATCH] mm: fix COW mapping handing in generic_access_phys
@ 2025-05-28 1:56 Jinjiang Tu
2025-05-28 8:59 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Jinjiang Tu @ 2025-05-28 1:56 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb,
mhocko
Cc: linux-mm, wangkefeng.wang, tujinjiang
Syzkaller reports a below BUG:
ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
Modules linked in:
CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
Call Trace:
<TASK>
generic_access_phys+0x241/0x480 mm/memory.c:6458
__access_remote_vm+0x6af/0x970 mm/memory.c:6535
access_process_vm+0x53/0x80 mm/memory.c:6600
get_cmdline+0x192/0x380 mm/util.c:1041
audit_log_proctitle kernel/auditsc.c:1620 [inline]
audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
__audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
audit_syscall_exit include/linux/audit.h:356 [inline]
syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
__syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
entry_SYSCALL_64_after_hwframe+0x78/0xe2
The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
The special pfn mapping is replaced by anon folios due to COW.
generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
thus trigger a WARN_ON.
Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
COW mappings"). check if the pte is special to reject Cowed anon folios.
Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
---
mm/memory.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/memory.c b/mm/memory.c
index 49199410805c..e1dac84536ee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
retry:
if (follow_pfnmap_start(&args))
return -EINVAL;
+
+ /* Never return PFNs of anon folios in COW mappings. */
+ if (!args.special) {
+ follow_pfnmap_end(&args);
+ return -EINVAL;
+ }
+
prot = args.pgprot;
phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
writable = args.writable;
--
2.43.0
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 1:56 [PATCH] mm: fix COW mapping handing in generic_access_phys Jinjiang Tu
@ 2025-05-28 8:59 ` David Hildenbrand
2025-05-28 9:59 ` David Hildenbrand
2025-05-28 12:13 ` Jinjiang Tu
0 siblings, 2 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 8:59 UTC (permalink / raw)
To: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko
Cc: linux-mm, wangkefeng.wang, Peter Xu
On 28.05.25 03:56, Jinjiang Tu wrote:
> Syzkaller reports a below BUG:
> ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
> WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> Modules linked in:
> CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> Call Trace:
> <TASK>
> generic_access_phys+0x241/0x480 mm/memory.c:6458
> __access_remote_vm+0x6af/0x970 mm/memory.c:6535
> access_process_vm+0x53/0x80 mm/memory.c:6600
> get_cmdline+0x192/0x380 mm/util.c:1041
> audit_log_proctitle kernel/auditsc.c:1620 [inline]
> audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
> __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
> audit_syscall_exit include/linux/audit.h:356 [inline]
> syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
> __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
> syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
> do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>
> The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
> The special pfn mapping is replaced by anon folios due to COW.
> generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
> thus trigger a WARN_ON.
>
> Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
> COW mappings"). check if the pte is special to reject Cowed anon folios.
>
> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> ---
> mm/memory.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 49199410805c..e1dac84536ee 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> retry:
> if (follow_pfnmap_start(&args))
> return -EINVAL;
> +
> + /* Never return PFNs of anon folios in COW mappings. */
> + if (!args.special) {
> + follow_pfnmap_end(&args);
> + return -EINVAL;
> + }
> +
> prot = args.pgprot;
> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> writable = args.writable;
I assume we trigger this through vma->vm_ops->access, when the vm_ops have generic_access_phys set.
I still dislike exposing the "special" bit here, as it is absolutely not what we should care about in the caller.
In case our arch does not support pte_special, you fix will not catch that case ...
The following might be better:
diff --git a/mm/memory.c b/mm/memory.c
index 37d8738f5e12e..810adb8d1a53b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
prot = args.pgprot;
phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
writable = args.writable;
+
+ /* Refuse (refcounted) anonymous pages in CoW mappings. */
+ if (is_cow_mapping(vma->vm_flags) &&
+ vm_normal_page(vma, addr, ptep_get(args.ptep))) {
+ follow_pfnmap_end(&args);
+ return -EINVAL;
+ }
+
follow_pfnmap_end(&args);
if ((write & FOLL_WRITE) && !writable)
--
Cheers,
David / dhildenb
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 8:59 ` David Hildenbrand
@ 2025-05-28 9:59 ` David Hildenbrand
2025-05-28 12:14 ` Jinjiang Tu
2025-05-28 14:54 ` Peter Xu
2025-05-28 12:13 ` Jinjiang Tu
1 sibling, 2 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 9:59 UTC (permalink / raw)
To: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko
Cc: linux-mm, wangkefeng.wang, Peter Xu
On 28.05.25 10:59, David Hildenbrand wrote:
> On 28.05.25 03:56, Jinjiang Tu wrote:
>> Syzkaller reports a below BUG:
>> ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
>> WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>> Modules linked in:
>> CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>> Call Trace:
>> <TASK>
>> generic_access_phys+0x241/0x480 mm/memory.c:6458
>> __access_remote_vm+0x6af/0x970 mm/memory.c:6535
>> access_process_vm+0x53/0x80 mm/memory.c:6600
>> get_cmdline+0x192/0x380 mm/util.c:1041
>> audit_log_proctitle kernel/auditsc.c:1620 [inline]
>> audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
>> __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
>> audit_syscall_exit include/linux/audit.h:356 [inline]
>> syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
>> __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
>> syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
>> do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
>> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>>
>> The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
>> The special pfn mapping is replaced by anon folios due to COW.
>> generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
>> thus trigger a WARN_ON.
>>
>> Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
>> COW mappings"). check if the pte is special to reject Cowed anon folios.
>>
>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>> ---
>> mm/memory.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 49199410805c..e1dac84536ee 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>> retry:
>> if (follow_pfnmap_start(&args))
>> return -EINVAL;
>> +
>> + /* Never return PFNs of anon folios in COW mappings. */
>> + if (!args.special) {
>> + follow_pfnmap_end(&args);
>> + return -EINVAL;
>> + }
>> +
>> prot = args.pgprot;
>> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
>> writable = args.writable;
>
> I assume we trigger this through vma->vm_ops->access, when the vm_ops have generic_access_phys set.
>
> I still dislike exposing the "special" bit here, as it is absolutely not what we should care about in the caller.
>
> In case our arch does not support pte_special, you fix will not catch that case ...
>
> The following might be better:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 37d8738f5e12e..810adb8d1a53b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> prot = args.pgprot;
> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> writable = args.writable;
> +
> + /* Refuse (refcounted) anonymous pages in CoW mappings. */
> + if (is_cow_mapping(vma->vm_flags) &&
> + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
> + follow_pfnmap_end(&args);
> + return -EINVAL;
> + }
> +
Thinking again, we might have a PMD/PUD mapping, so maybe
follow_pfnmap_start() should really just refuse any refcounted pages.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 8:59 ` David Hildenbrand
2025-05-28 9:59 ` David Hildenbrand
@ 2025-05-28 12:13 ` Jinjiang Tu
1 sibling, 0 replies; 23+ messages in thread
From: Jinjiang Tu @ 2025-05-28 12:13 UTC (permalink / raw)
To: David Hildenbrand, akpm, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko
Cc: linux-mm, wangkefeng.wang, Peter Xu
在 2025/5/28 16:59, David Hildenbrand 写道:
> On 28.05.25 03:56, Jinjiang Tu wrote:
>> Syzkaller reports a below BUG:
>> ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
>> WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216
>> __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>> Modules linked in:
>> CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>> Call Trace:
>> <TASK>
>> generic_access_phys+0x241/0x480 mm/memory.c:6458
>> __access_remote_vm+0x6af/0x970 mm/memory.c:6535
>> access_process_vm+0x53/0x80 mm/memory.c:6600
>> get_cmdline+0x192/0x380 mm/util.c:1041
>> audit_log_proctitle kernel/auditsc.c:1620 [inline]
>> audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
>> __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
>> audit_syscall_exit include/linux/audit.h:356 [inline]
>> syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
>> __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
>> syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
>> do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
>> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>>
>> The /dev/mem is mapped with COW mapping, and mremap at the
>> mm->args_start.
>> The special pfn mapping is replaced by anon folios due to COW.
>> generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
>> thus trigger a WARN_ON.
>>
>> Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
>> COW mappings"). check if the pte is special to reject Cowed anon folios.
>>
>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>> ---
>> mm/memory.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 49199410805c..e1dac84536ee 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct
>> *vma, unsigned long addr,
>> retry:
>> if (follow_pfnmap_start(&args))
>> return -EINVAL;
>> +
>> + /* Never return PFNs of anon folios in COW mappings. */
>> + if (!args.special) {
>> + follow_pfnmap_end(&args);
>> + return -EINVAL;
>> + }
>> +
>> prot = args.pgprot;
>> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
>> writable = args.writable;
>
> I assume we trigger this through vma->vm_ops->access, when the vm_ops
> have generic_access_phys set.
>
> I still dislike exposing the "special" bit here, as it is absolutely
> not what we should care about in the caller.
>
> In case our arch does not support pte_special, you fix will not catch
> that case ...
Yes, I reference to the check in follow_phys(), but forgot
follow_phys() is only defined in x86, which supports pte special.
>
> The following might be better:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 37d8738f5e12e..810adb8d1a53b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct
> *vma, unsigned long addr,
> prot = args.pgprot;
> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> writable = args.writable;
> +
> + /* Refuse (refcounted) anonymous pages in CoW mappings. */
> + if (is_cow_mapping(vma->vm_flags) &&
> + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
> + follow_pfnmap_end(&args);
> + return -EINVAL;
> + }
> +
> follow_pfnmap_end(&args);
>
> if ((write & FOLL_WRITE) && !writable)
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 9:59 ` David Hildenbrand
@ 2025-05-28 12:14 ` Jinjiang Tu
2025-05-28 14:54 ` Peter Xu
1 sibling, 0 replies; 23+ messages in thread
From: Jinjiang Tu @ 2025-05-28 12:14 UTC (permalink / raw)
To: David Hildenbrand, akpm, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko
Cc: linux-mm, wangkefeng.wang, Peter Xu
在 2025/5/28 17:59, David Hildenbrand 写道:
> On 28.05.25 10:59, David Hildenbrand wrote:
>> On 28.05.25 03:56, Jinjiang Tu wrote:
>>> Syzkaller reports a below BUG:
>>> ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
>>> WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216
>>> __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>>> Modules linked in:
>>> CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>>> rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>> RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>>> Call Trace:
>>> <TASK>
>>> generic_access_phys+0x241/0x480 mm/memory.c:6458
>>> __access_remote_vm+0x6af/0x970 mm/memory.c:6535
>>> access_process_vm+0x53/0x80 mm/memory.c:6600
>>> get_cmdline+0x192/0x380 mm/util.c:1041
>>> audit_log_proctitle kernel/auditsc.c:1620 [inline]
>>> audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
>>> __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
>>> audit_syscall_exit include/linux/audit.h:356 [inline]
>>> syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
>>> __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
>>> syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
>>> do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
>>> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>>>
>>> The /dev/mem is mapped with COW mapping, and mremap at the
>>> mm->args_start.
>>> The special pfn mapping is replaced by anon folios due to COW.
>>> generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
>>> thus trigger a WARN_ON.
>>>
>>> Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
>>> COW mappings"). check if the pte is special to reject Cowed anon
>>> folios.
>>>
>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>> ---
>>> mm/memory.c | 7 +++++++
>>> 1 file changed, 7 insertions(+)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 49199410805c..e1dac84536ee 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct
>>> *vma, unsigned long addr,
>>> retry:
>>> if (follow_pfnmap_start(&args))
>>> return -EINVAL;
>>> +
>>> + /* Never return PFNs of anon folios in COW mappings. */
>>> + if (!args.special) {
>>> + follow_pfnmap_end(&args);
>>> + return -EINVAL;
>>> + }
>>> +
>>> prot = args.pgprot;
>>> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
>>> writable = args.writable;
>>
>> I assume we trigger this through vma->vm_ops->access, when the vm_ops
>> have generic_access_phys set.
>>
>> I still dislike exposing the "special" bit here, as it is absolutely
>> not what we should care about in the caller.
>>
>> In case our arch does not support pte_special, you fix will not catch
>> that case ...
>>
>> The following might be better:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 37d8738f5e12e..810adb8d1a53b 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct
>> *vma, unsigned long addr,
>> prot = args.pgprot;
>> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
>> writable = args.writable;
>> +
>> + /* Refuse (refcounted) anonymous pages in CoW mappings. */
>> + if (is_cow_mapping(vma->vm_flags) &&
>> + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
>> + follow_pfnmap_end(&args);
>> + return -EINVAL;
>> + }
>> +
>
> Thinking again, we might have a PMD/PUD mapping, so maybe
> follow_pfnmap_start() should really just refuse any refcounted pages.
Thanks for reviewing, I will investigate it.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 9:59 ` David Hildenbrand
2025-05-28 12:14 ` Jinjiang Tu
@ 2025-05-28 14:54 ` Peter Xu
2025-05-28 15:02 ` David Hildenbrand
1 sibling, 1 reply; 23+ messages in thread
From: Peter Xu @ 2025-05-28 14:54 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, wangkefeng.wang
[Add Jason]
On Wed, May 28, 2025 at 11:59:56AM +0200, David Hildenbrand wrote:
> On 28.05.25 10:59, David Hildenbrand wrote:
> > On 28.05.25 03:56, Jinjiang Tu wrote:
> > > Syzkaller reports a below BUG:
> > > ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
> > > WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> > > Modules linked in:
> > > CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > > RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> > > Call Trace:
> > > <TASK>
> > > generic_access_phys+0x241/0x480 mm/memory.c:6458
> > > __access_remote_vm+0x6af/0x970 mm/memory.c:6535
> > > access_process_vm+0x53/0x80 mm/memory.c:6600
> > > get_cmdline+0x192/0x380 mm/util.c:1041
> > > audit_log_proctitle kernel/auditsc.c:1620 [inline]
> > > audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
> > > __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
> > > audit_syscall_exit include/linux/audit.h:356 [inline]
> > > syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
> > > __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
> > > syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
> > > do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
> > > entry_SYSCALL_64_after_hwframe+0x78/0xe2
> > >
> > > The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
> > > The special pfn mapping is replaced by anon folios due to COW.
> > > generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
> > > thus trigger a WARN_ON.
> > >
> > > Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
> > > COW mappings"). check if the pte is special to reject Cowed anon folios.
> > >
> > > Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> > > ---
> > > mm/memory.c | 7 +++++++
> > > 1 file changed, 7 insertions(+)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 49199410805c..e1dac84536ee 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> > > retry:
> > > if (follow_pfnmap_start(&args))
> > > return -EINVAL;
> > > +
> > > + /* Never return PFNs of anon folios in COW mappings. */
> > > + if (!args.special) {
> > > + follow_pfnmap_end(&args);
> > > + return -EINVAL;
> > > + }
> > > +
> > > prot = args.pgprot;
> > > phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> > > writable = args.writable;
> >
> > I assume we trigger this through vma->vm_ops->access, when the vm_ops have generic_access_phys set.
> >
> > I still dislike exposing the "special" bit here, as it is absolutely not what we should care about in the caller.
> >
> > In case our arch does not support pte_special, you fix will not catch that case ...
> >
> > The following might be better:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 37d8738f5e12e..810adb8d1a53b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> > prot = args.pgprot;
> > phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> > writable = args.writable;
> > +
> > + /* Refuse (refcounted) anonymous pages in CoW mappings. */
> > + if (is_cow_mapping(vma->vm_flags) &&
> > + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
> > + follow_pfnmap_end(&args);
> > + return -EINVAL;
> > + }
> > +
>
> Thinking again, we might have a PMD/PUD mapping, so maybe
> follow_pfnmap_start() should really just refuse any refcounted pages.
We may want to be careful on this.
I feel like we can still potentially break drivers that
follow_pfnmap_start() used to work on debateable things like RAM page
injections, unless breaking them is the intention.
OTOH, I also see at least two in-tree drivers set VM_IO|VM_MIXEDMAP:
*** drivers/gpu/drm/gma500/fbdev.c:
psb_fbdev_fb_mmap[110] vm_flags_set(vma, VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP);
*** drivers/gpu/drm/omapdrm/omap_gem.c:
omap_gem_object_mmap[538] vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP | VM_IO | VM_MIXEDMAP);
AFAIU, these MIXEDMAP users will still rely on follow_pfnmap_start() to
work on e.g. RAM pages, because GUP will simply fail them..
Sligtly off-topic: it's also a bit confusing to me on whether the driver
should set VM_IO for VM_MIXEDMAP. I think it should because VM_IO says:
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
IIUC it implies it's IO mapping so e.g. the cache behavior can be
different. But I'm not very sure now seeing both kinds of driver exist that
only sets MIXEDMAP while the above two set MIXEDMAP|IO.
--
Peter Xu
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 14:54 ` Peter Xu
@ 2025-05-28 15:02 ` David Hildenbrand
2025-05-28 15:25 ` Peter Xu
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 15:02 UTC (permalink / raw)
To: Peter Xu
Cc: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, wangkefeng.wang
On 28.05.25 16:54, Peter Xu wrote:
> [Add Jason]
>
> On Wed, May 28, 2025 at 11:59:56AM +0200, David Hildenbrand wrote:
>> On 28.05.25 10:59, David Hildenbrand wrote:
>>> On 28.05.25 03:56, Jinjiang Tu wrote:
>>>> Syzkaller reports a below BUG:
>>>> ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
>>>> WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>>>> Modules linked in:
>>>> CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>>> RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>>>> Call Trace:
>>>> <TASK>
>>>> generic_access_phys+0x241/0x480 mm/memory.c:6458
>>>> __access_remote_vm+0x6af/0x970 mm/memory.c:6535
>>>> access_process_vm+0x53/0x80 mm/memory.c:6600
>>>> get_cmdline+0x192/0x380 mm/util.c:1041
>>>> audit_log_proctitle kernel/auditsc.c:1620 [inline]
>>>> audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
>>>> __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
>>>> audit_syscall_exit include/linux/audit.h:356 [inline]
>>>> syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
>>>> __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
>>>> syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
>>>> do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
>>>> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>>>>
>>>> The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
>>>> The special pfn mapping is replaced by anon folios due to COW.
>>>> generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
>>>> thus trigger a WARN_ON.
>>>>
>>>> Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
>>>> COW mappings"). check if the pte is special to reject Cowed anon folios.
>>>>
>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>> ---
>>>> mm/memory.c | 7 +++++++
>>>> 1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 49199410805c..e1dac84536ee 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>>> retry:
>>>> if (follow_pfnmap_start(&args))
>>>> return -EINVAL;
>>>> +
>>>> + /* Never return PFNs of anon folios in COW mappings. */
>>>> + if (!args.special) {
>>>> + follow_pfnmap_end(&args);
>>>> + return -EINVAL;
>>>> + }
>>>> +
>>>> prot = args.pgprot;
>>>> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
>>>> writable = args.writable;
>>>
>>> I assume we trigger this through vma->vm_ops->access, when the vm_ops have generic_access_phys set.
>>>
>>> I still dislike exposing the "special" bit here, as it is absolutely not what we should care about in the caller.
>>>
>>> In case our arch does not support pte_special, you fix will not catch that case ...
>>>
>>> The following might be better:
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 37d8738f5e12e..810adb8d1a53b 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>> prot = args.pgprot;
>>> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
>>> writable = args.writable;
>>> +
>>> + /* Refuse (refcounted) anonymous pages in CoW mappings. */
>>> + if (is_cow_mapping(vma->vm_flags) &&
>>> + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
>>> + follow_pfnmap_end(&args);
>>> + return -EINVAL;
>>> + }
>>> +
>>
>> Thinking again, we might have a PMD/PUD mapping, so maybe
>> follow_pfnmap_start() should really just refuse any refcounted pages.
>
> We may want to be careful on this.
>
> I feel like we can still potentially break drivers that
> follow_pfnmap_start() used to work on debateable things like RAM page
> injections, unless breaking them is the intention.
Yes, that all needs a cleanup likely; it's all very confusing and
inconsistent.
>
> OTOH, I also see at least two in-tree drivers set VM_IO|VM_MIXEDMAP:
>
> *** drivers/gpu/drm/gma500/fbdev.c:
> psb_fbdev_fb_mmap[110] vm_flags_set(vma, VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP);
>
> *** drivers/gpu/drm/omapdrm/omap_gem.c:
> omap_gem_object_mmap[538] vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP | VM_IO | VM_MIXEDMAP);
>
> AFAIU, these MIXEDMAP users will still rely on follow_pfnmap_start() to
> work on e.g. RAM pages, because GUP will simply fail them..
Right.
VM_IO essentially tells us "don't touch this memory, it might have side
effects", such as MMIO, that's why GUP outright refuses VM_IO VMAs.
I am not sure why generic_access_phys() should be allowed to ... touch
that memory instead?
Confusing.
>
> Sligtly off-topic: it's also a bit confusing to me on whether the driver
> should set VM_IO for VM_MIXEDMAP. I think it should because VM_IO says:
Depends on the driver. Some VM_MIXEDMAP users only map ordinary kernel
allocations, where VM_IO is not required. For these drivers, I really
don't know.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 15:02 ` David Hildenbrand
@ 2025-05-28 15:25 ` Peter Xu
2025-05-28 15:29 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2025-05-28 15:25 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, wangkefeng.wang
On Wed, May 28, 2025 at 05:02:15PM +0200, David Hildenbrand wrote:
> On 28.05.25 16:54, Peter Xu wrote:
> > [Add Jason]
> >
> > On Wed, May 28, 2025 at 11:59:56AM +0200, David Hildenbrand wrote:
> > > On 28.05.25 10:59, David Hildenbrand wrote:
> > > > On 28.05.25 03:56, Jinjiang Tu wrote:
> > > > > Syzkaller reports a below BUG:
> > > > > ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
> > > > > WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> > > > > Modules linked in:
> > > > > CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
> > > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > > > > RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> > > > > Call Trace:
> > > > > <TASK>
> > > > > generic_access_phys+0x241/0x480 mm/memory.c:6458
> > > > > __access_remote_vm+0x6af/0x970 mm/memory.c:6535
> > > > > access_process_vm+0x53/0x80 mm/memory.c:6600
> > > > > get_cmdline+0x192/0x380 mm/util.c:1041
> > > > > audit_log_proctitle kernel/auditsc.c:1620 [inline]
> > > > > audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
> > > > > __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
> > > > > audit_syscall_exit include/linux/audit.h:356 [inline]
> > > > > syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
> > > > > __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
> > > > > syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
> > > > > do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
> > > > > entry_SYSCALL_64_after_hwframe+0x78/0xe2
> > > > >
> > > > > The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
> > > > > The special pfn mapping is replaced by anon folios due to COW.
> > > > > generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
> > > > > thus trigger a WARN_ON.
> > > > >
> > > > > Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
> > > > > COW mappings"). check if the pte is special to reject Cowed anon folios.
> > > > >
> > > > > Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> > > > > ---
> > > > > mm/memory.c | 7 +++++++
> > > > > 1 file changed, 7 insertions(+)
> > > > >
> > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > index 49199410805c..e1dac84536ee 100644
> > > > > --- a/mm/memory.c
> > > > > +++ b/mm/memory.c
> > > > > @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> > > > > retry:
> > > > > if (follow_pfnmap_start(&args))
> > > > > return -EINVAL;
> > > > > +
> > > > > + /* Never return PFNs of anon folios in COW mappings. */
> > > > > + if (!args.special) {
> > > > > + follow_pfnmap_end(&args);
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > prot = args.pgprot;
> > > > > phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> > > > > writable = args.writable;
> > > >
> > > > I assume we trigger this through vma->vm_ops->access, when the vm_ops have generic_access_phys set.
> > > >
> > > > I still dislike exposing the "special" bit here, as it is absolutely not what we should care about in the caller.
> > > >
> > > > In case our arch does not support pte_special, you fix will not catch that case ...
> > > >
> > > > The following might be better:
> > > >
> > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > index 37d8738f5e12e..810adb8d1a53b 100644
> > > > --- a/mm/memory.c
> > > > +++ b/mm/memory.c
> > > > @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> > > > prot = args.pgprot;
> > > > phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> > > > writable = args.writable;
> > > > +
> > > > + /* Refuse (refcounted) anonymous pages in CoW mappings. */
> > > > + if (is_cow_mapping(vma->vm_flags) &&
> > > > + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
> > > > + follow_pfnmap_end(&args);
> > > > + return -EINVAL;
> > > > + }
> > > > +
> > >
> > > Thinking again, we might have a PMD/PUD mapping, so maybe
> > > follow_pfnmap_start() should really just refuse any refcounted pages.
[1]
> >
> > We may want to be careful on this.
> >
> > I feel like we can still potentially break drivers that
> > follow_pfnmap_start() used to work on debateable things like RAM page
> > injections, unless breaking them is the intention.
>
> Yes, that all needs a cleanup likely; it's all very confusing and
> inconsistent.
>
> >
> > OTOH, I also see at least two in-tree drivers set VM_IO|VM_MIXEDMAP:
> >
> > *** drivers/gpu/drm/gma500/fbdev.c:
> > psb_fbdev_fb_mmap[110] vm_flags_set(vma, VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP);
> >
> > *** drivers/gpu/drm/omapdrm/omap_gem.c:
> > omap_gem_object_mmap[538] vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP | VM_IO | VM_MIXEDMAP);
> >
> > AFAIU, these MIXEDMAP users will still rely on follow_pfnmap_start() to
> > work on e.g. RAM pages, because GUP will simply fail them..
>
> Right.
>
> VM_IO essentially tells us "don't touch this memory, it might have side
> effects", such as MMIO, that's why GUP outright refuses VM_IO VMAs.
>
> I am not sure why generic_access_phys() should be allowed to ... touch that
> memory instead?
I'm looking at:
commit 28b2ee20c7cba812b6f2ccf6d722cf86d00a84dc
Author: Rik van Riel <riel@redhat.com>
Date: Wed Jul 23 21:27:05 2008 -0700
access_process_vm device memory infrastructure
VM_IO is also intentionally mentioned in the doc too:
Documentation/filesystems/locking.rst
->access() is called when get_user_pages() fails in
access_process_vm(), typically used to debug a process through
/proc/pid/mem or ptrace. This function is needed only for
VM_IO | VM_PFNMAP VMAs.
So it definitely looks like intentional, though I know nothing about PPC
Cell SPUs..
>
> Confusing.
>
> >
> > Sligtly off-topic: it's also a bit confusing to me on whether the driver
> > should set VM_IO for VM_MIXEDMAP. I think it should because VM_IO says:
>
> Depends on the driver. Some VM_MIXEDMAP users only map ordinary kernel
> allocations, where VM_IO is not required. For these drivers, I really don't
> know.
>
> --
> Cheers,
>
> David / dhildenb
>
--
Peter Xu
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 15:25 ` Peter Xu
@ 2025-05-28 15:29 ` David Hildenbrand
2025-05-28 16:06 ` Peter Xu
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 15:29 UTC (permalink / raw)
To: Peter Xu
Cc: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, wangkefeng.wang
On 28.05.25 17:25, Peter Xu wrote:
> On Wed, May 28, 2025 at 05:02:15PM +0200, David Hildenbrand wrote:
>> On 28.05.25 16:54, Peter Xu wrote:
>>> [Add Jason]
>>>
>>> On Wed, May 28, 2025 at 11:59:56AM +0200, David Hildenbrand wrote:
>>>> On 28.05.25 10:59, David Hildenbrand wrote:
>>>>> On 28.05.25 03:56, Jinjiang Tu wrote:
>>>>>> Syzkaller reports a below BUG:
>>>>>> ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
>>>>>> WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>>>>>> Modules linked in:
>>>>>> CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
>>>>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>>>>> RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
>>>>>> Call Trace:
>>>>>> <TASK>
>>>>>> generic_access_phys+0x241/0x480 mm/memory.c:6458
>>>>>> __access_remote_vm+0x6af/0x970 mm/memory.c:6535
>>>>>> access_process_vm+0x53/0x80 mm/memory.c:6600
>>>>>> get_cmdline+0x192/0x380 mm/util.c:1041
>>>>>> audit_log_proctitle kernel/auditsc.c:1620 [inline]
>>>>>> audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
>>>>>> __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
>>>>>> audit_syscall_exit include/linux/audit.h:356 [inline]
>>>>>> syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
>>>>>> __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
>>>>>> syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
>>>>>> do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
>>>>>> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>>>>>>
>>>>>> The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
>>>>>> The special pfn mapping is replaced by anon folios due to COW.
>>>>>> generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
>>>>>> thus trigger a WARN_ON.
>>>>>>
>>>>>> Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
>>>>>> COW mappings"). check if the pte is special to reject Cowed anon folios.
>>>>>>
>>>>>> Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
>>>>>> ---
>>>>>> mm/memory.c | 7 +++++++
>>>>>> 1 file changed, 7 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 49199410805c..e1dac84536ee 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>>>>> retry:
>>>>>> if (follow_pfnmap_start(&args))
>>>>>> return -EINVAL;
>>>>>> +
>>>>>> + /* Never return PFNs of anon folios in COW mappings. */
>>>>>> + if (!args.special) {
>>>>>> + follow_pfnmap_end(&args);
>>>>>> + return -EINVAL;
>>>>>> + }
>>>>>> +
>>>>>> prot = args.pgprot;
>>>>>> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
>>>>>> writable = args.writable;
>>>>>
>>>>> I assume we trigger this through vma->vm_ops->access, when the vm_ops have generic_access_phys set.
>>>>>
>>>>> I still dislike exposing the "special" bit here, as it is absolutely not what we should care about in the caller.
>>>>>
>>>>> In case our arch does not support pte_special, you fix will not catch that case ...
>>>>>
>>>>> The following might be better:
>>>>>
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 37d8738f5e12e..810adb8d1a53b 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
>>>>> prot = args.pgprot;
>>>>> phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
>>>>> writable = args.writable;
>>>>> +
>>>>> + /* Refuse (refcounted) anonymous pages in CoW mappings. */
>>>>> + if (is_cow_mapping(vma->vm_flags) &&
>>>>> + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
>>>>> + follow_pfnmap_end(&args);
>>>>> + return -EINVAL;
>>>>> + }
>>>>> +
>>>>
>>>> Thinking again, we might have a PMD/PUD mapping, so maybe
>>>> follow_pfnmap_start() should really just refuse any refcounted pages.
>
> [1]
>
>>>
>>> We may want to be careful on this.
>>>
>>> I feel like we can still potentially break drivers that
>>> follow_pfnmap_start() used to work on debateable things like RAM page
>>> injections, unless breaking them is the intention.
>>
>> Yes, that all needs a cleanup likely; it's all very confusing and
>> inconsistent.
>>
>>>
>>> OTOH, I also see at least two in-tree drivers set VM_IO|VM_MIXEDMAP:
>>>
>>> *** drivers/gpu/drm/gma500/fbdev.c:
>>> psb_fbdev_fb_mmap[110] vm_flags_set(vma, VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP);
>>>
>>> *** drivers/gpu/drm/omapdrm/omap_gem.c:
>>> omap_gem_object_mmap[538] vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP | VM_IO | VM_MIXEDMAP);
>>>
>>> AFAIU, these MIXEDMAP users will still rely on follow_pfnmap_start() to
>>> work on e.g. RAM pages, because GUP will simply fail them..
>>
>> Right.
>>
>> VM_IO essentially tells us "don't touch this memory, it might have side
>> effects", such as MMIO, that's why GUP outright refuses VM_IO VMAs.
>>
>> I am not sure why generic_access_phys() should be allowed to ... touch that
>> memory instead?
>
> I'm looking at:
>
> commit 28b2ee20c7cba812b6f2ccf6d722cf86d00a84dc
> Author: Rik van Riel <riel@redhat.com>
> Date: Wed Jul 23 21:27:05 2008 -0700
>
> access_process_vm device memory infrastructure
>
> VM_IO is also intentionally mentioned in the doc too:
>
> Documentation/filesystems/locking.rst
>
> ->access() is called when get_user_pages() fails in
> access_process_vm(), typically used to debug a process through
> /proc/pid/mem or ptrace. This function is needed only for
> VM_IO | VM_PFNMAP VMAs.
>
> So it definitely looks like intentional, though I know nothing about PPC
> Cell SPUs..
VM_IO | VM_PFNMAP, I can understand that. It's weird combined with weird.
But the use case for "VM_IO | VM_MIXEDMAP" ?
To be precise, I am questioning if follow_pfnmap_start() should only
work on ...
PFNMAPs ?
:)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 15:29 ` David Hildenbrand
@ 2025-05-28 16:06 ` Peter Xu
2025-05-28 16:29 ` Jason Gunthorpe
0 siblings, 1 reply; 23+ messages in thread
From: Peter Xu @ 2025-05-28 16:06 UTC (permalink / raw)
To: David Hildenbrand
Cc: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, wangkefeng.wang, Jason Gunthorpe
On Wed, May 28, 2025 at 05:29:29PM +0200, David Hildenbrand wrote:
> On 28.05.25 17:25, Peter Xu wrote:
> > On Wed, May 28, 2025 at 05:02:15PM +0200, David Hildenbrand wrote:
> > > On 28.05.25 16:54, Peter Xu wrote:
> > > > [Add Jason]
> > > >
> > > > On Wed, May 28, 2025 at 11:59:56AM +0200, David Hildenbrand wrote:
> > > > > On 28.05.25 10:59, David Hildenbrand wrote:
> > > > > > On 28.05.25 03:56, Jinjiang Tu wrote:
> > > > > > > Syzkaller reports a below BUG:
> > > > > > > ioremap on RAM at 0x0000000022727000 - 0x0000000022727fff
> > > > > > > WARNING: CPU: 3 PID: 3609 at arch/x86/mm/ioremap.c:216 __ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> > > > > > > Modules linked in:
> > > > > > > CPU: 3 PID: 3609 Comm: syz.2.577 Not tainted 6.6.0+ #63
> > > > > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > > > > > > RIP: 0010:__ioremap_caller+0x644/0x7f0 arch/x86/mm/ioremap.c:216
> > > > > > > Call Trace:
> > > > > > > <TASK>
> > > > > > > generic_access_phys+0x241/0x480 mm/memory.c:6458
> > > > > > > __access_remote_vm+0x6af/0x970 mm/memory.c:6535
> > > > > > > access_process_vm+0x53/0x80 mm/memory.c:6600
> > > > > > > get_cmdline+0x192/0x380 mm/util.c:1041
> > > > > > > audit_log_proctitle kernel/auditsc.c:1620 [inline]
> > > > > > > audit_log_exit+0x1424/0x18c0 kernel/auditsc.c:1811
> > > > > > > __audit_syscall_exit+0x252/0x2f0 kernel/auditsc.c:2079
> > > > > > > audit_syscall_exit include/linux/audit.h:356 [inline]
> > > > > > > syscall_exit_work+0x10f/0x130 kernel/entry/common.c:166
> > > > > > > __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
> > > > > > > syscall_exit_to_user_mode+0x10/0x1e0 kernel/entry/common.c:218
> > > > > > > do_syscall_64+0x66/0x110 arch/x86/entry/common.c:87
> > > > > > > entry_SYSCALL_64_after_hwframe+0x78/0xe2
> > > > > > >
> > > > > > > The /dev/mem is mapped with COW mapping, and mremap at the mm->args_start.
> > > > > > > The special pfn mapping is replaced by anon folios due to COW.
> > > > > > > generic_access_phys() is supposed to handle iomem, instead of RAM pfn,
> > > > > > > thus trigger a WARN_ON.
> > > > > > >
> > > > > > > Similar to commit 04c35ab3bdae ("x86/mm/pat: fix VM_PAT handling in
> > > > > > > COW mappings"). check if the pte is special to reject Cowed anon folios.
> > > > > > >
> > > > > > > Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com>
> > > > > > > ---
> > > > > > > mm/memory.c | 7 +++++++
> > > > > > > 1 file changed, 7 insertions(+)
> > > > > > >
> > > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > > index 49199410805c..e1dac84536ee 100644
> > > > > > > --- a/mm/memory.c
> > > > > > > +++ b/mm/memory.c
> > > > > > > @@ -6840,6 +6840,13 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> > > > > > > retry:
> > > > > > > if (follow_pfnmap_start(&args))
> > > > > > > return -EINVAL;
> > > > > > > +
> > > > > > > + /* Never return PFNs of anon folios in COW mappings. */
> > > > > > > + if (!args.special) {
> > > > > > > + follow_pfnmap_end(&args);
> > > > > > > + return -EINVAL;
> > > > > > > + }
> > > > > > > +
> > > > > > > prot = args.pgprot;
> > > > > > > phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> > > > > > > writable = args.writable;
> > > > > >
> > > > > > I assume we trigger this through vma->vm_ops->access, when the vm_ops have generic_access_phys set.
> > > > > >
> > > > > > I still dislike exposing the "special" bit here, as it is absolutely not what we should care about in the caller.
> > > > > >
> > > > > > In case our arch does not support pte_special, you fix will not catch that case ...
> > > > > >
> > > > > > The following might be better:
> > > > > >
> > > > > > diff --git a/mm/memory.c b/mm/memory.c
> > > > > > index 37d8738f5e12e..810adb8d1a53b 100644
> > > > > > --- a/mm/memory.c
> > > > > > +++ b/mm/memory.c
> > > > > > @@ -6681,6 +6681,14 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr,
> > > > > > prot = args.pgprot;
> > > > > > phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
> > > > > > writable = args.writable;
> > > > > > +
> > > > > > + /* Refuse (refcounted) anonymous pages in CoW mappings. */
> > > > > > + if (is_cow_mapping(vma->vm_flags) &&
> > > > > > + vm_normal_page(vma, addr, ptep_get(args.ptep))) {
> > > > > > + follow_pfnmap_end(&args);
> > > > > > + return -EINVAL;
> > > > > > + }
> > > > > > +
> > > > >
> > > > > Thinking again, we might have a PMD/PUD mapping, so maybe
> > > > > follow_pfnmap_start() should really just refuse any refcounted pages.
> >
> > [1]
> >
> > > >
> > > > We may want to be careful on this.
> > > >
> > > > I feel like we can still potentially break drivers that
> > > > follow_pfnmap_start() used to work on debateable things like RAM page
> > > > injections, unless breaking them is the intention.
> > >
> > > Yes, that all needs a cleanup likely; it's all very confusing and
> > > inconsistent.
> > >
> > > >
> > > > OTOH, I also see at least two in-tree drivers set VM_IO|VM_MIXEDMAP:
> > > >
> > > > *** drivers/gpu/drm/gma500/fbdev.c:
> > > > psb_fbdev_fb_mmap[110] vm_flags_set(vma, VM_IO | VM_MIXEDMAP | VM_DONTEXPAND | VM_DONTDUMP);
> > > >
> > > > *** drivers/gpu/drm/omapdrm/omap_gem.c:
> > > > omap_gem_object_mmap[538] vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP | VM_IO | VM_MIXEDMAP);
> > > >
> > > > AFAIU, these MIXEDMAP users will still rely on follow_pfnmap_start() to
> > > > work on e.g. RAM pages, because GUP will simply fail them..
> > >
> > > Right.
> > >
> > > VM_IO essentially tells us "don't touch this memory, it might have side
> > > effects", such as MMIO, that's why GUP outright refuses VM_IO VMAs.
> > >
> > > I am not sure why generic_access_phys() should be allowed to ... touch that
> > > memory instead?
> >
> > I'm looking at:
> >
> > commit 28b2ee20c7cba812b6f2ccf6d722cf86d00a84dc
> > Author: Rik van Riel <riel@redhat.com>
> > Date: Wed Jul 23 21:27:05 2008 -0700
> >
> > access_process_vm device memory infrastructure
> >
> > VM_IO is also intentionally mentioned in the doc too:
> >
> > Documentation/filesystems/locking.rst
> >
> > ->access() is called when get_user_pages() fails in
> > access_process_vm(), typically used to debug a process through
> > /proc/pid/mem or ptrace. This function is needed only for
> > VM_IO | VM_PFNMAP VMAs.
> >
> > So it definitely looks like intentional, though I know nothing about PPC
> > Cell SPUs..
>
>
> VM_IO | VM_PFNMAP, I can understand that. It's weird combined with weird.
>
> But the use case for "VM_IO | VM_MIXEDMAP" ?
>
> To be precise, I am questioning if follow_pfnmap_start() should only work on
> ...
>
> PFNMAPs ?
>
> :)
It goes back to the question on whether things will break which used to
work..
I was almost conservative as I know little on driver side, that's also why
I tend to prefer making iov_iter work with ram-mapped VM_PFNMAP vmas too.
After all, AFAIU Linux should try to not break working users; sometimes we
pay for that.
Meanwhile:
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
I'm not confident to blame any driver yet to have those special cases for
VM_PFNMAP, because it only says "managed without struct page", it didn't
say "it must not contain struct page".. Hence it hints the core mm "please
do not manage these mappings with struct page at all". Still sounds fair
contract, even if not ideal.
But yeah, once again I agree that'll be ideal if what you said could happen
some day.
[I wanted to copy Jason but I failed the job; do it this time]
--
Peter Xu
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 16:06 ` Peter Xu
@ 2025-05-28 16:29 ` Jason Gunthorpe
2025-05-28 17:14 ` Peter Xu
2025-05-28 17:32 ` David Hildenbrand
0 siblings, 2 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2025-05-28 16:29 UTC (permalink / raw)
To: Peter Xu
Cc: David Hildenbrand, Jinjiang Tu, akpm, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm,
wangkefeng.wang
On Wed, May 28, 2025 at 12:06:07PM -0400, Peter Xu wrote:
> #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
>
> I'm not confident to blame any driver yet to have those special cases for
> VM_PFNMAP, because it only says "managed without struct page", it didn't
> say "it must not contain struct page".. Hence it hints the core mm "please
> do not manage these mappings with struct page at all". Still sounds fair
> contract, even if not ideal.
I think it is pretty clear, if a VMA has VM_PFNMAP then nothing must
ever try to obtain a struct page from any PTEs in it, for any reason,
even if things in it might have a struct page. In practice it means
nothing can call vm_normal_page() on a VM_PFNMAP.
It would be nice to update the comment to make it clearer.
If the VMA owner wanted to permit access to the struct page then it
should have used VM_MIXEDMAP.
The fundamental difference between PFNMAP and MIXEDMAP is that
vm_normal_page() is allowed on MIXEDMAP. That comes with some extra
rules and restrictions to support arches without the special pte bit.
VM_IO | VM_PFNMAP further means that all the pfns in the VMA require
the use of io accessors (writel/readl) to access them.
No idea what VM_IO | VM_MIXEDMAP is supposed to mean. Only the special
ptes need io accessors?
In either case GUP doesn't really work on the VMA. PFNMAP is totally
blocked, and for MIXEDMAP userspace has no way to discover which
subset of the VMA is GUPable. I think that GUP is supported on
MIXEDMAP at all is a bit of a weirdo thing.
The main value of MIXEDMAP having access to the struct page is to
allow the page table itself to manage the life-cycle of memory that
was allocated specifically and only for the VMA. It will do fork and
unmap properly and free the memory at the right time.
If a driver is managing the memory lifecylce on its own and using zap
to clear the PTEs (because it must to manage the non-struct page PTE
life-cycle), then there isn't much point in using MIXEDMAP, IMHO. I
wonder if some of the GPU drivers are confused like this.
Jason
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 16:29 ` Jason Gunthorpe
@ 2025-05-28 17:14 ` Peter Xu
2025-05-28 17:34 ` Jason Gunthorpe
2025-05-28 17:37 ` David Hildenbrand
2025-05-28 17:32 ` David Hildenbrand
1 sibling, 2 replies; 23+ messages in thread
From: Peter Xu @ 2025-05-28 17:14 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Hildenbrand, Jinjiang Tu, akpm, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm,
wangkefeng.wang
On Wed, May 28, 2025 at 01:29:15PM -0300, Jason Gunthorpe wrote:
> On Wed, May 28, 2025 at 12:06:07PM -0400, Peter Xu wrote:
> > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> >
> > I'm not confident to blame any driver yet to have those special cases for
> > VM_PFNMAP, because it only says "managed without struct page", it didn't
> > say "it must not contain struct page".. Hence it hints the core mm "please
> > do not manage these mappings with struct page at all". Still sounds fair
> > contract, even if not ideal.
>
> I think it is pretty clear, if a VMA has VM_PFNMAP then nothing must
> ever try to obtain a struct page from any PTEs in it, for any reason,
> even if things in it might have a struct page. In practice it means
> nothing can call vm_normal_page() on a VM_PFNMAP.
>
> It would be nice to update the comment to make it clearer.
Yes that would help. Maybe the hard part is making sure how it is
documented will be how it is used..
>
> If the VMA owner wanted to permit access to the struct page then it
> should have used VM_MIXEDMAP.
>
> The fundamental difference between PFNMAP and MIXEDMAP is that
> vm_normal_page() is allowed on MIXEDMAP. That comes with some extra
> rules and restrictions to support arches without the special pte bit.
If in the ideal world where VM_PFNMAP has a stricter semantics, it sounds
fair to disable vm_normal_page() on top of VM_PFNMAP, yes.
>
> VM_IO | VM_PFNMAP further means that all the pfns in the VMA require
> the use of io accessors (writel/readl) to access them.
Hmm.. I'm not 100% sure on this one. E.g., vDSO is VM_IO now but it's
definitely accessible that got mapped into userspace. Same case when
e.g. mmap() VM_IO regions then accessing using the virtual addresses.
What you said sounds more like what __iomem declares rather than VM_IO.
But I confess I at least don't know why VM_IO existed, considering there're
also VM_*MAP and VM_DONTDUMP.
>
> No idea what VM_IO | VM_MIXEDMAP is supposed to mean. Only the special
> ptes need io accessors?
>
> In either case GUP doesn't really work on the VMA. PFNMAP is totally
> blocked, and for MIXEDMAP userspace has no way to discover which
> subset of the VMA is GUPable. I think that GUP is supported on
> MIXEDMAP at all is a bit of a weirdo thing.
Does it imply that in the ideal case one should use follow_pfnmap_start()
for MIXEDMAP?
I don't have a strong feeling yet on how GUP should treat MIXEDMAP, either
(1) fail MIXEDMAP like you said, falling back to follow_pfnmap_start(), or
(2) allow MIXEDMAP only on page-backed mappings, then fallback to
follow_pfnmap_start() on non-page-backed mappings only.
>
> The main value of MIXEDMAP having access to the struct page is to
> allow the page table itself to manage the life-cycle of memory that
> was allocated specifically and only for the VMA. It will do fork and
> unmap properly and free the memory at the right time.
Agree.
>
> If a driver is managing the memory lifecylce on its own and using zap
> to clear the PTEs (because it must to manage the non-struct page PTE
> life-cycle), then there isn't much point in using MIXEDMAP, IMHO. I
> wonder if some of the GPU drivers are confused like this.
>
> Jason
>
--
Peter Xu
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 16:29 ` Jason Gunthorpe
2025-05-28 17:14 ` Peter Xu
@ 2025-05-28 17:32 ` David Hildenbrand
2025-05-28 17:47 ` Jason Gunthorpe
1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 17:32 UTC (permalink / raw)
To: Jason Gunthorpe, Peter Xu
Cc: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, wangkefeng.wang
On 28.05.25 18:29, Jason Gunthorpe wrote:
> On Wed, May 28, 2025 at 12:06:07PM -0400, Peter Xu wrote:
>> #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
>>
>> I'm not confident to blame any driver yet to have those special cases for
>> VM_PFNMAP, because it only says "managed without struct page", it didn't
>> say "it must not contain struct page".. Hence it hints the core mm "please
>> do not manage these mappings with struct page at all". Still sounds fair
>> contract, even if not ideal.
>
> I think it is pretty clear, if a VMA has VM_PFNMAP then nothing must
> ever try to obtain a struct page from any PTEs in it, for any reason,
> even if things in it might have a struct page. In practice it means
> nothing can call vm_normal_page() on a VM_PFNMAP.
No, not until we remove any COW mappings of VM_PFNMAP.
But maybe I misunderstood what you mean.
>
> It would be nice to update the comment to make it clearer.
>
> If the VMA owner wanted to permit access to the struct page then it
> should have used VM_MIXEDMAP.
>
> The fundamental difference between PFNMAP and MIXEDMAP is that
> vm_normal_page() is allowed on MIXEDMAP. That comes with some extra
> rules and restrictions to support arches without the special pte bit.
> > VM_IO | VM_PFNMAP further means that all the pfns in the VMA require
> the use of io accessors (writel/readl) to access them.
> > No idea what VM_IO | VM_MIXEDMAP is supposed to mean. Only the special
> ptes need io accessors?
> > In either case GUP doesn't really work on the VMA. PFNMAP is totally
> blocked, and for MIXEDMAP userspace has no way to discover which
> subset of the VMA is GUPable. I think that GUP is supported on
> MIXEDMAP at all is a bit of a weirdo thing.
IIRC (after recent discussions with Lorenzo) there are use cases for that.
And there is no way to block GUP-fast either way without pte_special().
And pte_special() ... is not for refcounted pages.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 17:14 ` Peter Xu
@ 2025-05-28 17:34 ` Jason Gunthorpe
2025-05-28 17:37 ` David Hildenbrand
1 sibling, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2025-05-28 17:34 UTC (permalink / raw)
To: Peter Xu
Cc: David Hildenbrand, Jinjiang Tu, akpm, lorenzo.stoakes,
Liam.Howlett, vbabka, rppt, surenb, mhocko, linux-mm,
wangkefeng.wang
On Wed, May 28, 2025 at 01:14:58PM -0400, Peter Xu wrote:
> On Wed, May 28, 2025 at 01:29:15PM -0300, Jason Gunthorpe wrote:
> > On Wed, May 28, 2025 at 12:06:07PM -0400, Peter Xu wrote:
> > > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> > >
> > > I'm not confident to blame any driver yet to have those special cases for
> > > VM_PFNMAP, because it only says "managed without struct page", it didn't
> > > say "it must not contain struct page".. Hence it hints the core mm "please
> > > do not manage these mappings with struct page at all". Still sounds fair
> > > contract, even if not ideal.
> >
> > I think it is pretty clear, if a VMA has VM_PFNMAP then nothing must
> > ever try to obtain a struct page from any PTEs in it, for any reason,
> > even if things in it might have a struct page. In practice it means
> > nothing can call vm_normal_page() on a VM_PFNMAP.
> >
> > It would be nice to update the comment to make it clearer.
>
> Yes that would help. Maybe the hard part is making sure how it is
> documented will be how it is used..
>
> >
> > If the VMA owner wanted to permit access to the struct page then it
> > should have used VM_MIXEDMAP.
> >
> > The fundamental difference between PFNMAP and MIXEDMAP is that
> > vm_normal_page() is allowed on MIXEDMAP. That comes with some extra
> > rules and restrictions to support arches without the special pte bit.
>
> If in the ideal world where VM_PFNMAP has a stricter semantics, it sounds
> fair to disable vm_normal_page() on top of VM_PFNMAP, yes.
>
> >
> > VM_IO | VM_PFNMAP further means that all the pfns in the VMA require
> > the use of io accessors (writel/readl) to access them.
>
> Hmm.. I'm not 100% sure on this one. E.g., vDSO is VM_IO now but it's
> definitely accessible that got mapped into userspace.
That seems wrong...
It says it in the comment clearly:
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
"memory mapped i/o" is exactly __iomem.
At least this VDSO is kind of weird:
vma = _install_special_mapping(mm,
VDSO_VCLOCK_PAGES_START(addr),
VDSO_NR_VCLOCK_PAGES * PAGE_SIZE,
VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|
VM_PFNMAP|VM_SEALED_SYSMAP,
&vvar_vclock_mapping);
Because the things it puts into the VMA are not actually known to be
MMIO, they are both special hypervisor clock pages:
struct pvclock_vsyscall_time_info *pvti =
pvclock_get_pvti_cpu0_va();
unsigned long pfn = hv_get_tsc_pfn();
And they are *probably* ddr, but also x86 doesn't care about VM_IO, so
it doesn't matter if it is wrong.
I wonder if the vdso_install_vvar_mapping() one was blidly copied from x86:
pfn = __phys_to_pfn(__pa_symbol(vdso_k_time_data));
pfn = __phys_to_pfn(__pa_symbol(vdso_k_time_data));
pfn = __phys_to_pfn(__pa_symbol(vdso_k_rng_data));
pfn = __phys_to_pfn(__pa_symbol(vdso_k_arch_data)) +
Because __pa_symbol is definately not MMIO and should not have VM_IO.
> But I confess I at least don't know why VM_IO existed, considering there're
> also VM_*MAP and VM_DONTDUMP.
I've assumed it was for the various debugger/dump related paths to
prevent access to the memory and system crash. Some environments
cannot touch MMIO addresses without using readl/writel.
> > No idea what VM_IO | VM_MIXEDMAP is supposed to mean. Only the special
> > ptes need io accessors?
> >
> > In either case GUP doesn't really work on the VMA. PFNMAP is totally
> > blocked, and for MIXEDMAP userspace has no way to discover which
> > subset of the VMA is GUPable. I think that GUP is supported on
> > MIXEDMAP at all is a bit of a weirdo thing.
>
> Does it imply that in the ideal case one should use follow_pfnmap_start()
> for MIXEDMAP?
No
> I don't have a strong feeling yet on how GUP should treat MIXEDMAP, either
> (1) fail MIXEDMAP like you said, falling back to follow_pfnmap_start(), or
> (2) allow MIXEDMAP only on page-backed mappings, then fallback to
> follow_pfnmap_start() on non-page-backed mappings only.
GUP should follow the rules, it must use vm_normal_page() on each PTE
and if and only if a struct page is returned then it can be refcounted
and returned by GUP.
GUP should never ignore the special bit and convert a special PTE to a
struct page. The very definition of the special bit is that you cannot
do this.
Jason
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 17:14 ` Peter Xu
2025-05-28 17:34 ` Jason Gunthorpe
@ 2025-05-28 17:37 ` David Hildenbrand
1 sibling, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 17:37 UTC (permalink / raw)
To: Peter Xu, Jason Gunthorpe
Cc: Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, wangkefeng.wang
>> No idea what VM_IO | VM_MIXEDMAP is supposed to mean. Only the special
>> ptes need io accessors?
>>
>> In either case GUP doesn't really work on the VMA. PFNMAP is totally
>> blocked, and for MIXEDMAP userspace has no way to discover which
>> subset of the VMA is GUPable. I think that GUP is supported on
>> MIXEDMAP at all is a bit of a weirdo thing.
>
> Does it imply that in the ideal case one should use follow_pfnmap_start()
> for MIXEDMAP?
>
> I don't have a strong feeling yet on how GUP should treat MIXEDMAP, either
> (1) fail MIXEDMAP like you said, falling back to follow_pfnmap_start(), or
> (2) allow MIXEDMAP only on page-backed mappings, then fallback to
> follow_pfnmap_start() on non-page-backed mappings only.
I think GUP should GUP when there is a page to GUP.
If there is no page to GUP, then we have a PFNMAP and should fallback to
follow_pfnmap_start() indeed.
And now I realize that MIXEDMAP should indeed be working with
follow_pfnmap_start(). But I think it should only deal with actual
non-struct page thingies.
GUP not supporting COW'ed pages in a VM_PFNMAP is something we should
likely just fix. It would likely also fix the problem here, because the
anonymous page would simply be returned by GUP.
I mean, GUP-fast should work on COW'ed pages in a VM_PFNMAP already ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 17:32 ` David Hildenbrand
@ 2025-05-28 17:47 ` Jason Gunthorpe
2025-05-28 17:59 ` Jason Gunthorpe
2025-05-28 18:00 ` David Hildenbrand
0 siblings, 2 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2025-05-28 17:47 UTC (permalink / raw)
To: David Hildenbrand
Cc: Peter Xu, Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linux-mm, wangkefeng.wang
On Wed, May 28, 2025 at 07:32:59PM +0200, David Hildenbrand wrote:
> On 28.05.25 18:29, Jason Gunthorpe wrote:
> > On Wed, May 28, 2025 at 12:06:07PM -0400, Peter Xu wrote:
> > > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> > >
> > > I'm not confident to blame any driver yet to have those special cases for
> > > VM_PFNMAP, because it only says "managed without struct page", it didn't
> > > say "it must not contain struct page".. Hence it hints the core mm "please
> > > do not manage these mappings with struct page at all". Still sounds fair
> > > contract, even if not ideal.
> >
> > I think it is pretty clear, if a VMA has VM_PFNMAP then nothing must
> > ever try to obtain a struct page from any PTEs in it, for any reason,
> > even if things in it might have a struct page. In practice it means
> > nothing can call vm_normal_page() on a VM_PFNMAP.
>
> No, not until we remove any COW mappings of VM_PFNMAP.
I stand by the statement, the COW mapping thing is broken. Just
because it is wrong doesn't mean it is allowed to call
vm_normal_page() on VM_PFNMAP.
It makes no sense at all to have VM_PFNMAP where someone has COW'd the
pages and effectively turned it into a MIXEDMAP.
It should just be a MIXEDMAP at that point!!
IMHO we should have the core code complain with a dmesg for VM_PFNMAP
without VM_SHARED and get driver authors to fix it. Either add the
missed VM_SHARED or changed to MIXEDMAP. Make it WARN_ON in a few
years.
If they can't follow the rules required to use MIXEDMAP then they
certainly do not have working COW!!
> > In either case GUP doesn't really work on the VMA. PFNMAP is totally
> > blocked, and for MIXEDMAP userspace has no way to discover which
> > subset of the VMA is GUPable. I think that GUP is supported on
> > MIXEDMAP at all is a bit of a weirdo thing.
>
> IIRC (after recent discussions with Lorenzo) there are use cases for that.
Really? Somehow userspace can know which sub slice of the VMA is
GUP'able? Gross :P
> And there is no way to block GUP-fast either way without
> pte_special(). And pte_special() ... is not for refcounted pages.
Yes, on arches with the special bit GUP-fast is properly blocked. I
view those as the reference behavior. If non-special arches can't do
this check then it still not something that is uAPI because GUP-fast
could fallback unpredictably and go to GUP slow which will do the
check. Userspace can't rely on it to work!
Jason
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 17:47 ` Jason Gunthorpe
@ 2025-05-28 17:59 ` Jason Gunthorpe
2025-05-28 18:03 ` David Hildenbrand
2025-05-28 18:00 ` David Hildenbrand
1 sibling, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2025-05-28 17:59 UTC (permalink / raw)
To: David Hildenbrand
Cc: Peter Xu, Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linux-mm, wangkefeng.wang
On Wed, May 28, 2025 at 02:47:54PM -0300, Jason Gunthorpe wrote:
> On Wed, May 28, 2025 at 07:32:59PM +0200, David Hildenbrand wrote:
> > On 28.05.25 18:29, Jason Gunthorpe wrote:
> > > On Wed, May 28, 2025 at 12:06:07PM -0400, Peter Xu wrote:
> > > > #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
> > > >
> > > > I'm not confident to blame any driver yet to have those special cases for
> > > > VM_PFNMAP, because it only says "managed without struct page", it didn't
> > > > say "it must not contain struct page".. Hence it hints the core mm "please
> > > > do not manage these mappings with struct page at all". Still sounds fair
> > > > contract, even if not ideal.
> > >
> > > I think it is pretty clear, if a VMA has VM_PFNMAP then nothing must
> > > ever try to obtain a struct page from any PTEs in it, for any reason,
> > > even if things in it might have a struct page. In practice it means
> > > nothing can call vm_normal_page() on a VM_PFNMAP.
> >
> > No, not until we remove any COW mappings of VM_PFNMAP.
>
> I stand by the statement, the COW mapping thing is broken. Just
> because it is wrong doesn't mean it is allowed to call
> vm_normal_page() on VM_PFNMAP.
I guess I got this backwards in my memory, the vm_normal_page() thing
is specifically to fix the COW nastyness on VM_PFNMAP.
So it is OK to call vm_normal_page() on both kinds of VMAs and it will
return NULL if you are not allowed to touch the struct page.
This is still what GUP should be doing, vm_normal_page() and if no
struct page comes back then failure. It should never try to get a
struct page some other way.
Yes, GUP fast will be inconsistent with GUP slow on non-special
arches, but that is unfixable without taking a performance hit.
Jason
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 17:47 ` Jason Gunthorpe
2025-05-28 17:59 ` Jason Gunthorpe
@ 2025-05-28 18:00 ` David Hildenbrand
2025-05-28 18:15 ` Jason Gunthorpe
1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 18:00 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linux-mm, wangkefeng.wang
On 28.05.25 19:47, Jason Gunthorpe wrote:
> On Wed, May 28, 2025 at 07:32:59PM +0200, David Hildenbrand wrote:
>> On 28.05.25 18:29, Jason Gunthorpe wrote:
>>> On Wed, May 28, 2025 at 12:06:07PM -0400, Peter Xu wrote:
>>>> #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
>>>>
>>>> I'm not confident to blame any driver yet to have those special cases for
>>>> VM_PFNMAP, because it only says "managed without struct page", it didn't
>>>> say "it must not contain struct page".. Hence it hints the core mm "please
>>>> do not manage these mappings with struct page at all". Still sounds fair
>>>> contract, even if not ideal.
>>>
>>> I think it is pretty clear, if a VMA has VM_PFNMAP then nothing must
>>> ever try to obtain a struct page from any PTEs in it, for any reason,
>>> even if things in it might have a struct page. In practice it means
>>> nothing can call vm_normal_page() on a VM_PFNMAP.
>>
>> No, not until we remove any COW mappings of VM_PFNMAP.
>
> I stand by the statement, the COW mapping thing is broken. Just
> because it is wrong doesn't mean it is allowed to call
> vm_normal_page() on VM_PFNMAP.
>
> It makes no sense at all to have VM_PFNMAP where someone has COW'd the
> pages and effectively turned it into a MIXEDMAP.
>
Well, not quite.
MIXEDMAP is clearly documented that whatever has a refcount (struct
page) is refcounted. That's not the case for PFNMAP, obviously, where
nothing is refcounted, except ... anon pages in COW mappings.
Having refcounted anon page in a VM_PFNMAP doesn't suddenly turn the
whole thing into a MIXEDMAP where other things with a "struct page" are
suddenly refcounted as well.
> It should just be a MIXEDMAP at that point!!
>
> IMHO we should have the core code complain with a dmesg for
VM_PFNMAP> without VM_SHARED and get driver authors to fix it. Either
add the
> missed VM_SHARED or changed to MIXEDMAP. Make it WARN_ON in a few
> years.
>
> If they can't follow the rules required to use MIXEDMAP then they
> certainly do not have working COW!!
>
>>> In either case GUP doesn't really work on the VMA. PFNMAP is
totally>>> blocked, and for MIXEDMAP userspace has no way to discover which
>>> subset of the VMA is GUPable. I think that GUP is supported on
>>> MIXEDMAP at all is a bit of a weirdo thing.
>>
>> IIRC (after recent discussions with Lorenzo) there are use cases for that.
>
> Really? Somehow userspace can know which sub slice of the VMA is
> GUP'able? Gross :P
Well, the primary use case is mapping kernel allocations into user space
and GUP'ing them. So the pages are refcounted ...
MIXEDMAP is likely just abused for that ... so I take back that that's a
requirement, rather that MIXEDMAP is automatically used when using the
interfaces -- vm_insert_pages -- to insert refcounted pages into page
tables :)
... and that should definitely be changed, such that MIXEDMAP is really
only used when actually having non-refcounted things in there.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 17:59 ` Jason Gunthorpe
@ 2025-05-28 18:03 ` David Hildenbrand
0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 18:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linux-mm, wangkefeng.wang
On 28.05.25 19:59, Jason Gunthorpe wrote:
> On Wed, May 28, 2025 at 02:47:54PM -0300, Jason Gunthorpe wrote:
>> On Wed, May 28, 2025 at 07:32:59PM +0200, David Hildenbrand wrote:
>>> On 28.05.25 18:29, Jason Gunthorpe wrote:
>>>> On Wed, May 28, 2025 at 12:06:07PM -0400, Peter Xu wrote:
>>>>> #define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
>>>>>
>>>>> I'm not confident to blame any driver yet to have those special cases for
>>>>> VM_PFNMAP, because it only says "managed without struct page", it didn't
>>>>> say "it must not contain struct page".. Hence it hints the core mm "please
>>>>> do not manage these mappings with struct page at all". Still sounds fair
>>>>> contract, even if not ideal.
>>>>
>>>> I think it is pretty clear, if a VMA has VM_PFNMAP then nothing must
>>>> ever try to obtain a struct page from any PTEs in it, for any reason,
>>>> even if things in it might have a struct page. In practice it means
>>>> nothing can call vm_normal_page() on a VM_PFNMAP.
>>>
>>> No, not until we remove any COW mappings of VM_PFNMAP.
>>
>> I stand by the statement, the COW mapping thing is broken. Just
>> because it is wrong doesn't mean it is allowed to call
>> vm_normal_page() on VM_PFNMAP.
>
> I guess I got this backwards in my memory, the vm_normal_page() thing
> is specifically to fix the COW nastyness on VM_PFNMAP.
>
> So it is OK to call vm_normal_page() on both kinds of VMAs and it will
> return NULL if you are not allowed to touch the struct page.
>
> This is still what GUP should be doing, vm_normal_page() and if no
> struct page comes back then failure. It should never try to get a
> struct page some other way.
Exactly. And we can even do that for VM_PFNMAP and just have COW
VM_PFNMAP supported in GUP, likely resolving the issue here.
Should be fairly easy ...
>
> Yes, GUP fast will be inconsistent with GUP slow on non-special
> arches, but that is unfixable without taking a performance hit.
IIRC, GUP-fast requires PTE-special.
So it should then all work as expected ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 18:00 ` David Hildenbrand
@ 2025-05-28 18:15 ` Jason Gunthorpe
2025-05-28 18:22 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2025-05-28 18:15 UTC (permalink / raw)
To: David Hildenbrand
Cc: Peter Xu, Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linux-mm, wangkefeng.wang
On Wed, May 28, 2025 at 08:00:34PM +0200, David Hildenbrand wrote:
> Having refcounted anon page in a VM_PFNMAP doesn't suddenly turn the whole
> thing into a MIXEDMAP where other things with a "struct page" are suddenly
> refcounted as well.
The special COW rules for PFNMAP require a single linear mapping so
that vma->vm_pgoff can encode the "special" range. So the only way to
get struct pages in a PFNMAP that also satisfy the COW rule is if you
have a single contiguous chunk of them.
Otherwise they turn into MIXEDMAP on non-special arches.
Which is a little detail I wonder how many DRM drivers trip up on :|
> ... and that should definitely be changed, such that MIXEDMAP is really only
> used when actually having non-refcounted things in there.
MIXEDMAP has to be used any time you can't meet the special rules that
remap_pfn_range() sets for PFNMAP.
Really I think we have three special cases, not just two:
VM_MIXEDMAP - The ptes can hold struct pages and if they do the refcount is
managed by the core code
VM_PFNMAP|VM_SHARED - The ptes can never be converted into struct pages, nothing
can ever refcount them and you can use fault/etc to store
things
VM_PFNMAP|!VM_SHARED - Must only be created by remap_pfn_range() and can store
exactly 1 contiguous range of address that cannot be converted to
struct pages and will never be refcounted. Probably can't
use fault
Most of my remarks are about the VM_PFNMAP|VM_SHARED case
For arches with the special bit VM_PFNMAP|!VM_SHARED == VM_MIXEDMAP
Jason
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 18:15 ` Jason Gunthorpe
@ 2025-05-28 18:22 ` David Hildenbrand
2025-05-28 18:29 ` Jason Gunthorpe
0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-05-28 18:22 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linux-mm, wangkefeng.wang
On 28.05.25 20:15, Jason Gunthorpe wrote:
> On Wed, May 28, 2025 at 08:00:34PM +0200, David Hildenbrand wrote:
>
>> Having refcounted anon page in a VM_PFNMAP doesn't suddenly turn the whole
>> thing into a MIXEDMAP where other things with a "struct page" are suddenly
>> refcounted as well.
>
> The special COW rules for PFNMAP require a single linear mapping so
> that vma->vm_pgoff can encode the "special" range.
Yes, I have a semi-ugly patch to adjust remove that special casing.
(less ugly than current handling: *still* lookup the struct page, and if
it references an anonymous folio, we ... have an anonymous folio.
Because nobody should ever,ever,ever PFNMAP an anonymous folio. Ever.)
So the only way to
> get struct pages in a PFNMAP that also satisfy the COW rule is if you
> have a single contiguous chunk of them.
In remap_pfn_range_internal() we sanity-check that right now. And fail
if it is not in place.
Any COW VM_PFNMAP that doesn't go through remap_pfn_range() would be ...
broken. I don't think we have them.
>
> Otherwise they turn into MIXEDMAP on non-special arches.
>
> Which is a little detail I wonder how many DRM drivers trip up on :|
>
>> ... and that should definitely be changed, such that MIXEDMAP is really only
>> used when actually having non-refcounted things in there.
>
> MIXEDMAP has to be used any time you can't meet the special rules that
> remap_pfn_range() sets for PFNMAP.
>
> Really I think we have three special cases, not just two:
>
> VM_MIXEDMAP - The ptes can hold struct pages and if they do the refcount is
> managed by the core code
>
> VM_PFNMAP|VM_SHARED - The ptes can never be converted into struct pages, nothing
> can ever refcount them and you can use fault/etc to store
> things
>
> VM_PFNMAP|!VM_SHARED - Must only be created by remap_pfn_range() and can store
> exactly 1 contiguous range of address that cannot be converted to
> struct pages and will never be refcounted. Probably can't
> use fault
Yes.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 18:22 ` David Hildenbrand
@ 2025-05-28 18:29 ` Jason Gunthorpe
2025-05-30 10:04 ` David Hildenbrand
0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2025-05-28 18:29 UTC (permalink / raw)
To: David Hildenbrand
Cc: Peter Xu, Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linux-mm, wangkefeng.wang
On Wed, May 28, 2025 at 08:22:08PM +0200, David Hildenbrand wrote:
> On 28.05.25 20:15, Jason Gunthorpe wrote:
> > On Wed, May 28, 2025 at 08:00:34PM +0200, David Hildenbrand wrote:
> >
> > > Having refcounted anon page in a VM_PFNMAP doesn't suddenly turn the whole
> > > thing into a MIXEDMAP where other things with a "struct page" are suddenly
> > > refcounted as well.
> >
> > The special COW rules for PFNMAP require a single linear mapping so
> > that vma->vm_pgoff can encode the "special" range.
>
> Yes, I have a semi-ugly patch to adjust remove that special casing.
>
> (less ugly than current handling: *still* lookup the struct page, and if it
> references an anonymous folio, we ... have an anonymous folio. Because
> nobody should ever,ever,ever PFNMAP an anonymous folio. Ever.)
Hmm, maybe so.
> In remap_pfn_range_internal() we sanity-check that right now. And fail if it
> is not in place.
>
> Any COW VM_PFNMAP that doesn't go through remap_pfn_range() would be ...
> broken. I don't think we have them.
It is a really easy driver mistake to make though, you have to omit
the rejection of !VM_SHARED, then go on and do something else.
How many drivers use VM_PFNMAP without remap_pfn_range() and don't
check for VM_SHARED?
From a driver perspective it would be much easier to understand if the
driver could just tell the core code what functions it wants to use
if (vma_prepare(vma, USE_REMAP_PFN_RANGE |
USE_IO_REMAP_PFN_RANGE |
USE_VMF_INSERT_PAGE |
USE_VMF_INSERT_IO_PFN |
[.. etc ..]))
Then the above would set the VM flags properly and fail in cases that
are wrong. Like you can't do USE_VMF_INSERT_IO_PFN without VM_SHARED.
No more messes of drivers doing random and confusing mixtures of
VM_XX.
So much easier to audit if the drivers are doing the right stuff
because we just check what mm functions they call against the bit mask
they provide. A debug kernel could enforce the check.
Then we can say with clarity that VM_XX flags are used with various
insertion functions exclusively and know alot better what it is they
mean.
Jason
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] mm: fix COW mapping handing in generic_access_phys
2025-05-28 18:29 ` Jason Gunthorpe
@ 2025-05-30 10:04 ` David Hildenbrand
0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-05-30 10:04 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Peter Xu, Jinjiang Tu, akpm, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linux-mm, wangkefeng.wang
On 28.05.25 20:29, Jason Gunthorpe wrote:
> On Wed, May 28, 2025 at 08:22:08PM +0200, David Hildenbrand wrote:
>> On 28.05.25 20:15, Jason Gunthorpe wrote:
>>> On Wed, May 28, 2025 at 08:00:34PM +0200, David Hildenbrand wrote:
>>>
>>>> Having refcounted anon page in a VM_PFNMAP doesn't suddenly turn the whole
>>>> thing into a MIXEDMAP where other things with a "struct page" are suddenly
>>>> refcounted as well.
>>>
>>> The special COW rules for PFNMAP require a single linear mapping so
>>> that vma->vm_pgoff can encode the "special" range.
>>
>> Yes, I have a semi-ugly patch to adjust remove that special casing.
>>
>> (less ugly than current handling: *still* lookup the struct page, and if it
>> references an anonymous folio, we ... have an anonymous folio. Because
>> nobody should ever,ever,ever PFNMAP an anonymous folio. Ever.)
>
> Hmm, maybe so.
I'll try to send out a prototype of that patch once I managed to ... finish
never ending stream of review :)
>
>> In remap_pfn_range_internal() we sanity-check that right now. And fail if it
>> is not in place.
>>
>> Any COW VM_PFNMAP that doesn't go through remap_pfn_range() would be ...
>> broken. I don't think we have them.
>
> It is a really easy driver mistake to make though, you have to omit
> the rejection of !VM_SHARED, then go on and do something else.
As I just reviewed related code, I think when not using remap_pfn_range(), a
driver -- like vfio -- will be using vmf_insert_pfn().
There, we sanity check it as well:
vmf_insert_pfn()->vmf_insert_pfn_prot()
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
>
> How many drivers use VM_PFNMAP without remap_pfn_range() and don't
> check for VM_SHARED?
Not sure, only a handful even use vmf_insert_pfn(), and the BUG_ON would at
least catch it.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2025-05-30 10:05 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-28 1:56 [PATCH] mm: fix COW mapping handing in generic_access_phys Jinjiang Tu
2025-05-28 8:59 ` David Hildenbrand
2025-05-28 9:59 ` David Hildenbrand
2025-05-28 12:14 ` Jinjiang Tu
2025-05-28 14:54 ` Peter Xu
2025-05-28 15:02 ` David Hildenbrand
2025-05-28 15:25 ` Peter Xu
2025-05-28 15:29 ` David Hildenbrand
2025-05-28 16:06 ` Peter Xu
2025-05-28 16:29 ` Jason Gunthorpe
2025-05-28 17:14 ` Peter Xu
2025-05-28 17:34 ` Jason Gunthorpe
2025-05-28 17:37 ` David Hildenbrand
2025-05-28 17:32 ` David Hildenbrand
2025-05-28 17:47 ` Jason Gunthorpe
2025-05-28 17:59 ` Jason Gunthorpe
2025-05-28 18:03 ` David Hildenbrand
2025-05-28 18:00 ` David Hildenbrand
2025-05-28 18:15 ` Jason Gunthorpe
2025-05-28 18:22 ` David Hildenbrand
2025-05-28 18:29 ` Jason Gunthorpe
2025-05-30 10:04 ` David Hildenbrand
2025-05-28 12:13 ` Jinjiang Tu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).