* [PATCH 0/2] Fix high-order allocations for AUX space
@ 2023-06-12  5:24 Shuai Xue
  2023-06-12  5:24 ` [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound Shuai Xue
  2023-06-12  5:24 ` [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit Shuai Xue
  0 siblings, 2 replies; 14+ messages in thread
From: Shuai Xue @ 2023-06-12  5:24 UTC (permalink / raw)
  To: alexander.shishkin, peterz
  Cc: mingo, acme, mark.rutland, jolsa, namhyung, irogers,
	adrian.hunter, linux-perf-users, linux-kernel, bpf
When perf-record with large AUX area, it reveals WARNINGs with __alloc_pages.
Fix with correct MAX_ORDER limit to request higher order allocations so that
larger contiguous areas is allocated. 
Shuai Xue (2):
  perf/core: Bail out early if the request AUX area is out of bound
  perf/ring_buffer: Fix high-order allocations for AUX space with
    correct MAX_ORDER limit
 kernel/events/core.c                     | 10 ++++++++++
 kernel/events/ring_buffer.c              |  4 ++--
 tools/perf/Documentation/perf-record.txt |  3 ++-
 3 files changed, 14 insertions(+), 3 deletions(-)
-- 
1.8.3.1
^ permalink raw reply	[flat|nested] 14+ messages in thread* [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound 2023-06-12 5:24 [PATCH 0/2] Fix high-order allocations for AUX space Shuai Xue @ 2023-06-12 5:24 ` Shuai Xue 2023-06-12 7:38 ` Leo Yan 2023-06-12 5:24 ` [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit Shuai Xue 1 sibling, 1 reply; 14+ messages in thread From: Shuai Xue @ 2023-06-12 5:24 UTC (permalink / raw) To: alexander.shishkin, peterz Cc: mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf When perf-record with a large AUX area, e.g 4GB, it fails with: #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 failed to mmap with 12 (Cannot allocate memory) and it reveals a WARNING with __alloc_pages: [ 66.595604] ------------[ cut here ]------------ [ 66.600206] WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 [ 66.608375] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) aes_ce_blk(E) vfat(E) fat(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E) [ 66.657719] CPU: 44 PID: 17573 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #58 [ 66.666749] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023 [ 66.674650] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 66.681597] pc : __alloc_pages+0x1ec/0x248 [ 66.685680] lr : __kmalloc_large_node+0xc0/0x1f8 [ 66.690285] sp : ffff800020523980 [ 66.693585] pmr_save: 000000e0 [ 66.696624] x29: ffff800020523980 x28: ffff000832975800 x27: 0000000000000000 [ 66.703746] x26: 0000000000100000 x25: 0000000000100000 x24: ffff8000083615d0 [ 66.710866] x23: 0000000000040dc0 x22: ffff000823d6d140 x21: 000000000000000b [ 66.717987] x20: 000000000000000b x19: 0000000000000000 x18: 0000000000000030 [ 66.725108] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff000823d6d6d0 [ 66.732229] x14: 0000000000000000 x13: 343373656761705f x12: 726e202c30206574 [ 66.739350] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffff8000083af570 [ 66.746471] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 000000000005fff4 [ 66.753592] x5 : 0000000000000000 x4 : ffff000823d6d8d8 x3 : 0000000000000000 [ 66.760713] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0 [ 66.767834] Call trace: [ 66.770267] __alloc_pages+0x1ec/0x248 [ 66.774003] __kmalloc_large_node+0xc0/0x1f8 [ 66.778259] __kmalloc_node+0x134/0x1e8 [ 66.782081] rb_alloc_aux+0xe0/0x298 [ 66.785643] perf_mmap+0x440/0x660 [ 66.789031] mmap_region+0x308/0x8a8 [ 66.792593] do_mmap+0x3c0/0x528 [ 66.795807] vm_mmap_pgoff+0xf4/0x1b8 [ 66.799456] ksys_mmap_pgoff+0x18c/0x218 [ 66.803365] __arm64_sys_mmap+0x38/0x58 [ 66.807187] invoke_syscall+0x50/0x128 [ 66.810922] el0_svc_common.constprop.0+0x58/0x188 [ 66.815698] do_el0_svc+0x34/0x50 [ 66.818999] el0_svc+0x34/0x108 [ 66.822127] el0t_64_sync_handler+0xb8/0xc0 [ 66.826296] el0t_64_sync+0x1a4/0x1a8 [ 66.829946] ---[ end trace 0000000000000000 ]--- The pages for AUX area are organized as rb->aux_pages[] which alloced by kcalloc_node() later. The kcalloc() family guarantees the pages are physically contiguous (and virtually contiguous) with an order of MAX_ORDER - 1 at maximum. So bail out early with -EINVAL if the request AUX area is out of bound, e.g.: #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 failed to mmap with 22 (Invalid argument) Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> --- kernel/events/core.c | 10 ++++++++++ tools/perf/Documentation/perf-record.txt | 3 ++- 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 435815d..83d4e29 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6404,6 +6404,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma) return -EINVAL; nr_pages = vma_size / PAGE_SIZE; + /* + * The pages for AUX area are organized as rb->aux_pages[] + * which alloced by kcalloc_node() later. The kcalloc() family + * guarantees the pages are physically contiguous (and + * virtually contiguous) with an order of MAX_ORDER - 1 at + * maximum MAX_ORDER. So bail out early if the request AUX area + * is out of bound. + */ + if (get_order(nr_pages * sizeof(void *)) >= MAX_ORDER) + return -EINVAL; mutex_lock(&event->mmap_mutex); ret = -EINVAL; diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt index ff815c2..a50a426 100644 --- a/tools/perf/Documentation/perf-record.txt +++ b/tools/perf/Documentation/perf-record.txt @@ -236,7 +236,8 @@ OPTIONS specification with appended unit character - B/K/M/G. The size is rounded up to have nearest pages power of two value. Also, by adding a comma, the number of mmap pages for AUX - area tracing can be specified. + area tracing can be specified. With MAX_ORDER set as 11, the + maximum AUX area is limit to 2GB. -g:: Enables call-graph (stack chain/backtrace) recording for both -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound 2023-06-12 5:24 ` [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound Shuai Xue @ 2023-06-12 7:38 ` Leo Yan 2023-06-12 8:35 ` Shuai Xue 2023-06-12 8:47 ` James Clark 0 siblings, 2 replies; 14+ messages in thread From: Leo Yan @ 2023-06-12 7:38 UTC (permalink / raw) To: Shuai Xue Cc: alexander.shishkin, peterz, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf Hi Shuai, On Mon, Jun 12, 2023 at 01:24:51PM +0800, Shuai Xue wrote: > When perf-record with a large AUX area, e.g 4GB, it fails with: > > #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 > failed to mmap with 12 (Cannot allocate memory) > > and it reveals a WARNING with __alloc_pages: > > [ 66.595604] ------------[ cut here ]------------ > [ 66.600206] WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 > [ 66.608375] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) aes_ce_blk(E) vfat(E) fat(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E) > [ 66.657719] CPU: 44 PID: 17573 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #58 > [ 66.666749] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023 > [ 66.674650] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) > [ 66.681597] pc : __alloc_pages+0x1ec/0x248 > [ 66.685680] lr : __kmalloc_large_node+0xc0/0x1f8 > [ 66.690285] sp : ffff800020523980 > [ 66.693585] pmr_save: 000000e0 > [ 66.696624] x29: ffff800020523980 x28: ffff000832975800 x27: 0000000000000000 > [ 66.703746] x26: 0000000000100000 x25: 0000000000100000 x24: ffff8000083615d0 > [ 66.710866] x23: 0000000000040dc0 x22: ffff000823d6d140 x21: 000000000000000b > [ 66.717987] x20: 000000000000000b x19: 0000000000000000 x18: 0000000000000030 > [ 66.725108] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff000823d6d6d0 > [ 66.732229] x14: 0000000000000000 x13: 343373656761705f x12: 726e202c30206574 > [ 66.739350] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffff8000083af570 > [ 66.746471] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 000000000005fff4 > [ 66.753592] x5 : 0000000000000000 x4 : ffff000823d6d8d8 x3 : 0000000000000000 > [ 66.760713] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0 > [ 66.767834] Call trace: > [ 66.770267] __alloc_pages+0x1ec/0x248 > [ 66.774003] __kmalloc_large_node+0xc0/0x1f8 > [ 66.778259] __kmalloc_node+0x134/0x1e8 > [ 66.782081] rb_alloc_aux+0xe0/0x298 > [ 66.785643] perf_mmap+0x440/0x660 > [ 66.789031] mmap_region+0x308/0x8a8 > [ 66.792593] do_mmap+0x3c0/0x528 > [ 66.795807] vm_mmap_pgoff+0xf4/0x1b8 > [ 66.799456] ksys_mmap_pgoff+0x18c/0x218 > [ 66.803365] __arm64_sys_mmap+0x38/0x58 > [ 66.807187] invoke_syscall+0x50/0x128 > [ 66.810922] el0_svc_common.constprop.0+0x58/0x188 > [ 66.815698] do_el0_svc+0x34/0x50 > [ 66.818999] el0_svc+0x34/0x108 > [ 66.822127] el0t_64_sync_handler+0xb8/0xc0 > [ 66.826296] el0t_64_sync+0x1a4/0x1a8 > [ 66.829946] ---[ end trace 0000000000000000 ]--- > > The pages for AUX area are organized as rb->aux_pages[] which alloced by > kcalloc_node() later. The kcalloc() family guarantees the pages are > physically contiguous (and virtually contiguous) with an order of > MAX_ORDER - 1 at maximum. This description is incorrect. We need to distinguish two things: AUX trace pages and 'rb->aux_pages' is pointer array which is used to maintains these page. Here, the kernel oops reports the error is not for AUX trace pages but for failing to allocate the pointer array from slab (or slub) area. Furthermore, I believe the AUX trace pages are only mapped for VMA (continuous virtual address), the kernel will defer to map to physical pages (which means it's not necessarily continuous physical pages) when handling data abort caused by accessing the pages. When you specify the AUX buffer size to 4GiB, the kernel will convert it to page numbers (page size is 4KiB, page number is = 4GiB / 4KiB = 1MiB). Since aarch64's pointer type's length is 8 bytes, thus we need to allocate the 8MiB buffer from slab/slub, unfortunately, 8MiB crosses the limitation set by MAX_ORDER (4KiB ^ (MAX_ORDER - 1) = 4MiB), this is why we receive the oops from __alloc_pages(). > So bail out early with -EINVAL if the request AUX area is out of bound, > e.g.: > > #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 > failed to mmap with 22 (Invalid argument) > > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> > --- > kernel/events/core.c | 10 ++++++++++ > tools/perf/Documentation/perf-record.txt | 3 ++- > 2 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 435815d..83d4e29 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -6404,6 +6404,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma) > return -EINVAL; > > nr_pages = vma_size / PAGE_SIZE; > + /* > + * The pages for AUX area are organized as rb->aux_pages[] > + * which alloced by kcalloc_node() later. The kcalloc() family > + * guarantees the pages are physically contiguous (and > + * virtually contiguous) with an order of MAX_ORDER - 1 at > + * maximum MAX_ORDER. So bail out early if the request AUX area > + * is out of bound. > + */ > + if (get_order(nr_pages * sizeof(void *)) >= MAX_ORDER) > + return -EINVAL; From my view, now kernel has handled this case (I agree it might be not directive for outputting error or warning info rather than oops). If we really want this checking, I'd like to add it in rb_alloc_aux(), since rb_alloc_aux() is the place for allocation the memory, thus it's right place for the checking memory limitation. And you might need to consider the update the comments to avoid confusion. I am not the best person for the decision, I'd like to leave it to perf maintainers and wait for their thoughts. Thanks, Leo > mutex_lock(&event->mmap_mutex); > ret = -EINVAL; > diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt > index ff815c2..a50a426 100644 > --- a/tools/perf/Documentation/perf-record.txt > +++ b/tools/perf/Documentation/perf-record.txt > @@ -236,7 +236,8 @@ OPTIONS > specification with appended unit character - B/K/M/G. The > size is rounded up to have nearest pages power of two value. > Also, by adding a comma, the number of mmap pages for AUX > - area tracing can be specified. > + area tracing can be specified. With MAX_ORDER set as 11, the > + maximum AUX area is limit to 2GB. > > -g:: > Enables call-graph (stack chain/backtrace) recording for both > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound 2023-06-12 7:38 ` Leo Yan @ 2023-06-12 8:35 ` Shuai Xue 2023-06-12 10:05 ` Leo Yan 2023-06-12 8:47 ` James Clark 1 sibling, 1 reply; 14+ messages in thread From: Shuai Xue @ 2023-06-12 8:35 UTC (permalink / raw) To: Leo Yan Cc: alexander.shishkin, peterz, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf, Baolin Wang On 2023/6/12 15:38, Leo Yan wrote: > Hi Shuai, > > On Mon, Jun 12, 2023 at 01:24:51PM +0800, Shuai Xue wrote: >> When perf-record with a large AUX area, e.g 4GB, it fails with: >> >> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 >> failed to mmap with 12 (Cannot allocate memory) >> >> and it reveals a WARNING with __alloc_pages: >> >> [ 66.595604] ------------[ cut here ]------------ >> [ 66.600206] WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 >> [ 66.608375] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) aes_ce_blk(E) vfat(E) fat(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E) >> [ 66.657719] CPU: 44 PID: 17573 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #58 >> [ 66.666749] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023 >> [ 66.674650] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) >> [ 66.681597] pc : __alloc_pages+0x1ec/0x248 >> [ 66.685680] lr : __kmalloc_large_node+0xc0/0x1f8 >> [ 66.690285] sp : ffff800020523980 >> [ 66.693585] pmr_save: 000000e0 >> [ 66.696624] x29: ffff800020523980 x28: ffff000832975800 x27: 0000000000000000 >> [ 66.703746] x26: 0000000000100000 x25: 0000000000100000 x24: ffff8000083615d0 >> [ 66.710866] x23: 0000000000040dc0 x22: ffff000823d6d140 x21: 000000000000000b >> [ 66.717987] x20: 000000000000000b x19: 0000000000000000 x18: 0000000000000030 >> [ 66.725108] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff000823d6d6d0 >> [ 66.732229] x14: 0000000000000000 x13: 343373656761705f x12: 726e202c30206574 >> [ 66.739350] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffff8000083af570 >> [ 66.746471] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 000000000005fff4 >> [ 66.753592] x5 : 0000000000000000 x4 : ffff000823d6d8d8 x3 : 0000000000000000 >> [ 66.760713] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0 >> [ 66.767834] Call trace: >> [ 66.770267] __alloc_pages+0x1ec/0x248 >> [ 66.774003] __kmalloc_large_node+0xc0/0x1f8 >> [ 66.778259] __kmalloc_node+0x134/0x1e8 >> [ 66.782081] rb_alloc_aux+0xe0/0x298 >> [ 66.785643] perf_mmap+0x440/0x660 >> [ 66.789031] mmap_region+0x308/0x8a8 >> [ 66.792593] do_mmap+0x3c0/0x528 >> [ 66.795807] vm_mmap_pgoff+0xf4/0x1b8 >> [ 66.799456] ksys_mmap_pgoff+0x18c/0x218 >> [ 66.803365] __arm64_sys_mmap+0x38/0x58 >> [ 66.807187] invoke_syscall+0x50/0x128 >> [ 66.810922] el0_svc_common.constprop.0+0x58/0x188 >> [ 66.815698] do_el0_svc+0x34/0x50 >> [ 66.818999] el0_svc+0x34/0x108 >> [ 66.822127] el0t_64_sync_handler+0xb8/0xc0 >> [ 66.826296] el0t_64_sync+0x1a4/0x1a8 >> [ 66.829946] ---[ end trace 0000000000000000 ]--- >> >> The pages for AUX area are organized as rb->aux_pages[] which alloced by >> kcalloc_node() later. The kcalloc() family guarantees the pages are >> physically contiguous (and virtually contiguous) with an order of >> MAX_ORDER - 1 at maximum. > > This description is incorrect. We need to distinguish two things: > > AUX trace pages and 'rb->aux_pages' is pointer array which is used to > maintains these page. Here, the kernel oops reports the error is not > for AUX trace pages but for failing to allocate the pointer array from > slab (or slub) area. You are right, 'rb->aux_pages' is a pointer array of AUX trace pages, I intend to mean that "The pages for AUX area are organized as rb->aux_pages[]" and it fails to allocate a continuous physical for 'rb->aux_pages'. > > Furthermore, I believe the AUX trace pages are only mapped for VMA > (continuous virtual address), the kernel will defer to map to physical > pages (which means it's not necessarily continuous physical pages) > when handling data abort caused by accessing the pages. I don't know why the rb->aux_pages is limit to allocated with continuous physical pages. so I just add a check to avoid oops and report a proper error code -EINVAL to user. I would like to use vmalloc() family to replace kmalloc() so that we could support allocate a more large AUX area if it is not necessarily continuous physical pages. Should we remove the restriction? > > When you specify the AUX buffer size to 4GiB, the kernel will convert it > to page numbers (page size is 4KiB, page number is = 4GiB / 4KiB = 1MiB). > Since aarch64's pointer type's length is 8 bytes, thus we need to > allocate the 8MiB buffer from slab/slub, unfortunately, 8MiB crosses the > limitation set by MAX_ORDER (4KiB ^ (MAX_ORDER - 1) = 4MiB), this is > why we receive the oops from __alloc_pages(). Exactly, the oops is due to allocating of the pointer array rb->aux_pages. As the lines in doc I added, the max order page (order 10, 4MiB) of the pointer array supports to hold 2GiB AUX pages. > >> So bail out early with -EINVAL if the request AUX area is out of bound, >> e.g.: >> >> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 >> failed to mmap with 22 (Invalid argument) >> >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >> --- >> kernel/events/core.c | 10 ++++++++++ >> tools/perf/Documentation/perf-record.txt | 3 ++- >> 2 files changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> index 435815d..83d4e29 100644 >> --- a/kernel/events/core.c >> +++ b/kernel/events/core.c >> @@ -6404,6 +6404,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma) >> return -EINVAL; >> >> nr_pages = vma_size / PAGE_SIZE; >> + /* >> + * The pages for AUX area are organized as rb->aux_pages[] >> + * which alloced by kcalloc_node() later. The kcalloc() family >> + * guarantees the pages are physically contiguous (and >> + * virtually contiguous) with an order of MAX_ORDER - 1 at >> + * maximum MAX_ORDER. So bail out early if the request AUX area >> + * is out of bound. >> + */ >> + if (get_order(nr_pages * sizeof(void *)) >= MAX_ORDER) >> + return -EINVAL; > > From my view, now kernel has handled this case (I agree it might be > not directive for outputting error or warning info rather than oops). > > If we really want this checking, I'd like to add it in rb_alloc_aux(), > since rb_alloc_aux() is the place for allocation the memory, thus it's > right place for the checking memory limitation. Got it. If we should have physical continuous for rb->aux_pages, I'd like to move this check to rb_alloc_aux(). > And you might need to > consider the update the comments to avoid confusion. Ok, I will update the commit log and comments to make it more clear. > > I am not the best person for the decision, I'd like to leave it to perf > maintainers and wait for their thoughts. Haha, let's waiting for other thoughts. > > Thanks, > Leo Thank you for your quick reply and valuable comments. Best Regards, Shuai > >> mutex_lock(&event->mmap_mutex); >> ret = -EINVAL; >> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt >> index ff815c2..a50a426 100644 >> --- a/tools/perf/Documentation/perf-record.txt >> +++ b/tools/perf/Documentation/perf-record.txt >> @@ -236,7 +236,8 @@ OPTIONS >> specification with appended unit character - B/K/M/G. The >> size is rounded up to have nearest pages power of two value. >> Also, by adding a comma, the number of mmap pages for AUX >> - area tracing can be specified. >> + area tracing can be specified. With MAX_ORDER set as 11, the >> + maximum AUX area is limit to 2GB. >> >> -g:: >> Enables call-graph (stack chain/backtrace) recording for both >> -- >> 1.8.3.1 >> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound 2023-06-12 8:35 ` Shuai Xue @ 2023-06-12 10:05 ` Leo Yan 2023-06-12 10:25 ` Leo Yan 0 siblings, 1 reply; 14+ messages in thread From: Leo Yan @ 2023-06-12 10:05 UTC (permalink / raw) To: Shuai Xue Cc: alexander.shishkin, peterz, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf, Baolin Wang On Mon, Jun 12, 2023 at 04:35:07PM +0800, Shuai Xue wrote: [...] > > Furthermore, I believe the AUX trace pages are only mapped for VMA > > (continuous virtual address), the kernel will defer to map to physical > > pages (which means it's not necessarily continuous physical pages) > > when handling data abort caused by accessing the pages. > > I don't know why the rb->aux_pages is limit to allocated with continuous physical pages. > so I just add a check to avoid oops and report a proper error code -EINVAL to > user. > > I would like to use vmalloc() family to replace kmalloc() so that we could support > allocate a more large AUX area if it is not necessarily continuous physical pages. > Should we remove the restriction? As you said, we are now able to support a maximum AUX trace buffer size of up to 2GiB, and AUX trace buffer is per CPU wise. Seems to me, 2GiB AUX buffer per CPU is big enough for most tracing scenarios, right? Except you can provide profiling scenario which must use bigger buffer size. Another factor is the allocation of buffers from kmalloc() offers better performance compared to allocation from vmalloc(), this is also important for perf core layer. Thanks, Leo ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound 2023-06-12 10:05 ` Leo Yan @ 2023-06-12 10:25 ` Leo Yan 2023-06-13 3:00 ` Shuai Xue 0 siblings, 1 reply; 14+ messages in thread From: Leo Yan @ 2023-06-12 10:25 UTC (permalink / raw) To: Shuai Xue Cc: alexander.shishkin, peterz, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf, Baolin Wang On Mon, Jun 12, 2023 at 06:05:02PM +0800, Leo Yan wrote: > On Mon, Jun 12, 2023 at 04:35:07PM +0800, Shuai Xue wrote: > > [...] > > > > Furthermore, I believe the AUX trace pages are only mapped for VMA > > > (continuous virtual address), the kernel will defer to map to physical > > > pages (which means it's not necessarily continuous physical pages) > > > when handling data abort caused by accessing the pages. > > > > I don't know why the rb->aux_pages is limit to allocated with continuous physical pages. > > so I just add a check to avoid oops and report a proper error code -EINVAL to > > user. > > > > I would like to use vmalloc() family to replace kmalloc() so that we could support > > allocate a more large AUX area if it is not necessarily continuous physical pages. > > Should we remove the restriction? > > As you said, we are now able to support a maximum AUX trace buffer > size of up to 2GiB, and AUX trace buffer is per CPU wise. Ouch, I reviewed my notes and correct myself: For per thread mode, perf tool only allocates one generic ring buffer and one AUX ring buffer for the whole session; for the system wide mode, perf allocates the generic ring buffer and the AUX ring buffer per CPU wise. > Seems to me, 2GiB AUX buffer per CPU is big enough for most tracing > scenarios, right? Except you can provide profiling scenario which > must use bigger buffer size. But I think this question is still valid. > Another factor is the allocation of buffers from kmalloc() offers better > performance compared to allocation from vmalloc(), this is also > important for perf core layer. > > Thanks, > Leo ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound 2023-06-12 10:25 ` Leo Yan @ 2023-06-13 3:00 ` Shuai Xue 0 siblings, 0 replies; 14+ messages in thread From: Shuai Xue @ 2023-06-13 3:00 UTC (permalink / raw) To: Leo Yan Cc: alexander.shishkin, peterz, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf, Baolin Wang On 2023/6/12 18:25, Leo Yan wrote: > On Mon, Jun 12, 2023 at 06:05:02PM +0800, Leo Yan wrote: >> On Mon, Jun 12, 2023 at 04:35:07PM +0800, Shuai Xue wrote: >> >> [...] >> >>>> Furthermore, I believe the AUX trace pages are only mapped for VMA >>>> (continuous virtual address), the kernel will defer to map to physical >>>> pages (which means it's not necessarily continuous physical pages) >>>> when handling data abort caused by accessing the pages. >>> >>> I don't know why the rb->aux_pages is limit to allocated with continuous physical pages. >>> so I just add a check to avoid oops and report a proper error code -EINVAL to >>> user. >>> >>> I would like to use vmalloc() family to replace kmalloc() so that we could support >>> allocate a more large AUX area if it is not necessarily continuous physical pages. >>> Should we remove the restriction? >> >> As you said, we are now able to support a maximum AUX trace buffer >> size of up to 2GiB, and AUX trace buffer is per CPU wise. > > Ouch, I reviewed my notes and correct myself: > > For per thread mode, perf tool only allocates one generic ring buffer > and one AUX ring buffer for the whole session; for the system wide mode, > perf allocates the generic ring buffer and the AUX ring buffer per CPU > wise. > >> Seems to me, 2GiB AUX buffer per CPU is big enough for most tracing >> scenarios, right? Except you can provide profiling scenario which >> must use bigger buffer size. > > But I think this question is still valid. > >> Another factor is the allocation of buffers from kmalloc() offers better >> performance compared to allocation from vmalloc(), this is also >> important for perf core layer. I see, I don't have such profiling scenario, let's leave the restriction untouched and I will move the sanity check into rb_alloc_aux(). >> >> Thanks, >> Leo Thank you. Best Regards, Shuai ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound 2023-06-12 7:38 ` Leo Yan 2023-06-12 8:35 ` Shuai Xue @ 2023-06-12 8:47 ` James Clark 1 sibling, 0 replies; 14+ messages in thread From: James Clark @ 2023-06-12 8:47 UTC (permalink / raw) To: Leo Yan, Shuai Xue Cc: alexander.shishkin, peterz, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf On 12/06/2023 08:38, Leo Yan wrote: > Hi Shuai, > > On Mon, Jun 12, 2023 at 01:24:51PM +0800, Shuai Xue wrote: >> When perf-record with a large AUX area, e.g 4GB, it fails with: >> >> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 >> failed to mmap with 12 (Cannot allocate memory) >> >> and it reveals a WARNING with __alloc_pages: >> >> [ 66.595604] ------------[ cut here ]------------ >> [ 66.600206] WARNING: CPU: 44 PID: 17573 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 >> [ 66.608375] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) aes_ce_blk(E) vfat(E) fat(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E) >> [ 66.657719] CPU: 44 PID: 17573 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #58 >> [ 66.666749] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023 >> [ 66.674650] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) >> [ 66.681597] pc : __alloc_pages+0x1ec/0x248 >> [ 66.685680] lr : __kmalloc_large_node+0xc0/0x1f8 >> [ 66.690285] sp : ffff800020523980 >> [ 66.693585] pmr_save: 000000e0 >> [ 66.696624] x29: ffff800020523980 x28: ffff000832975800 x27: 0000000000000000 >> [ 66.703746] x26: 0000000000100000 x25: 0000000000100000 x24: ffff8000083615d0 >> [ 66.710866] x23: 0000000000040dc0 x22: ffff000823d6d140 x21: 000000000000000b >> [ 66.717987] x20: 000000000000000b x19: 0000000000000000 x18: 0000000000000030 >> [ 66.725108] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff000823d6d6d0 >> [ 66.732229] x14: 0000000000000000 x13: 343373656761705f x12: 726e202c30206574 >> [ 66.739350] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffff8000083af570 >> [ 66.746471] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 000000000005fff4 >> [ 66.753592] x5 : 0000000000000000 x4 : ffff000823d6d8d8 x3 : 0000000000000000 >> [ 66.760713] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000040dc0 >> [ 66.767834] Call trace: >> [ 66.770267] __alloc_pages+0x1ec/0x248 >> [ 66.774003] __kmalloc_large_node+0xc0/0x1f8 >> [ 66.778259] __kmalloc_node+0x134/0x1e8 >> [ 66.782081] rb_alloc_aux+0xe0/0x298 >> [ 66.785643] perf_mmap+0x440/0x660 >> [ 66.789031] mmap_region+0x308/0x8a8 >> [ 66.792593] do_mmap+0x3c0/0x528 >> [ 66.795807] vm_mmap_pgoff+0xf4/0x1b8 >> [ 66.799456] ksys_mmap_pgoff+0x18c/0x218 >> [ 66.803365] __arm64_sys_mmap+0x38/0x58 >> [ 66.807187] invoke_syscall+0x50/0x128 >> [ 66.810922] el0_svc_common.constprop.0+0x58/0x188 >> [ 66.815698] do_el0_svc+0x34/0x50 >> [ 66.818999] el0_svc+0x34/0x108 >> [ 66.822127] el0t_64_sync_handler+0xb8/0xc0 >> [ 66.826296] el0t_64_sync+0x1a4/0x1a8 >> [ 66.829946] ---[ end trace 0000000000000000 ]--- >> >> The pages for AUX area are organized as rb->aux_pages[] which alloced by >> kcalloc_node() later. The kcalloc() family guarantees the pages are >> physically contiguous (and virtually contiguous) with an order of >> MAX_ORDER - 1 at maximum. > > This description is incorrect. We need to distinguish two things: > > AUX trace pages and 'rb->aux_pages' is pointer array which is used to > maintains these page. Here, the kernel oops reports the error is not > for AUX trace pages but for failing to allocate the pointer array from > slab (or slub) area. > > Furthermore, I believe the AUX trace pages are only mapped for VMA > (continuous virtual address), the kernel will defer to map to physical > pages (which means it's not necessarily continuous physical pages) > when handling data abort caused by accessing the pages. > > When you specify the AUX buffer size to 4GiB, the kernel will convert it > to page numbers (page size is 4KiB, page number is = 4GiB / 4KiB = 1MiB). > Since aarch64's pointer type's length is 8 bytes, thus we need to > allocate the 8MiB buffer from slab/slub, unfortunately, 8MiB crosses the > limitation set by MAX_ORDER (4KiB ^ (MAX_ORDER - 1) = 4MiB), this is > why we receive the oops from __alloc_pages(). > >> So bail out early with -EINVAL if the request AUX area is out of bound, >> e.g.: >> >> #perf record -C 0 -m ,4G -e arm_spe_0// -- sleep 1 >> failed to mmap with 22 (Invalid argument) >> >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >> --- >> kernel/events/core.c | 10 ++++++++++ >> tools/perf/Documentation/perf-record.txt | 3 ++- >> 2 files changed, 12 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> index 435815d..83d4e29 100644 >> --- a/kernel/events/core.c >> +++ b/kernel/events/core.c >> @@ -6404,6 +6404,16 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma) >> return -EINVAL; >> >> nr_pages = vma_size / PAGE_SIZE; >> + /* >> + * The pages for AUX area are organized as rb->aux_pages[] >> + * which alloced by kcalloc_node() later. The kcalloc() family >> + * guarantees the pages are physically contiguous (and >> + * virtually contiguous) with an order of MAX_ORDER - 1 at >> + * maximum MAX_ORDER. So bail out early if the request AUX area >> + * is out of bound. >> + */ >> + if (get_order(nr_pages * sizeof(void *)) >= MAX_ORDER) >> + return -EINVAL; > > From my view, now kernel has handled this case (I agree it might be > not directive for outputting error or warning info rather than oops). > > If we really want this checking, I'd like to add it in rb_alloc_aux(), > since rb_alloc_aux() is the place for allocation the memory, thus it's > right place for the checking memory limitation. And you might need to > consider the update the comments to avoid confusion. > > I am not the best person for the decision, I'd like to leave it to perf > maintainers and wait for their thoughts. > > Thanks, > Leo If we were to keep it in the end, it should technically have a multiplication overflow check on it as well. James > >> mutex_lock(&event->mmap_mutex); >> ret = -EINVAL; >> diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt >> index ff815c2..a50a426 100644 >> --- a/tools/perf/Documentation/perf-record.txt >> +++ b/tools/perf/Documentation/perf-record.txt >> @@ -236,7 +236,8 @@ OPTIONS >> specification with appended unit character - B/K/M/G. The >> size is rounded up to have nearest pages power of two value. >> Also, by adding a comma, the number of mmap pages for AUX >> - area tracing can be specified. >> + area tracing can be specified. With MAX_ORDER set as 11, the >> + maximum AUX area is limit to 2GB. >> >> -g:: >> Enables call-graph (stack chain/backtrace) recording for both >> -- >> 1.8.3.1 >> ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit 2023-06-12 5:24 [PATCH 0/2] Fix high-order allocations for AUX space Shuai Xue 2023-06-12 5:24 ` [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound Shuai Xue @ 2023-06-12 5:24 ` Shuai Xue 2023-06-12 8:30 ` Leo Yan 2023-06-12 8:45 ` James Clark 1 sibling, 2 replies; 14+ messages in thread From: Shuai Xue @ 2023-06-12 5:24 UTC (permalink / raw) To: alexander.shishkin, peterz Cc: mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf When perf-record with a large AUX area, e.g 2GB: #perf record -C 0 -m ,2G -e arm_spe_0// -- sleep 1 it reveals a WARNING with __alloc_pages: [ 48.888553] ------------[ cut here ]------------ [ 48.888559] WARNING: CPU: 39 PID: 17438 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 [ 48.888569] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) vfat(E) fat(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E) [ 48.888610] CPU: 39 PID: 17438 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #56 [ 48.888613] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023 [ 48.888614] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 48.888616] pc : __alloc_pages+0x1ec/0x248 [ 48.888619] lr : rb_alloc_aux_page+0x78/0xe0 [ 48.888622] sp : ffff800037c5b9e0 [ 48.888623] pmr_save: 000000e0 [ 48.888624] x29: ffff800037c5b9e0 x28: ffff00082743b800 x27: 0000000000000000 [ 48.888627] x26: 0000000000080000 x25: ffff000000000000 x24: ffff000860b21380 [ 48.888629] x23: ffff8000093fd3c0 x22: ffff00081a4a2080 x21: 000000000000000b [ 48.888631] x20: 0000000000000000 x19: 000000000000000b x18: 0000000000000020 [ 48.888634] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff00081a4a2610 [ 48.888636] x14: 0000000000000000 x13: 323173656761705f x12: ffff00477fffeb90 [ 48.888638] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000083608a0 [ 48.888641] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000040 [ 48.888643] x5 : 000000000007f400 x4 : 0000000000000000 x3 : 0000000000000000 [ 48.888645] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000012dc0 [ 48.888648] Call trace: [ 48.888650] __alloc_pages+0x1ec/0x248 [ 48.888653] rb_alloc_aux_page+0x78/0xe0 [ 48.888654] rb_alloc_aux+0x13c/0x298 [ 48.888656] perf_mmap+0x468/0x688 [ 48.888659] mmap_region+0x308/0x8a8 [ 48.888662] do_mmap+0x3c0/0x528 [ 48.888664] vm_mmap_pgoff+0xf4/0x1b8 [ 48.888666] ksys_mmap_pgoff+0x18c/0x218 [ 48.888668] __arm64_sys_mmap+0x38/0x58 [ 48.888671] invoke_syscall+0x50/0x128 [ 48.888673] el0_svc_common.constprop.0+0x58/0x188 [ 48.888674] do_el0_svc+0x34/0x50 [ 48.888676] el0_svc+0x34/0x108 [ 48.888679] el0t_64_sync_handler+0xb8/0xc0 [ 48.888681] el0t_64_sync+0x1a4/0x1a8 [ 48.888686] ---[ end trace 0000000000000000 ]--- The problem is that the high-order pages for AUX area is allocated with an order of MAX_ORDER. Obviously, this is a bogus. Fix it with an order of MAX_ORDER - 1 at maximum. Fixes: 0a4e38e64f5e ("perf: Support high-order allocations for AUX space") Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> --- kernel/events/ring_buffer.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index 273a0fe..d6bbdb7 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -609,8 +609,8 @@ static struct page *rb_alloc_aux_page(int node, int order) { struct page *page; - if (order > MAX_ORDER) - order = MAX_ORDER; + if (order >= MAX_ORDER) + order = MAX_ORDER - 1; do { page = alloc_pages_node(node, PERF_AUX_GFP, order); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit 2023-06-12 5:24 ` [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit Shuai Xue @ 2023-06-12 8:30 ` Leo Yan 2023-06-12 8:35 ` Shuai Xue 2023-06-12 8:45 ` James Clark 1 sibling, 1 reply; 14+ messages in thread From: Leo Yan @ 2023-06-12 8:30 UTC (permalink / raw) To: Shuai Xue Cc: alexander.shishkin, peterz, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf On Mon, Jun 12, 2023 at 01:24:52PM +0800, Shuai Xue wrote: > When perf-record with a large AUX area, e.g 2GB: > > #perf record -C 0 -m ,2G -e arm_spe_0// -- sleep 1 > > it reveals a WARNING with __alloc_pages: > > [ 48.888553] ------------[ cut here ]------------ > [ 48.888559] WARNING: CPU: 39 PID: 17438 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 > [ 48.888569] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) vfat(E) fat(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E) > [ 48.888610] CPU: 39 PID: 17438 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #56 > [ 48.888613] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023 > [ 48.888614] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) > [ 48.888616] pc : __alloc_pages+0x1ec/0x248 > [ 48.888619] lr : rb_alloc_aux_page+0x78/0xe0 > [ 48.888622] sp : ffff800037c5b9e0 > [ 48.888623] pmr_save: 000000e0 > [ 48.888624] x29: ffff800037c5b9e0 x28: ffff00082743b800 x27: 0000000000000000 > [ 48.888627] x26: 0000000000080000 x25: ffff000000000000 x24: ffff000860b21380 > [ 48.888629] x23: ffff8000093fd3c0 x22: ffff00081a4a2080 x21: 000000000000000b > [ 48.888631] x20: 0000000000000000 x19: 000000000000000b x18: 0000000000000020 > [ 48.888634] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff00081a4a2610 > [ 48.888636] x14: 0000000000000000 x13: 323173656761705f x12: ffff00477fffeb90 > [ 48.888638] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000083608a0 > [ 48.888641] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000040 > [ 48.888643] x5 : 000000000007f400 x4 : 0000000000000000 x3 : 0000000000000000 > [ 48.888645] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000012dc0 > [ 48.888648] Call trace: > [ 48.888650] __alloc_pages+0x1ec/0x248 > [ 48.888653] rb_alloc_aux_page+0x78/0xe0 > [ 48.888654] rb_alloc_aux+0x13c/0x298 > [ 48.888656] perf_mmap+0x468/0x688 > [ 48.888659] mmap_region+0x308/0x8a8 > [ 48.888662] do_mmap+0x3c0/0x528 > [ 48.888664] vm_mmap_pgoff+0xf4/0x1b8 > [ 48.888666] ksys_mmap_pgoff+0x18c/0x218 > [ 48.888668] __arm64_sys_mmap+0x38/0x58 > [ 48.888671] invoke_syscall+0x50/0x128 > [ 48.888673] el0_svc_common.constprop.0+0x58/0x188 > [ 48.888674] do_el0_svc+0x34/0x50 > [ 48.888676] el0_svc+0x34/0x108 > [ 48.888679] el0t_64_sync_handler+0xb8/0xc0 > [ 48.888681] el0t_64_sync+0x1a4/0x1a8 > [ 48.888686] ---[ end trace 0000000000000000 ]--- > > The problem is that the high-order pages for AUX area is allocated with > an order of MAX_ORDER. Obviously, this is a bogus. > > Fix it with an order of MAX_ORDER - 1 at maximum. > > Fixes: 0a4e38e64f5e ("perf: Support high-order allocations for AUX space") > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> LGTM: Reviewed-by: Leo Yan <leo.yan@linaro.org> > --- > kernel/events/ring_buffer.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c > index 273a0fe..d6bbdb7 100644 > --- a/kernel/events/ring_buffer.c > +++ b/kernel/events/ring_buffer.c > @@ -609,8 +609,8 @@ static struct page *rb_alloc_aux_page(int node, int order) > { > struct page *page; > > - if (order > MAX_ORDER) > - order = MAX_ORDER; > + if (order >= MAX_ORDER) > + order = MAX_ORDER - 1; > > do { > page = alloc_pages_node(node, PERF_AUX_GFP, order); > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit 2023-06-12 8:30 ` Leo Yan @ 2023-06-12 8:35 ` Shuai Xue 0 siblings, 0 replies; 14+ messages in thread From: Shuai Xue @ 2023-06-12 8:35 UTC (permalink / raw) To: Leo Yan Cc: alexander.shishkin, peterz, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf, Baolin Wang On 2023/6/12 16:30, Leo Yan wrote: > On Mon, Jun 12, 2023 at 01:24:52PM +0800, Shuai Xue wrote: >> When perf-record with a large AUX area, e.g 2GB: >> >> #perf record -C 0 -m ,2G -e arm_spe_0// -- sleep 1 >> >> it reveals a WARNING with __alloc_pages: >> >> [ 48.888553] ------------[ cut here ]------------ >> [ 48.888559] WARNING: CPU: 39 PID: 17438 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 >> [ 48.888569] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) vfat(E) fat(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E) >> [ 48.888610] CPU: 39 PID: 17438 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #56 >> [ 48.888613] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023 >> [ 48.888614] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) >> [ 48.888616] pc : __alloc_pages+0x1ec/0x248 >> [ 48.888619] lr : rb_alloc_aux_page+0x78/0xe0 >> [ 48.888622] sp : ffff800037c5b9e0 >> [ 48.888623] pmr_save: 000000e0 >> [ 48.888624] x29: ffff800037c5b9e0 x28: ffff00082743b800 x27: 0000000000000000 >> [ 48.888627] x26: 0000000000080000 x25: ffff000000000000 x24: ffff000860b21380 >> [ 48.888629] x23: ffff8000093fd3c0 x22: ffff00081a4a2080 x21: 000000000000000b >> [ 48.888631] x20: 0000000000000000 x19: 000000000000000b x18: 0000000000000020 >> [ 48.888634] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff00081a4a2610 >> [ 48.888636] x14: 0000000000000000 x13: 323173656761705f x12: ffff00477fffeb90 >> [ 48.888638] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000083608a0 >> [ 48.888641] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000040 >> [ 48.888643] x5 : 000000000007f400 x4 : 0000000000000000 x3 : 0000000000000000 >> [ 48.888645] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000012dc0 >> [ 48.888648] Call trace: >> [ 48.888650] __alloc_pages+0x1ec/0x248 >> [ 48.888653] rb_alloc_aux_page+0x78/0xe0 >> [ 48.888654] rb_alloc_aux+0x13c/0x298 >> [ 48.888656] perf_mmap+0x468/0x688 >> [ 48.888659] mmap_region+0x308/0x8a8 >> [ 48.888662] do_mmap+0x3c0/0x528 >> [ 48.888664] vm_mmap_pgoff+0xf4/0x1b8 >> [ 48.888666] ksys_mmap_pgoff+0x18c/0x218 >> [ 48.888668] __arm64_sys_mmap+0x38/0x58 >> [ 48.888671] invoke_syscall+0x50/0x128 >> [ 48.888673] el0_svc_common.constprop.0+0x58/0x188 >> [ 48.888674] do_el0_svc+0x34/0x50 >> [ 48.888676] el0_svc+0x34/0x108 >> [ 48.888679] el0t_64_sync_handler+0xb8/0xc0 >> [ 48.888681] el0t_64_sync+0x1a4/0x1a8 >> [ 48.888686] ---[ end trace 0000000000000000 ]--- >> >> The problem is that the high-order pages for AUX area is allocated with >> an order of MAX_ORDER. Obviously, this is a bogus. >> >> Fix it with an order of MAX_ORDER - 1 at maximum. >> >> Fixes: 0a4e38e64f5e ("perf: Support high-order allocations for AUX space") >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> > > LGTM: > > Reviewed-by: Leo Yan <leo.yan@linaro.org> Thank you :) Cheers Shuai > >> --- >> kernel/events/ring_buffer.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c >> index 273a0fe..d6bbdb7 100644 >> --- a/kernel/events/ring_buffer.c >> +++ b/kernel/events/ring_buffer.c >> @@ -609,8 +609,8 @@ static struct page *rb_alloc_aux_page(int node, int order) >> { >> struct page *page; >> >> - if (order > MAX_ORDER) >> - order = MAX_ORDER; >> + if (order >= MAX_ORDER) >> + order = MAX_ORDER - 1; >> >> do { >> page = alloc_pages_node(node, PERF_AUX_GFP, order); >> -- >> 1.8.3.1 >> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit 2023-06-12 5:24 ` [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit Shuai Xue 2023-06-12 8:30 ` Leo Yan @ 2023-06-12 8:45 ` James Clark 2023-06-12 9:09 ` Leo Yan 1 sibling, 1 reply; 14+ messages in thread From: James Clark @ 2023-06-12 8:45 UTC (permalink / raw) To: Shuai Xue, alexander.shishkin, peterz, kirill Cc: mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf On 12/06/2023 06:24, Shuai Xue wrote: > When perf-record with a large AUX area, e.g 2GB: > > #perf record -C 0 -m ,2G -e arm_spe_0// -- sleep 1 > > it reveals a WARNING with __alloc_pages: > > [ 48.888553] ------------[ cut here ]------------ > [ 48.888559] WARNING: CPU: 39 PID: 17438 at mm/page_alloc.c:5568 __alloc_pages+0x1ec/0x248 > [ 48.888569] Modules linked in: ip6table_filter(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) vfat(E) fat(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sm4_ce_cipher(E) sm4(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) acpi_ipmi(E) sbsa_gwdt(E) sg(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) ip_tables(E) sd_mod(E) ast(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) nvme(E) sysimgblt(E) i2c_algo_bit(E) nvme_core(E) drm_shmem_helper(E) ahci(E) t10_pi(E) libahci(E) drm(E) crc64_rocksoft(E) i40e(E) crc64(E) libata(E) i2c_core(E) > [ 48.888610] CPU: 39 PID: 17438 Comm: perf Kdump: loaded Tainted: G E 6.3.0-rc4+ #56 > [ 48.888613] Hardware name: Default Default/Default, BIOS 1.2.M1.AL.P.139.00 03/22/2023 > [ 48.888614] pstate: 23400009 (nzCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) > [ 48.888616] pc : __alloc_pages+0x1ec/0x248 > [ 48.888619] lr : rb_alloc_aux_page+0x78/0xe0 > [ 48.888622] sp : ffff800037c5b9e0 > [ 48.888623] pmr_save: 000000e0 > [ 48.888624] x29: ffff800037c5b9e0 x28: ffff00082743b800 x27: 0000000000000000 > [ 48.888627] x26: 0000000000080000 x25: ffff000000000000 x24: ffff000860b21380 > [ 48.888629] x23: ffff8000093fd3c0 x22: ffff00081a4a2080 x21: 000000000000000b > [ 48.888631] x20: 0000000000000000 x19: 000000000000000b x18: 0000000000000020 > [ 48.888634] x17: 0000000000000000 x16: ffff800008f05be8 x15: ffff00081a4a2610 > [ 48.888636] x14: 0000000000000000 x13: 323173656761705f x12: ffff00477fffeb90 > [ 48.888638] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff8000083608a0 > [ 48.888641] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000040 > [ 48.888643] x5 : 000000000007f400 x4 : 0000000000000000 x3 : 0000000000000000 > [ 48.888645] x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000012dc0 > [ 48.888648] Call trace: > [ 48.888650] __alloc_pages+0x1ec/0x248 > [ 48.888653] rb_alloc_aux_page+0x78/0xe0 > [ 48.888654] rb_alloc_aux+0x13c/0x298 > [ 48.888656] perf_mmap+0x468/0x688 > [ 48.888659] mmap_region+0x308/0x8a8 > [ 48.888662] do_mmap+0x3c0/0x528 > [ 48.888664] vm_mmap_pgoff+0xf4/0x1b8 > [ 48.888666] ksys_mmap_pgoff+0x18c/0x218 > [ 48.888668] __arm64_sys_mmap+0x38/0x58 > [ 48.888671] invoke_syscall+0x50/0x128 > [ 48.888673] el0_svc_common.constprop.0+0x58/0x188 > [ 48.888674] do_el0_svc+0x34/0x50 > [ 48.888676] el0_svc+0x34/0x108 > [ 48.888679] el0t_64_sync_handler+0xb8/0xc0 > [ 48.888681] el0t_64_sync+0x1a4/0x1a8 > [ 48.888686] ---[ end trace 0000000000000000 ]--- > > The problem is that the high-order pages for AUX area is allocated with > an order of MAX_ORDER. Obviously, this is a bogus. > > Fix it with an order of MAX_ORDER - 1 at maximum. > > Fixes: 0a4e38e64f5e ("perf: Support high-order allocations for AUX space") > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> > --- > kernel/events/ring_buffer.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c > index 273a0fe..d6bbdb7 100644 > --- a/kernel/events/ring_buffer.c > +++ b/kernel/events/ring_buffer.c > @@ -609,8 +609,8 @@ static struct page *rb_alloc_aux_page(int node, int order) > { > struct page *page; > > - if (order > MAX_ORDER) > - order = MAX_ORDER; > + if (order >= MAX_ORDER) > + order = MAX_ORDER - 1; > > do { > page = alloc_pages_node(node, PERF_AUX_GFP, order); It seems like this was only just recently changed with this as the commit message (23baf83): mm, treewide: redefine MAX_ORDER sanely MAX_ORDER currently defined as number of orders page allocator supports: user can ask buddy allocator for page order between 0 and MAX_ORDER-1. This definition is counter-intuitive and lead to number of bugs all over the kernel. Change the definition of MAX_ORDER to be inclusive: the range of orders user can ask from buddy allocator is 0..MAX_ORDER now. It might be worth referring to this in the commit message or adding a fixes: reference. Or maybe this new change isn't quite right? James ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit 2023-06-12 8:45 ` James Clark @ 2023-06-12 9:09 ` Leo Yan 2023-06-13 6:42 ` Shuai Xue 0 siblings, 1 reply; 14+ messages in thread From: Leo Yan @ 2023-06-12 9:09 UTC (permalink / raw) To: James Clark Cc: Shuai Xue, alexander.shishkin, peterz, kirill, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf On Mon, Jun 12, 2023 at 09:45:38AM +0100, James Clark wrote: [...] > > @@ -609,8 +609,8 @@ static struct page *rb_alloc_aux_page(int node, int order) > > { > > struct page *page; > > > > - if (order > MAX_ORDER) > > - order = MAX_ORDER; > > + if (order >= MAX_ORDER) > > + order = MAX_ORDER - 1; > > > > do { > > page = alloc_pages_node(node, PERF_AUX_GFP, order); > > > It seems like this was only just recently changed with this as the > commit message (23baf83): > > mm, treewide: redefine MAX_ORDER sanely > > MAX_ORDER currently defined as number of orders page allocator > supports: user can ask buddy allocator for page order between 0 and > MAX_ORDER-1. > > This definition is counter-intuitive and lead to number of bugs all > over the kernel. > > Change the definition of MAX_ORDER to be inclusive: the range of > orders user can ask from buddy allocator is 0..MAX_ORDER now. > > It might be worth referring to this in the commit message or adding a > fixes: reference. Or maybe this new change isn't quite right? Good point. If so, we don't need this patch anymore. Thanks for reminding, James. Leo ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit 2023-06-12 9:09 ` Leo Yan @ 2023-06-13 6:42 ` Shuai Xue 0 siblings, 0 replies; 14+ messages in thread From: Shuai Xue @ 2023-06-13 6:42 UTC (permalink / raw) To: Leo Yan, James Clark Cc: alexander.shishkin, peterz, kirill, mingo, acme, mark.rutland, jolsa, namhyung, irogers, adrian.hunter, linux-perf-users, linux-kernel, bpf On 2023/6/12 17:09, Leo Yan wrote: > On Mon, Jun 12, 2023 at 09:45:38AM +0100, James Clark wrote: > > [...] > >>> @@ -609,8 +609,8 @@ static struct page *rb_alloc_aux_page(int node, int order) >>> { >>> struct page *page; >>> >>> - if (order > MAX_ORDER) >>> - order = MAX_ORDER; >>> + if (order >= MAX_ORDER) >>> + order = MAX_ORDER - 1; >>> >>> do { >>> page = alloc_pages_node(node, PERF_AUX_GFP, order); >> >> >> It seems like this was only just recently changed with this as the >> commit message (23baf83): >> >> mm, treewide: redefine MAX_ORDER sanely >> >> MAX_ORDER currently defined as number of orders page allocator >> supports: user can ask buddy allocator for page order between 0 and >> MAX_ORDER-1. >> >> This definition is counter-intuitive and lead to number of bugs all >> over the kernel. >> >> Change the definition of MAX_ORDER to be inclusive: the range of >> orders user can ask from buddy allocator is 0..MAX_ORDER now. >> >> It might be worth referring to this in the commit message or adding a >> fixes: reference. Or maybe this new change isn't quite right? > > Good point. If so, we don't need this patch anymore. > > Thanks for reminding, James. > > Leo Hi, Leo and James, I tested on the Linus master tree, the mentioned commit 23baf83 ("mm, treewide: redefine MAX_ORDER sanely") has fix this oops. I will drop out this patch, thank you :) Cheers, Shuai ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2023-06-13 6:42 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-06-12 5:24 [PATCH 0/2] Fix high-order allocations for AUX space Shuai Xue 2023-06-12 5:24 ` [PATCH 1/2] perf/core: Bail out early if the request AUX area is out of bound Shuai Xue 2023-06-12 7:38 ` Leo Yan 2023-06-12 8:35 ` Shuai Xue 2023-06-12 10:05 ` Leo Yan 2023-06-12 10:25 ` Leo Yan 2023-06-13 3:00 ` Shuai Xue 2023-06-12 8:47 ` James Clark 2023-06-12 5:24 ` [PATCH 2/2] perf/ring_buffer: Fix high-order allocations for AUX space with correct MAX_ORDER limit Shuai Xue 2023-06-12 8:30 ` Leo Yan 2023-06-12 8:35 ` Shuai Xue 2023-06-12 8:45 ` James Clark 2023-06-12 9:09 ` Leo Yan 2023-06-13 6:42 ` Shuai Xue
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).