* [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
@ 2025-05-07 17:55 David Wang
2025-05-07 18:19 ` David Wang
` (5 more replies)
0 siblings, 6 replies; 44+ messages in thread
From: David Wang @ 2025-05-07 17:55 UTC (permalink / raw)
To: surenb, kent.overstreet, akpm; +Cc: linux-mm, linux-kernel, David Wang
---
The patch is not complete; I just want feedback on whether this is
worth carrying on.
---
When reading /proc/allocinfo, seq_file invokes the start/stop
callbacks for every read syscall. In the start callback, memory is
allocated to store the iterator, and the iterator restarts from the
beginning and walks forward to its previous position.
Each seq_file read() returns at most 4096 bytes, even when called
with a larger user-space buffer, so reading out /proc/allocinfo takes
tens of read syscalls. For example, a 306036-byte allocinfo file
needs 76 reads:
$ sudo cat /proc/allocinfo | wc
3964 16678 306036
For those n=3964 lines, each read covers about m=3964/76=52 lines,
and the iterator has to be re-walked from the start on every read:
m steps on the 1st read,
2*m steps on the 2nd read,
3*m steps on the 3rd read,
...
n steps on the last read.
In total, the iterator is stepped O(n*n/m) times.
(Each read takes more time than the previous one.)
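As a back-of-the-envelope check with the numbers above:

  total steps = m + 2*m + 3*m + ... + n
              = m * (1 + 2 + ... + n/m)
              ~ n*n / (2*m)
              = 3964 * 3964 / (2 * 52)
              ~ 150,000 iterator steps

versus only n = 3964 steps if the iterator is kept across reads.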
Using private data allocated when /proc/allocinfo is opened, the n/m
memory allocations can be avoided, and there is no need to restart
the iterator from the very beginning every time.
So only 1 memory allocation and n iterator steps are needed.
(Only when the module list changes does the iterator need to be
invalidated and restarted.)
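One way the invalidation could be handled (a rough sketch only; the
mod_seq field and codetag_mod_seq() helper below are illustrative and
do not exist today) is a per-codetag-type sequence number bumped on
every module load/unload, with the cached iterator rewound whenever
its copy of that number goes stale:

/* Rough sketch; mod_seq / codetag_mod_seq() are illustrative, not existing API. */
struct allocinfo_private {
	struct codetag_iterator iter;
	bool print_header;
	u64 mod_seq;	/* module-list generation the cached iter was built against */
};

static void *allocinfo_start(struct seq_file *m, loff_t *pos)
{
	struct allocinfo_private *priv = m->private;
	loff_t node = *pos;

	codetag_lock_module_list(alloc_tag_cttype, true);
	if (node == 0 || priv->mod_seq != codetag_mod_seq(alloc_tag_cttype)) {
		/* first read, or a module came/went since the last read: rewind */
		priv->print_header = (node == 0);
		priv->mod_seq = codetag_mod_seq(alloc_tag_cttype);
		priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
		while (codetag_next_ct(&priv->iter) && node)
			node--;
	}
	return priv->iter.ct ? priv : NULL;
}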
Timings before:
$ time cat /proc/allocinfo > /dev/null
real 0m0.007s
user 0m0.000s
sys 0m0.007s
read syscalls get slower and slower:
read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000062>
...
read(3, " 0 0 drivers/gp"..., 131072) = 4046 <0.000135>
read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
...
and with the change:
$ time cat /proc/allocinfo > /dev/null
real 0m0.004s
user 0m0.000s
sys 0m0.003s
7ms drops to 4ms, which is friendlier to monitoring tools.
Detailed strace stats before the change:
$ sudo strace -T -e read cat /proc/allocinfo > /dev/null
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@>\2\0\0\0\0\0"..., 832) = 832 <0.000040>
read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000062>
read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
read(3, " 0 0 arch/x86/k"..., 131072) = 4029 <0.000061>
read(3, " 64 1 kernel/wor"..., 131072) = 4064 <0.000061>
read(3, " 0 0 kernel/sch"..., 131072) = 4041 <0.000051>
read(3, " 4992 39 kernel/irq"..., 131072) = 4061 <0.000070>
read(3, " 0 0 kernel/fut"..., 131072) = 4060 <0.000071>
read(3, " 0 0 kernel/aud"..., 131072) = 4025 <0.000122>
read(3, " 1536 3 kernel/tra"..., 131072) = 4014 <0.000064>
read(3, " 0 0 kernel/tra"..., 131072) = 4079 <0.000067>
read(3, " 0 0 kernel/tra"..., 131072) = 4035 <0.000072>
read(3, " 0 0 kernel/bpf"..., 131072) = 4044 <0.000074>
read(3, " 0 0 kernel/bpf"..., 131072) = 4061 <0.000063>
read(3, " 0 0 ./include/"..., 131072) = 4084 <0.000063>
read(3, " 0 0 kernel/eve"..., 131072) = 4084 <0.000071>
read(3, " 278528 68 mm/memory."..., 131072) = 4068 <0.000070>
read(3, " 0 0 mm/zswap.c"..., 131072) = 4091 <0.000082>
read(3, " 0 0 mm/balloon"..., 131072) = 4039 <0.000083>
read(3, " 0 0 fs/splice."..., 131072) = 4095 <0.000083>
read(3, " 0 0 fs/binfmt_"..., 131072) = 4039 <0.000075>
read(3, " 192 3 fs/proc/kc"..., 131072) = 4088 <0.000076>
read(3, " 0 0 ipc/mqueue"..., 131072) = 4017 <0.000076>
read(3, " 0 0 security/s"..., 131072) = 4075 <0.000072>
read(3, " 0 0 security/s"..., 131072) = 4029 <0.000084>
read(3, " 0 0 security/t"..., 131072) = 4053 <0.000076>
read(3, " 0 0 security/a"..., 131072) = 4054 <0.000078>
read(3, " 0 0 crypto/rsa"..., 131072) = 4046 <0.000081>
read(3, " 0 0 crypto/asy"..., 131072) = 4079 <0.000081>
read(3, " 1728 9 block/blk-"..., 131072) = 4034 <0.000084>
read(3, " 0 0 io_uring/r"..., 131072) = 4044 <0.000085>
read(3, " 0 0 lib/crypto"..., 131072) = 4056 <0.000363>
read(3, " 0 0 lib/pldmfw"..., 131072) = 4079 <0.000091>
read(3, " 384 6 drivers/pc"..., 131072) = 4091 <0.000098>
read(3, " 0 0 drivers/vi"..., 131072) = 4068 <0.000093>
read(3, " 4096 1 drivers/ac"..., 131072) = 4086 <0.000089>
read(3, " 11040 230 drivers/ac"..., 131072) = 4001 <0.000090>
read(3, " 0 0 drivers/ac"..., 131072) = 4030 <0.000094>
read(3, " 0 0 drivers/dm"..., 131072) = 4031 <0.000094>
read(3, " 0 0 drivers/xe"..., 131072) = 4085 <0.000097>
read(3, " 4224 33 drivers/tt"..., 131072) = 4086 <0.000097>
read(3, " 0 0 drivers/io"..., 131072) = 4038 <0.000097>
read(3, " 0 0 drivers/io"..., 131072) = 4034 <0.000100>
read(3, " 64 2 drivers/ba"..., 131072) = 4094 <0.000110>
read(3, " 0 0 drivers/ba"..., 131072) = 4025 <0.000100>
read(3, " 0 0 drivers/dm"..., 131072) = 4077 <0.000106>
read(3, " 0 0 drivers/i2"..., 131072) = 4049 <0.000103>
read(3, " 0 0 drivers/ed"..., 131072) = 4035 <0.000108>
read(3, " 0 0 drivers/fi"..., 131072) = 4052 <0.000107>
read(3, " 0 0 net/core/s"..., 131072) = 4057 <0.000119>
read(3, " 5248 82 net/core/r"..., 131072) = 4050 <0.000110>
read(3, " 0 0 net/core/d"..., 131072) = 4047 <0.000111>
read(3, " 0 0 net/ethtoo"..., 131072) = 4059 <0.000112>
read(3, " 32 1 net/ipv4/r"..., 131072) = 4052 <0.000116>
read(3, " 1920 10 net/ipv4/i"..., 131072) = 4040 <0.000116>
read(3, " 0 0 net/xfrm/x"..., 131072) = 4071 <0.000118>
read(3, " 0 0 net/ipv6/i"..., 131072) = 4053 <0.000120>
read(3, " 0 0 net/devlin"..., 131072) = 4064 <0.000120>
read(3, " 0 0 lib/decomp"..., 131072) = 3996 <0.000123>
read(3, " 0 0 drivers/vi"..., 131072) = 4043 <0.000121>
read(3, " 0 0 drivers/us"..., 131072) = 4045 <0.000123>
read(3, " 0 0 drivers/us"..., 131072) = 4020 <0.000126>
read(3, " 64 1 drivers/sc"..., 131072) = 4043 <0.000128>
read(3, " 0 0 drivers/ne"..., 131072) = 4094 <0.000130>
read(3, " 0 0 drivers/hi"..., 131072) = 4025 <0.000132>
read(3, " 672 6 fs/ext4/mb"..., 131072) = 4071 <0.000132>
read(3, " 0 0 fs/autofs/"..., 131072) = 4038 <0.000134>
read(3, " 0 0 fs/fuse/fi"..., 131072) = 4093 <0.000134>
read(3, " 0 0 drivers/gp"..., 131072) = 4035 <0.000133>
read(3, " 0 0 drivers/gp"..., 131072) = 4046 <0.000135>
read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
read(3, " 0 0 sound/core"..., 131072) = 4000 <0.000134>
read(3, " 2048 1 drivers/vi"..., 131072) = 4078 <0.000138>
read(3, " 3328 104 drivers/gp"..., 131072) = 4040 <0.000143>
read(3, " 0 0 fs/overlay"..., 131072) = 4075 <0.000141>
read(3, " 0 0 net/netfil"..., 131072) = 4061 <0.000139>
read(3, " 0 0 net/xfrm/x"..., 131072) = 2022 <0.000129>
read(3, "", 131072) = 0 <0.000114>
And with this change:
$ sudo strace -T -e read cat /proc/allocinfo > /dev/null
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@>\2\0\0\0\0\0"..., 832) = 832 <0.000018>
read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000066>
read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
read(3, " 0 0 arch/x86/k"..., 131072) = 4029 <0.000062>
read(3, " 64 1 kernel/wor"..., 131072) = 4064 <0.000064>
read(3, " 0 0 kernel/sch"..., 131072) = 4041 <0.000046>
read(3, " 4992 39 kernel/irq"..., 131072) = 4061 <0.000053>
read(3, " 0 0 kernel/fut"..., 131072) = 4060 <0.000054>
read(3, " 0 0 kernel/aud"..., 131072) = 4025 <0.000053>
read(3, " 1536 3 kernel/tra"..., 131072) = 4014 <0.000051>
read(3, " 0 0 kernel/tra"..., 131072) = 4079 <0.000050>
read(3, " 0 0 kernel/tra"..., 131072) = 4035 <0.000051>
read(3, " 0 0 kernel/bpf"..., 131072) = 4044 <0.000053>
read(3, " 0 0 kernel/bpf"..., 131072) = 4061 <0.000053>
read(3, " 0 0 ./include/"..., 131072) = 4084 <0.000052>
read(3, " 0 0 kernel/eve"..., 131072) = 4084 <0.000072>
read(3, " 286720 70 mm/memory."..., 131072) = 4068 <0.000071>
read(3, " 0 0 mm/zswap.c"..., 131072) = 4091 <0.000056>
read(3, " 0 0 mm/balloon"..., 131072) = 4039 <0.000056>
read(3, " 0 0 fs/splice."..., 131072) = 4095 <0.000076>
read(3, " 0 0 fs/binfmt_"..., 131072) = 4039 <0.000075>
read(3, " 192 3 fs/proc/kc"..., 131072) = 4088 <0.000057>
read(3, " 0 0 ipc/mqueue"..., 131072) = 4017 <0.000055>
read(3, " 0 0 security/s"..., 131072) = 4075 <0.000046>
read(3, " 0 0 security/s"..., 131072) = 4029 <0.000045>
read(3, " 0 0 security/t"..., 131072) = 4053 <0.000050>
read(3, " 0 0 security/a"..., 131072) = 4054 <0.000060>
read(3, " 0 0 crypto/rsa"..., 131072) = 4046 <0.000181>
read(3, " 0 0 crypto/asy"..., 131072) = 4079 <0.000051>
read(3, " 1728 9 block/blk-"..., 131072) = 4034 <0.000070>
read(3, " 0 0 io_uring/r"..., 131072) = 4044 <0.000078>
read(3, " 0 0 lib/crypto"..., 131072) = 4066 <0.000067>
read(3, " 0 0 arch/x86/l"..., 131072) = 4085 <0.000066>
read(3, " 7168 7 drivers/pc"..., 131072) = 4090 <0.000066>
read(3, " 0 0 drivers/vi"..., 131072) = 4063 <0.000065>
read(3, " 1024 1 drivers/ac"..., 131072) = 4094 <0.000067>
read(3, " 0 0 drivers/ac"..., 131072) = 4023 <0.000063>
read(3, " 2048 2 drivers/ac"..., 131072) = 4020 <0.000066>
read(3, " 0 0 drivers/dm"..., 131072) = 4094 <0.000067>
read(3, " 0 0 drivers/xe"..., 131072) = 4087 <0.000068>
read(3, " 1024 1 drivers/tt"..., 131072) = 4093 <0.000056>
read(3, " 0 0 drivers/io"..., 131072) = 4037 <0.000044>
read(3, " 0 0 drivers/io"..., 131072) = 4095 <0.000061>
read(3, " 0 0 drivers/ba"..., 131072) = 4045 <0.000059>
read(3, " 0 0 drivers/ba"..., 131072) = 4011 <0.000057>
read(3, " 0 0 drivers/ma"..., 131072) = 4087 <0.000064>
read(3, " 0 0 drivers/i2"..., 131072) = 4028 <0.000059>
read(3, " 0 0 drivers/ed"..., 131072) = 4048 <0.000080>
read(3, " 0 0 drivers/fi"..., 131072) = 4082 <0.000054>
read(3, " 0 0 net/core/s"..., 131072) = 4091 <0.000068>
read(3, " 0 0 net/core/f"..., 131072) = 4041 <0.000053>
read(3, " 20 1 net/sched/"..., 131072) = 4054 <0.000051>
read(3, " 0 0 net/ethtoo"..., 131072) = 4045 <0.000051>
read(3, " 0 0 net/ipv4/r"..., 131072) = 4040 <0.000051>
read(3, " 2048 1 net/ipv4/f"..., 131072) = 4056 <0.000050>
read(3, " 0 0 net/xfrm/x"..., 131072) = 4053 <0.000056>
read(3, " 3072 6 net/ipv6/i"..., 131072) = 4095 <0.000060>
read(3, " 0 0 net/devlin"..., 131072) = 4094 <0.000064>
read(3, " 0 0 lib/decomp"..., 131072) = 4023 <0.000063>
read(3, " 2048 2 drivers/us"..., 131072) = 4053 <0.000062>
read(3, " 0 0 drivers/us"..., 131072) = 4011 <0.000068>
read(3, " 0 0 drivers/in"..., 131072) = 4092 <0.000059>
read(3, " 64 1 drivers/sc"..., 131072) = 4067 <0.000064>
read(3, " 0 0 drivers/ne"..., 131072) = 4059 <0.000067>
read(3, " 8 1 drivers/hi"..., 131072) = 4050 <0.000052>
read(3, " 256 1 fs/ext4/mb"..., 131072) = 4024 <0.000070>
read(3, " 0 0 fs/autofs/"..., 131072) = 4082 <0.000070>
read(3, " 0 0 fs/fuse/in"..., 131072) = 4074 <0.000054>
read(3, " 0 0 drivers/gp"..., 131072) = 4066 <0.000063>
read(3, " 0 0 drivers/gp"..., 131072) = 4091 <0.000052>
read(3, " 0 0 drivers/gp"..., 131072) = 4029 <0.000064>
read(3, " 0 0 drivers/gp"..., 131072) = 4094 <0.000062>
read(3, " 128 2 sound/core"..., 131072) = 4069 <0.000068>
read(3, " 64 1 sound/pci/"..., 131072) = 4074 <0.000069>
read(3, " 384 4 net/bridge"..., 131072) = 4021 <0.000054>
read(3, " 0 0 net/netfil"..., 131072) = 4084 <0.000050>
read(3, " 0 0 net/xfrm/x"..., 131072) = 1513 <0.000047>
read(3, "", 131072) = 0 <0.000030>
Signed-off-by: David Wang <00107082@163.com>
---
lib/alloc_tag.c | 30 +++++++++++-------------------
1 file changed, 11 insertions(+), 19 deletions(-)
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 25ecc1334b67..ec83cf798209 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -45,21 +45,17 @@ struct allocinfo_private {
static void *allocinfo_start(struct seq_file *m, loff_t *pos)
{
struct allocinfo_private *priv;
- struct codetag *ct;
loff_t node = *pos;
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- m->private = priv;
- if (!priv)
- return NULL;
-
- priv->print_header = (node == 0);
+ priv = (struct allocinfo_private *)m->private;
codetag_lock_module_list(alloc_tag_cttype, true);
- priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
- while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
- node--;
-
- return ct ? priv : NULL;
+ if (node == 0) {
+ priv->print_header = true;
+ priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
+ codetag_next_ct(&priv->iter);
+ }
+ // TODO: need to check iter is valid, or rewinding it.
+ return priv->iter.ct ? priv : NULL;
}
static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
@@ -76,12 +72,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
static void allocinfo_stop(struct seq_file *m, void *arg)
{
- struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
-
- if (priv) {
- codetag_lock_module_list(alloc_tag_cttype, false);
- kfree(priv);
- }
+ codetag_lock_module_list(alloc_tag_cttype, false);
}
static void print_allocinfo_header(struct seq_buf *buf)
@@ -249,7 +240,8 @@ static void __init procfs_init(void)
if (!mem_profiling_support)
return;
- if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
+ if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
+ sizeof(struct allocinfo_private), NULL)) {
pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
shutdown_mem_profiling(false);
}
--
2.39.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-07 17:55 [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo David Wang
@ 2025-05-07 18:19 ` David Wang
2025-05-07 23:42 ` [PATCH] " Suren Baghdasaryan
2025-05-07 23:36 ` Suren Baghdasaryan
` (4 subsequent siblings)
5 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-07 18:19 UTC (permalink / raw)
To: surenb, kent.overstreet, akpm; +Cc: linux-mm, linux-kernel
Hi,
Just want to share how I noticed this memory allocation behavior: the cumulative counters~!
With cumulative counters, I can identify which code keeps allocating and freeing memory by the ratio between
cumulative calls and remaining calls, and maybe an optimization can be applied.
Following is top16 I got on my system:
+-----------------------------------------+-------+------------------+--------------------+
| alloc | calls | cumulative calls | ratio |
+-----------------------------------------+-------+------------------+--------------------+
| fs/seq_file.c:584 | 2 | 18064825 | 9032412.5 |
| fs/seq_file.c:38 | 5 | 18148288 | 3629657.6 |
| fs/seq_file.c:63 | 15 | 18153271 | 1210218.0666666667 |
| net/core/skbuff.c:577 | 9 | 10679975 | 1186663.888888889 |
| net/core/skbuff.c:658 | 21 | 11013437 | 524449.380952381 |
| fs/select.c:168 | 7 | 2831226 | 404460.85714285716 |
| lib/alloc_tag.c:51 | 1 | 340649 | 340649.0 | <--- Here I started
| kernel/signal.c:455 | 1 | 300730 | 300730.0 |
| fs/notify/inotify/inotify_fsnotify.c:96 | 1 | 249831 | 249831.0 |
| fs/ext4/dir.c:675 | 3 | 519734 | 173244.66666666666 |
| drivers/usb/host/xhci.c:1555 | 4 | 126402 | 31600.5 |
| fs/locks.c:275 | 36 | 986957 | 27415.472222222223 |
| fs/proc/inode.c:502 | 3 | 63753 | 21251.0 |
| fs/pipe.c:125 | 123 | 2143378 | 17425.837398373984 |
| net/core/scm.c:84 | 3 | 43267 | 14422.333333333334 |
| fs/kernel_read_file.c:80 | 2 | 26910 | 13455.0 |
+-----------------------------------------+-------+------------------+--------------------+
I think this is another "good" use for cumulative counters: if some code just keeps allocating and freeing memory,
it may be worth moving the allocation/free somewhere less frequent.
In the case of this patch, a memory allocation on every read call can be moved to the open call.
If there is interest, I can re-send the cumulative counters patch for further discussion.
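Roughly, the idea is one extra monotonically increasing per-tag
counter next to the existing ones, bumped on every allocation and
never decremented on free (sketch only; the accumulated_calls name is
illustrative, not the posted patch):

/* Sketch of the idea only; accumulated_calls is an illustrative name. */
struct alloc_tag_counters {
	u64 bytes;		/* bytes currently allocated */
	u64 calls;		/* allocations currently live */
	u64 accumulated_calls;	/* total allocations ever, never decremented */
};

The allocation fast path would then do one extra per-cpu increment
(e.g. this_cpu_inc() on the new field), and the ratio
accumulated_calls/calls is what the table above is sorted by to spot
alloc/free churn.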
FYI
David
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-07 17:55 [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo David Wang
2025-05-07 18:19 ` David Wang
@ 2025-05-07 23:36 ` Suren Baghdasaryan
2025-05-08 3:10 ` David Wang
2025-05-08 15:32 ` David Wang
2025-05-09 5:51 ` [PATCH 1/2] alloc_tag: add timestamp to codetag iterator David Wang
` (3 subsequent siblings)
5 siblings, 2 replies; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-07 23:36 UTC (permalink / raw)
To: David Wang; +Cc: kent.overstreet, akpm, linux-mm, linux-kernel
On Wed, May 7, 2025 at 5:55 PM David Wang <00107082@163.com> wrote:
>
> ---
> The patch is not complete, just want to get feedbacks whether this
> worth carrying on.
In such cases it's customary to mark the patch as RFC, which saves
you time explaining your motivation :)
> ---
> When reading /proc/allocinfo, for each read syscall, seq_file would
> invoke start/stop callbacks. In start callback, a memory is alloced
> to store iterator and the iterator would restart from beginning to
> walk to its previous position.
> Each seq_file read() takes at most 4096 bytes, even read with a larger
> user space buffer, meaning read out /proc/allocinfo, tens of read
> syscalls are needed. For example, a 306036 bytes allocinfo files need
> 76 reads:
>
> $ sudo cat /proc/allocinfo | wc
> 3964 16678 306036
>
> For those n=3964 lines, each read takes about m=3964/76=52 lines,
> the iter would be rewinding:
> m steps on 1st read,
> 2*m steps on 2nd read
> 3*m steps on 3rd read
> ...
> n steps on the last read
> totally, the iterator would be iterated O(n*n/m) times.
> (Each read would take more time than previous one.)
>
> To use a private data alloced when /proc/allocinfo is opened,
> the n/m memory alloction could be avoid, and there is no need
> to restart the iterator from very beginning everytime.
> So only 1 memory allocation and n steps for iterating are needed.
> (Only when module changed, the iterator should be invalidated and
> restart.)
Yeah, your change makes sense and looks like a good optimization. From
a quick look at the code, codetag_next_ct() should handle the case
when a module gets removed from under us while we are not holding
cttype->mod_lock. I'll need to take another closer look at it once you
post an official patch.
Thanks!
>
> Timings before:
> $ time cat /proc/allocinfo > /dev/null
>
> real 0m0.007s
> user 0m0.000s
> sys 0m0.007s
> read-syscalls get slower and slower:
> read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000062>
> ...
> read(3, " 0 0 drivers/gp"..., 131072) = 4046 <0.000135>
> read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
> ...
>
> and with the change:
> $ time cat /proc/allocinfo > /dev/null
>
> real 0m0.004s
> user 0m0.000s
> sys 0m0.003s
> 7ms drop to 4ms, more friendly to monitoring tools.
>
> Detailed strace stats before the change:
> $ sudo strace -T -e read cat /proc/allocinfo > /dev/null
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@>\2\0\0\0\0\0"..., 832) = 832 <0.000040>
> read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000062>
> read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
> read(3, " 0 0 arch/x86/k"..., 131072) = 4029 <0.000061>
> read(3, " 64 1 kernel/wor"..., 131072) = 4064 <0.000061>
> read(3, " 0 0 kernel/sch"..., 131072) = 4041 <0.000051>
> read(3, " 4992 39 kernel/irq"..., 131072) = 4061 <0.000070>
> read(3, " 0 0 kernel/fut"..., 131072) = 4060 <0.000071>
> read(3, " 0 0 kernel/aud"..., 131072) = 4025 <0.000122>
> read(3, " 1536 3 kernel/tra"..., 131072) = 4014 <0.000064>
> read(3, " 0 0 kernel/tra"..., 131072) = 4079 <0.000067>
> read(3, " 0 0 kernel/tra"..., 131072) = 4035 <0.000072>
> read(3, " 0 0 kernel/bpf"..., 131072) = 4044 <0.000074>
> read(3, " 0 0 kernel/bpf"..., 131072) = 4061 <0.000063>
> read(3, " 0 0 ./include/"..., 131072) = 4084 <0.000063>
> read(3, " 0 0 kernel/eve"..., 131072) = 4084 <0.000071>
> read(3, " 278528 68 mm/memory."..., 131072) = 4068 <0.000070>
> read(3, " 0 0 mm/zswap.c"..., 131072) = 4091 <0.000082>
> read(3, " 0 0 mm/balloon"..., 131072) = 4039 <0.000083>
> read(3, " 0 0 fs/splice."..., 131072) = 4095 <0.000083>
> read(3, " 0 0 fs/binfmt_"..., 131072) = 4039 <0.000075>
> read(3, " 192 3 fs/proc/kc"..., 131072) = 4088 <0.000076>
> read(3, " 0 0 ipc/mqueue"..., 131072) = 4017 <0.000076>
> read(3, " 0 0 security/s"..., 131072) = 4075 <0.000072>
> read(3, " 0 0 security/s"..., 131072) = 4029 <0.000084>
> read(3, " 0 0 security/t"..., 131072) = 4053 <0.000076>
> read(3, " 0 0 security/a"..., 131072) = 4054 <0.000078>
> read(3, " 0 0 crypto/rsa"..., 131072) = 4046 <0.000081>
> read(3, " 0 0 crypto/asy"..., 131072) = 4079 <0.000081>
> read(3, " 1728 9 block/blk-"..., 131072) = 4034 <0.000084>
> read(3, " 0 0 io_uring/r"..., 131072) = 4044 <0.000085>
> read(3, " 0 0 lib/crypto"..., 131072) = 4056 <0.000363>
> read(3, " 0 0 lib/pldmfw"..., 131072) = 4079 <0.000091>
> read(3, " 384 6 drivers/pc"..., 131072) = 4091 <0.000098>
> read(3, " 0 0 drivers/vi"..., 131072) = 4068 <0.000093>
> read(3, " 4096 1 drivers/ac"..., 131072) = 4086 <0.000089>
> read(3, " 11040 230 drivers/ac"..., 131072) = 4001 <0.000090>
> read(3, " 0 0 drivers/ac"..., 131072) = 4030 <0.000094>
> read(3, " 0 0 drivers/dm"..., 131072) = 4031 <0.000094>
> read(3, " 0 0 drivers/xe"..., 131072) = 4085 <0.000097>
> read(3, " 4224 33 drivers/tt"..., 131072) = 4086 <0.000097>
> read(3, " 0 0 drivers/io"..., 131072) = 4038 <0.000097>
> read(3, " 0 0 drivers/io"..., 131072) = 4034 <0.000100>
> read(3, " 64 2 drivers/ba"..., 131072) = 4094 <0.000110>
> read(3, " 0 0 drivers/ba"..., 131072) = 4025 <0.000100>
> read(3, " 0 0 drivers/dm"..., 131072) = 4077 <0.000106>
> read(3, " 0 0 drivers/i2"..., 131072) = 4049 <0.000103>
> read(3, " 0 0 drivers/ed"..., 131072) = 4035 <0.000108>
> read(3, " 0 0 drivers/fi"..., 131072) = 4052 <0.000107>
> read(3, " 0 0 net/core/s"..., 131072) = 4057 <0.000119>
> read(3, " 5248 82 net/core/r"..., 131072) = 4050 <0.000110>
> read(3, " 0 0 net/core/d"..., 131072) = 4047 <0.000111>
> read(3, " 0 0 net/ethtoo"..., 131072) = 4059 <0.000112>
> read(3, " 32 1 net/ipv4/r"..., 131072) = 4052 <0.000116>
> read(3, " 1920 10 net/ipv4/i"..., 131072) = 4040 <0.000116>
> read(3, " 0 0 net/xfrm/x"..., 131072) = 4071 <0.000118>
> read(3, " 0 0 net/ipv6/i"..., 131072) = 4053 <0.000120>
> read(3, " 0 0 net/devlin"..., 131072) = 4064 <0.000120>
> read(3, " 0 0 lib/decomp"..., 131072) = 3996 <0.000123>
> read(3, " 0 0 drivers/vi"..., 131072) = 4043 <0.000121>
> read(3, " 0 0 drivers/us"..., 131072) = 4045 <0.000123>
> read(3, " 0 0 drivers/us"..., 131072) = 4020 <0.000126>
> read(3, " 64 1 drivers/sc"..., 131072) = 4043 <0.000128>
> read(3, " 0 0 drivers/ne"..., 131072) = 4094 <0.000130>
> read(3, " 0 0 drivers/hi"..., 131072) = 4025 <0.000132>
> read(3, " 672 6 fs/ext4/mb"..., 131072) = 4071 <0.000132>
> read(3, " 0 0 fs/autofs/"..., 131072) = 4038 <0.000134>
> read(3, " 0 0 fs/fuse/fi"..., 131072) = 4093 <0.000134>
> read(3, " 0 0 drivers/gp"..., 131072) = 4035 <0.000133>
> read(3, " 0 0 drivers/gp"..., 131072) = 4046 <0.000135>
> read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
> read(3, " 0 0 sound/core"..., 131072) = 4000 <0.000134>
> read(3, " 2048 1 drivers/vi"..., 131072) = 4078 <0.000138>
> read(3, " 3328 104 drivers/gp"..., 131072) = 4040 <0.000143>
> read(3, " 0 0 fs/overlay"..., 131072) = 4075 <0.000141>
> read(3, " 0 0 net/netfil"..., 131072) = 4061 <0.000139>
> read(3, " 0 0 net/xfrm/x"..., 131072) = 2022 <0.000129>
> read(3, "", 131072) = 0 <0.000114>
>
> And with this change:
> $ sudo strace -T -e read cat /proc/allocinfo > /dev/null
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@>\2\0\0\0\0\0"..., 832) = 832 <0.000018>
> read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000066>
> read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
> read(3, " 0 0 arch/x86/k"..., 131072) = 4029 <0.000062>
> read(3, " 64 1 kernel/wor"..., 131072) = 4064 <0.000064>
> read(3, " 0 0 kernel/sch"..., 131072) = 4041 <0.000046>
> read(3, " 4992 39 kernel/irq"..., 131072) = 4061 <0.000053>
> read(3, " 0 0 kernel/fut"..., 131072) = 4060 <0.000054>
> read(3, " 0 0 kernel/aud"..., 131072) = 4025 <0.000053>
> read(3, " 1536 3 kernel/tra"..., 131072) = 4014 <0.000051>
> read(3, " 0 0 kernel/tra"..., 131072) = 4079 <0.000050>
> read(3, " 0 0 kernel/tra"..., 131072) = 4035 <0.000051>
> read(3, " 0 0 kernel/bpf"..., 131072) = 4044 <0.000053>
> read(3, " 0 0 kernel/bpf"..., 131072) = 4061 <0.000053>
> read(3, " 0 0 ./include/"..., 131072) = 4084 <0.000052>
> read(3, " 0 0 kernel/eve"..., 131072) = 4084 <0.000072>
> read(3, " 286720 70 mm/memory."..., 131072) = 4068 <0.000071>
> read(3, " 0 0 mm/zswap.c"..., 131072) = 4091 <0.000056>
> read(3, " 0 0 mm/balloon"..., 131072) = 4039 <0.000056>
> read(3, " 0 0 fs/splice."..., 131072) = 4095 <0.000076>
> read(3, " 0 0 fs/binfmt_"..., 131072) = 4039 <0.000075>
> read(3, " 192 3 fs/proc/kc"..., 131072) = 4088 <0.000057>
> read(3, " 0 0 ipc/mqueue"..., 131072) = 4017 <0.000055>
> read(3, " 0 0 security/s"..., 131072) = 4075 <0.000046>
> read(3, " 0 0 security/s"..., 131072) = 4029 <0.000045>
> read(3, " 0 0 security/t"..., 131072) = 4053 <0.000050>
> read(3, " 0 0 security/a"..., 131072) = 4054 <0.000060>
> read(3, " 0 0 crypto/rsa"..., 131072) = 4046 <0.000181>
> read(3, " 0 0 crypto/asy"..., 131072) = 4079 <0.000051>
> read(3, " 1728 9 block/blk-"..., 131072) = 4034 <0.000070>
> read(3, " 0 0 io_uring/r"..., 131072) = 4044 <0.000078>
> read(3, " 0 0 lib/crypto"..., 131072) = 4066 <0.000067>
> read(3, " 0 0 arch/x86/l"..., 131072) = 4085 <0.000066>
> read(3, " 7168 7 drivers/pc"..., 131072) = 4090 <0.000066>
> read(3, " 0 0 drivers/vi"..., 131072) = 4063 <0.000065>
> read(3, " 1024 1 drivers/ac"..., 131072) = 4094 <0.000067>
> read(3, " 0 0 drivers/ac"..., 131072) = 4023 <0.000063>
> read(3, " 2048 2 drivers/ac"..., 131072) = 4020 <0.000066>
> read(3, " 0 0 drivers/dm"..., 131072) = 4094 <0.000067>
> read(3, " 0 0 drivers/xe"..., 131072) = 4087 <0.000068>
> read(3, " 1024 1 drivers/tt"..., 131072) = 4093 <0.000056>
> read(3, " 0 0 drivers/io"..., 131072) = 4037 <0.000044>
> read(3, " 0 0 drivers/io"..., 131072) = 4095 <0.000061>
> read(3, " 0 0 drivers/ba"..., 131072) = 4045 <0.000059>
> read(3, " 0 0 drivers/ba"..., 131072) = 4011 <0.000057>
> read(3, " 0 0 drivers/ma"..., 131072) = 4087 <0.000064>
> read(3, " 0 0 drivers/i2"..., 131072) = 4028 <0.000059>
> read(3, " 0 0 drivers/ed"..., 131072) = 4048 <0.000080>
> read(3, " 0 0 drivers/fi"..., 131072) = 4082 <0.000054>
> read(3, " 0 0 net/core/s"..., 131072) = 4091 <0.000068>
> read(3, " 0 0 net/core/f"..., 131072) = 4041 <0.000053>
> read(3, " 20 1 net/sched/"..., 131072) = 4054 <0.000051>
> read(3, " 0 0 net/ethtoo"..., 131072) = 4045 <0.000051>
> read(3, " 0 0 net/ipv4/r"..., 131072) = 4040 <0.000051>
> read(3, " 2048 1 net/ipv4/f"..., 131072) = 4056 <0.000050>
> read(3, " 0 0 net/xfrm/x"..., 131072) = 4053 <0.000056>
> read(3, " 3072 6 net/ipv6/i"..., 131072) = 4095 <0.000060>
> read(3, " 0 0 net/devlin"..., 131072) = 4094 <0.000064>
> read(3, " 0 0 lib/decomp"..., 131072) = 4023 <0.000063>
> read(3, " 2048 2 drivers/us"..., 131072) = 4053 <0.000062>
> read(3, " 0 0 drivers/us"..., 131072) = 4011 <0.000068>
> read(3, " 0 0 drivers/in"..., 131072) = 4092 <0.000059>
> read(3, " 64 1 drivers/sc"..., 131072) = 4067 <0.000064>
> read(3, " 0 0 drivers/ne"..., 131072) = 4059 <0.000067>
> read(3, " 8 1 drivers/hi"..., 131072) = 4050 <0.000052>
> read(3, " 256 1 fs/ext4/mb"..., 131072) = 4024 <0.000070>
> read(3, " 0 0 fs/autofs/"..., 131072) = 4082 <0.000070>
> read(3, " 0 0 fs/fuse/in"..., 131072) = 4074 <0.000054>
> read(3, " 0 0 drivers/gp"..., 131072) = 4066 <0.000063>
> read(3, " 0 0 drivers/gp"..., 131072) = 4091 <0.000052>
> read(3, " 0 0 drivers/gp"..., 131072) = 4029 <0.000064>
> read(3, " 0 0 drivers/gp"..., 131072) = 4094 <0.000062>
> read(3, " 128 2 sound/core"..., 131072) = 4069 <0.000068>
> read(3, " 64 1 sound/pci/"..., 131072) = 4074 <0.000069>
> read(3, " 384 4 net/bridge"..., 131072) = 4021 <0.000054>
> read(3, " 0 0 net/netfil"..., 131072) = 4084 <0.000050>
> read(3, " 0 0 net/xfrm/x"..., 131072) = 1513 <0.000047>
> read(3, "", 131072) = 0 <0.000030>
>
> Signed-off-by: David Wang <00107082@163.com>
> ---
> lib/alloc_tag.c | 30 +++++++++++-------------------
> 1 file changed, 11 insertions(+), 19 deletions(-)
>
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 25ecc1334b67..ec83cf798209 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -45,21 +45,17 @@ struct allocinfo_private {
> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> {
> struct allocinfo_private *priv;
> - struct codetag *ct;
> loff_t node = *pos;
>
> - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> - m->private = priv;
> - if (!priv)
> - return NULL;
> -
> - priv->print_header = (node == 0);
> + priv = (struct allocinfo_private *)m->private;
> codetag_lock_module_list(alloc_tag_cttype, true);
> - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> - node--;
> -
> - return ct ? priv : NULL;
> + if (node == 0) {
> + priv->print_header = true;
> + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> + codetag_next_ct(&priv->iter);
> + }
> + // TODO: need to check iter is valid, or rewinding it.
> + return priv->iter.ct ? priv : NULL;
> }
>
> static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
> @@ -76,12 +72,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
>
> static void allocinfo_stop(struct seq_file *m, void *arg)
> {
> - struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
> -
> - if (priv) {
> - codetag_lock_module_list(alloc_tag_cttype, false);
> - kfree(priv);
> - }
> + codetag_lock_module_list(alloc_tag_cttype, false);
> }
>
> static void print_allocinfo_header(struct seq_buf *buf)
> @@ -249,7 +240,8 @@ static void __init procfs_init(void)
> if (!mem_profiling_support)
> return;
>
> - if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
> + if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> + sizeof(struct allocinfo_private), NULL)) {
> pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> shutdown_mem_profiling(false);
> }
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-07 18:19 ` David Wang
@ 2025-05-07 23:42 ` Suren Baghdasaryan
2025-05-08 0:01 ` Kent Overstreet
2025-05-08 2:24 ` David Wang
0 siblings, 2 replies; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-07 23:42 UTC (permalink / raw)
To: David Wang; +Cc: kent.overstreet, akpm, linux-mm, linux-kernel
On Wed, May 7, 2025 at 6:19 PM David Wang <00107082@163.com> wrote:
>
> Hi,
> Just want to share how I notice those memory allocation behaivors: the cumulative counters~!
>
> With cumulative counters, I can identify which module keeps alloc/free memory, by the ratio between
> cumulative calls and remaining calls, and maybe an optimization could be applied.
> Following is top16 I got on my system:
>
> +-----------------------------------------+-------+------------------+--------------------+
> | alloc | calls | cumulative calls | ratio |
> +-----------------------------------------+-------+------------------+--------------------+
> | fs/seq_file.c:584 | 2 | 18064825 | 9032412.5 |
> | fs/seq_file.c:38 | 5 | 18148288 | 3629657.6 |
> | fs/seq_file.c:63 | 15 | 18153271 | 1210218.0666666667 |
> | net/core/skbuff.c:577 | 9 | 10679975 | 1186663.888888889 |
> | net/core/skbuff.c:658 | 21 | 11013437 | 524449.380952381 |
> | fs/select.c:168 | 7 | 2831226 | 404460.85714285716 |
> | lib/alloc_tag.c:51 | 1 | 340649 | 340649.0 | <--- Here I started
> | kernel/signal.c:455 | 1 | 300730 | 300730.0 |
> | fs/notify/inotify/inotify_fsnotify.c:96 | 1 | 249831 | 249831.0 |
> | fs/ext4/dir.c:675 | 3 | 519734 | 173244.66666666666 |
> | drivers/usb/host/xhci.c:1555 | 4 | 126402 | 31600.5 |
> | fs/locks.c:275 | 36 | 986957 | 27415.472222222223 |
> | fs/proc/inode.c:502 | 3 | 63753 | 21251.0 |
> | fs/pipe.c:125 | 123 | 2143378 | 17425.837398373984 |
> | net/core/scm.c:84 | 3 | 43267 | 14422.333333333334 |
> | fs/kernel_read_file.c:80 | 2 | 26910 | 13455.0 |
> +-----------------------------------------+-------+------------------+--------------------+
>
> I think this is another "good" usage for cumulative counters: if a module just keeps alloc/free memory,
> maybe it is good to move the memory alloc/free to somewhere less frequent.
>
> In the case of this patch, a memory allocation for each read-calls, can be moved to opan-calls.
>
> If interested, I can re-send the patch for cumulative counters for further discussions.
Yeah, my issue with cumulative counters is that while they might be
useful for some analyses, most use cases would probably not benefit
from them while still sharing the performance overhead. OTOH making
it optional with a separate CONFIG that affects the content of
/proc/allocinfo seems like a bad idea to me. Userspace parsers would
then have to check not only the file version but also whether this
kernel config is enabled, or handle the possibility of an additional
column in the output. Does not seem like a good solution to me.
All that said, I'm open to suggestions if there is a way to
incorporate cumulative counters that would not tax all other usecases
that do not need them.
>
>
> FYI
> David
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-07 23:42 ` [PATCH] " Suren Baghdasaryan
@ 2025-05-08 0:01 ` Kent Overstreet
2025-05-08 3:06 ` David Wang
2025-05-08 2:24 ` David Wang
1 sibling, 1 reply; 44+ messages in thread
From: Kent Overstreet @ 2025-05-08 0:01 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: David Wang, akpm, linux-mm, linux-kernel
On Wed, May 07, 2025 at 11:42:56PM +0000, Suren Baghdasaryan wrote:
> On Wed, May 7, 2025 at 6:19 PM David Wang <00107082@163.com> wrote:
> >
> > Hi,
> > Just want to share how I notice those memory allocation behaivors: the cumulative counters~!
> >
> > With cumulative counters, I can identify which module keeps alloc/free memory, by the ratio between
> > cumulative calls and remaining calls, and maybe an optimization could be applied.
> > Following is top16 I got on my system:
> >
> > +-----------------------------------------+-------+------------------+--------------------+
> > | alloc | calls | cumulative calls | ratio |
> > +-----------------------------------------+-------+------------------+--------------------+
> > | fs/seq_file.c:584 | 2 | 18064825 | 9032412.5 |
> > | fs/seq_file.c:38 | 5 | 18148288 | 3629657.6 |
> > | fs/seq_file.c:63 | 15 | 18153271 | 1210218.0666666667 |
> > | net/core/skbuff.c:577 | 9 | 10679975 | 1186663.888888889 |
> > | net/core/skbuff.c:658 | 21 | 11013437 | 524449.380952381 |
> > | fs/select.c:168 | 7 | 2831226 | 404460.85714285716 |
> > | lib/alloc_tag.c:51 | 1 | 340649 | 340649.0 | <--- Here I started
> > | kernel/signal.c:455 | 1 | 300730 | 300730.0 |
> > | fs/notify/inotify/inotify_fsnotify.c:96 | 1 | 249831 | 249831.0 |
> > | fs/ext4/dir.c:675 | 3 | 519734 | 173244.66666666666 |
> > | drivers/usb/host/xhci.c:1555 | 4 | 126402 | 31600.5 |
> > | fs/locks.c:275 | 36 | 986957 | 27415.472222222223 |
> > | fs/proc/inode.c:502 | 3 | 63753 | 21251.0 |
> > | fs/pipe.c:125 | 123 | 2143378 | 17425.837398373984 |
> > | net/core/scm.c:84 | 3 | 43267 | 14422.333333333334 |
> > | fs/kernel_read_file.c:80 | 2 | 26910 | 13455.0 |
> > +-----------------------------------------+-------+------------------+--------------------+
> >
> > I think this is another "good" usage for cumulative counters: if a module just keeps alloc/free memory,
> > maybe it is good to move the memory alloc/free to somewhere less frequent.
> >
> > In the case of this patch, a memory allocation for each read-calls, can be moved to opan-calls.
> >
> > If interested, I can re-send the patch for cumulative counters for further discussions.
>
> Yeah, my issue with cumulative counters is that while they might be
> useful for some analyses, most usecases would probably not benefit
> from them while sharing the performance overhead. OTOH making it
> optional with a separate CONFIG that affects the content of the
> /proc/allocinfo seems like a bad idea to me. Userspace parsers now
> would have to check not only the file version but also whether this
> kernel config is enabled, or handle a possibility of an additional
> column in the output. Does not seem like a good solution to me.
Yeah, I don't see much benefit for cumulative counters over just running
a profiler.
Running a profiler is always the first thing you should do when you care
about CPU usage, that's always the thing that will give you the best
overall picture. If memory allocations are an issue, they'll show up
there.
But generally they're not, because slub is _really damn fast_. People
generally worry about memory allocation overhead a bit too much.
(Memory _layout_, otoh - avoiding pointer chasing - is always worth
worrying about, but cumulative counters won't show you that.)
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-07 23:42 ` [PATCH] " Suren Baghdasaryan
2025-05-08 0:01 ` Kent Overstreet
@ 2025-05-08 2:24 ` David Wang
1 sibling, 0 replies; 44+ messages in thread
From: David Wang @ 2025-05-08 2:24 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: kent.overstreet, akpm, linux-mm, linux-kernel
At 2025-05-08 07:42:56, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Wed, May 7, 2025 at 6:19 PM David Wang <00107082@163.com> wrote:
>>
>> Hi,
>> Just want to share how I notice those memory allocation behaivors: the cumulative counters~!
>>
>> With cumulative counters, I can identify which module keeps alloc/free memory, by the ratio between
>> cumulative calls and remaining calls, and maybe an optimization could be applied.
>> Following is top16 I got on my system:
>>
>> +-----------------------------------------+-------+------------------+--------------------+
>> | alloc | calls | cumulative calls | ratio |
>> +-----------------------------------------+-------+------------------+--------------------+
>> | fs/seq_file.c:584 | 2 | 18064825 | 9032412.5 |
>> | fs/seq_file.c:38 | 5 | 18148288 | 3629657.6 |
>> | fs/seq_file.c:63 | 15 | 18153271 | 1210218.0666666667 |
>> | net/core/skbuff.c:577 | 9 | 10679975 | 1186663.888888889 |
>> | net/core/skbuff.c:658 | 21 | 11013437 | 524449.380952381 |
>> | fs/select.c:168 | 7 | 2831226 | 404460.85714285716 |
>> | lib/alloc_tag.c:51 | 1 | 340649 | 340649.0 | <--- Here I started
>> | kernel/signal.c:455 | 1 | 300730 | 300730.0 |
>> | fs/notify/inotify/inotify_fsnotify.c:96 | 1 | 249831 | 249831.0 |
>> | fs/ext4/dir.c:675 | 3 | 519734 | 173244.66666666666 |
>> | drivers/usb/host/xhci.c:1555 | 4 | 126402 | 31600.5 |
>> | fs/locks.c:275 | 36 | 986957 | 27415.472222222223 |
>> | fs/proc/inode.c:502 | 3 | 63753 | 21251.0 |
>> | fs/pipe.c:125 | 123 | 2143378 | 17425.837398373984 |
>> | net/core/scm.c:84 | 3 | 43267 | 14422.333333333334 |
>> | fs/kernel_read_file.c:80 | 2 | 26910 | 13455.0 |
>> +-----------------------------------------+-------+------------------+--------------------+
>>
>> I think this is another "good" usage for cumulative counters: if a module just keeps alloc/free memory,
>> maybe it is good to move the memory alloc/free to somewhere less frequent.
>>
>> In the case of this patch, a memory allocation for each read-calls, can be moved to opan-calls.
>>
>> If interested, I can re-send the patch for cumulative counters for further discussions.
>
>Yeah, my issue with cumulative counters is that while they might be
>useful for some analyses, most usecases would probably not benefit
>from them while sharing the performance overhead. OTOH making it
>optional with a separate CONFIG that affects the content of the
>/proc/allocinfo seems like a bad idea to me. Userspace parsers now
>would have to check not only the file version but also whether this
>kernel config is enabled, or handle a possibility of an additional
>column in the output. Does not seem like a good solution to me.
Thanks for the quick feedback~
Agreed that this would cause trouble for userspace tools,
and it would also add more performance overhead to profiling.
David
>
>All that said, I'm open to suggestions if there is a way to
>incorporate cumulative counters that would not tax all other usecases
>that do not need them.
>
>>
>>
>> FYI
>> David
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 0:01 ` Kent Overstreet
@ 2025-05-08 3:06 ` David Wang
2025-05-08 3:31 ` Kent Overstreet
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-08 3:06 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
At 2025-05-08 08:01:55, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Wed, May 07, 2025 at 11:42:56PM +0000, Suren Baghdasaryan wrote:
>> On Wed, May 7, 2025 at 6:19 PM David Wang <00107082@163.com> wrote:
>> >
>> > Hi,
>> > Just want to share how I notice those memory allocation behaivors: the cumulative counters~!
>> >
>> > With cumulative counters, I can identify which module keeps alloc/free memory, by the ratio between
>> > cumulative calls and remaining calls, and maybe an optimization could be applied.
>> > Following is top16 I got on my system:
>> >
>> > +-----------------------------------------+-------+------------------+--------------------+
>> > | alloc | calls | cumulative calls | ratio |
>> > +-----------------------------------------+-------+------------------+--------------------+
>> > | fs/seq_file.c:584 | 2 | 18064825 | 9032412.5 |
>> > | fs/seq_file.c:38 | 5 | 18148288 | 3629657.6 |
>> > | fs/seq_file.c:63 | 15 | 18153271 | 1210218.0666666667 |
>> > | net/core/skbuff.c:577 | 9 | 10679975 | 1186663.888888889 |
>> > | net/core/skbuff.c:658 | 21 | 11013437 | 524449.380952381 |
>> > | fs/select.c:168 | 7 | 2831226 | 404460.85714285716 |
>> > | lib/alloc_tag.c:51 | 1 | 340649 | 340649.0 | <--- Here I started
>> > | kernel/signal.c:455 | 1 | 300730 | 300730.0 |
>> > | fs/notify/inotify/inotify_fsnotify.c:96 | 1 | 249831 | 249831.0 |
>> > | fs/ext4/dir.c:675 | 3 | 519734 | 173244.66666666666 |
>> > | drivers/usb/host/xhci.c:1555 | 4 | 126402 | 31600.5 |
>> > | fs/locks.c:275 | 36 | 986957 | 27415.472222222223 |
>> > | fs/proc/inode.c:502 | 3 | 63753 | 21251.0 |
>> > | fs/pipe.c:125 | 123 | 2143378 | 17425.837398373984 |
>> > | net/core/scm.c:84 | 3 | 43267 | 14422.333333333334 |
>> > | fs/kernel_read_file.c:80 | 2 | 26910 | 13455.0 |
>> > +-----------------------------------------+-------+------------------+--------------------+
>> >
>> > I think this is another "good" usage for cumulative counters: if a module just keeps alloc/free memory,
>> > maybe it is good to move the memory alloc/free to somewhere less frequent.
>> >
>> > In the case of this patch, a memory allocation for each read-calls, can be moved to opan-calls.
>> >
>> > If interested, I can re-send the patch for cumulative counters for further discussions.
>>
>> Yeah, my issue with cumulative counters is that while they might be
>> useful for some analyses, most usecases would probably not benefit
>> from them while sharing the performance overhead. OTOH making it
>> optional with a separate CONFIG that affects the content of the
>> /proc/allocinfo seems like a bad idea to me. Userspace parsers now
>> would have to check not only the file version but also whether this
>> kernel config is enabled, or handle a possibility of an additional
>> column in the output. Does not seem like a good solution to me.
>
>Yeah, I don't see much benefit for cumulative counters over just running
>a profiler.
>
>Running a profiler is always the first thing you should do when you care
>about CPU usage, that's always the thing that will give you the best
>overall picture. If memory allocations are an issue, they'll show up
>there.
>
>But generally they're not, because slub is _really damn fast_. People
>generally worry about memory allocation overhead a bit too much.
>
>(Memory _layout_, otoh, avoid pointer chasing - that's always worth
>worrying about, but cumulative counters won't show you that).
Thanks for the feedback~
I agree that memory allocation normally does not take a major part of a profiling report;
even when profiling a fio test, kmem_cache_alloc only accounts for ~1% of perf samples.
I don't know why I have this "the less memory allocation, the better" mindset; maybe
I was worrying about memory fragmentation, or something else I learned from some textbook.
To be honest, I have never had real experience with those worries....
David
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-07 23:36 ` Suren Baghdasaryan
@ 2025-05-08 3:10 ` David Wang
2025-05-08 15:32 ` David Wang
1 sibling, 0 replies; 44+ messages in thread
From: David Wang @ 2025-05-08 3:10 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: kent.overstreet, akpm, linux-mm, linux-kernel
At 2025-05-08 07:36:14, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Wed, May 7, 2025 at 5:55 PM David Wang <00107082@163.com> wrote:
>>
>> ---
>> The patch is not complete, just want to get feedbacks whether this
>> worth carrying on.
>
>In such cases it's customary to mark the patch as RFC which saves you
>time on explaining your motivation :)
roger that~ :)
>
>> ---
>> When reading /proc/allocinfo, for each read syscall, seq_file would
>> invoke start/stop callbacks. In start callback, a memory is alloced
>> to store iterator and the iterator would restart from beginning to
>> walk to its previous position.
>> Each seq_file read() takes at most 4096 bytes, even read with a larger
>> user space buffer, meaning read out /proc/allocinfo, tens of read
>> syscalls are needed. For example, a 306036 bytes allocinfo files need
>> 76 reads:
>>
>> $ sudo cat /proc/allocinfo | wc
>> 3964 16678 306036
>>
>> For those n=3964 lines, each read takes about m=3964/76=52 lines,
>> the iter would be rewinding:
>> m steps on 1st read,
>> 2*m steps on 2nd read
>> 3*m steps on 3rd read
>> ...
>> n steps on the last read
>> totally, the iterator would be iterated O(n*n/m) times.
>> (Each read would take more time than previous one.)
>>
>> To use a private data alloced when /proc/allocinfo is opened,
>> the n/m memory alloction could be avoid, and there is no need
>> to restart the iterator from very beginning everytime.
>> So only 1 memory allocation and n steps for iterating are needed.
>> (Only when module changed, the iterator should be invalidated and
>> restart.)
>
>Yeah, your change makes sense and looks like a good optimization. From
>a quick look at the code, codetag_next_ct() should handle the case
>when a module gets removed from under us while we are not holding
>cttype->mod_lock. I'll need to take another closer look at it once you
>post an official patch.
>Thanks!
I think testing module insert/remove would be a major effort.
I will take time to do it and update soon.
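A simple stress loop along these lines might already catch the
obvious races (illustrative only; any module that can be safely
loaded and unloaded repeatedly would do):

$ while true; do sudo modprobe dummy; sudo modprobe -r dummy; done &
$ while true; do sudo cat /proc/allocinfo > /dev/null; done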
Thanks
>
>>
>> Timings before:
>> $ time cat /proc/allocinfo > /dev/null
>>
>> real 0m0.007s
>> user 0m0.000s
>> sys 0m0.007s
>> read-syscalls get slower and slower:
>> read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000062>
>> ...
>> read(3, " 0 0 drivers/gp"..., 131072) = 4046 <0.000135>
>> read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
>> ...
>>
>> and with the change:
>> $ time cat /proc/allocinfo > /dev/null
>>
>> real 0m0.004s
>> user 0m0.000s
>> sys 0m0.003s
>> 7ms drop to 4ms, more friendly to monitoring tools.
>>
>> Detailed strace stats before the change:
>> $ sudo strace -T -e read cat /proc/allocinfo > /dev/null
>> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@>\2\0\0\0\0\0"..., 832) = 832 <0.000040>
>> read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000062>
>> read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
>> read(3, " 0 0 arch/x86/k"..., 131072) = 4029 <0.000061>
>> read(3, " 64 1 kernel/wor"..., 131072) = 4064 <0.000061>
>> read(3, " 0 0 kernel/sch"..., 131072) = 4041 <0.000051>
>> read(3, " 4992 39 kernel/irq"..., 131072) = 4061 <0.000070>
>> read(3, " 0 0 kernel/fut"..., 131072) = 4060 <0.000071>
>> read(3, " 0 0 kernel/aud"..., 131072) = 4025 <0.000122>
>> read(3, " 1536 3 kernel/tra"..., 131072) = 4014 <0.000064>
>> read(3, " 0 0 kernel/tra"..., 131072) = 4079 <0.000067>
>> read(3, " 0 0 kernel/tra"..., 131072) = 4035 <0.000072>
>> read(3, " 0 0 kernel/bpf"..., 131072) = 4044 <0.000074>
>> read(3, " 0 0 kernel/bpf"..., 131072) = 4061 <0.000063>
>> read(3, " 0 0 ./include/"..., 131072) = 4084 <0.000063>
>> read(3, " 0 0 kernel/eve"..., 131072) = 4084 <0.000071>
>> read(3, " 278528 68 mm/memory."..., 131072) = 4068 <0.000070>
>> read(3, " 0 0 mm/zswap.c"..., 131072) = 4091 <0.000082>
>> read(3, " 0 0 mm/balloon"..., 131072) = 4039 <0.000083>
>> read(3, " 0 0 fs/splice."..., 131072) = 4095 <0.000083>
>> read(3, " 0 0 fs/binfmt_"..., 131072) = 4039 <0.000075>
>> read(3, " 192 3 fs/proc/kc"..., 131072) = 4088 <0.000076>
>> read(3, " 0 0 ipc/mqueue"..., 131072) = 4017 <0.000076>
>> read(3, " 0 0 security/s"..., 131072) = 4075 <0.000072>
>> read(3, " 0 0 security/s"..., 131072) = 4029 <0.000084>
>> read(3, " 0 0 security/t"..., 131072) = 4053 <0.000076>
>> read(3, " 0 0 security/a"..., 131072) = 4054 <0.000078>
>> read(3, " 0 0 crypto/rsa"..., 131072) = 4046 <0.000081>
>> read(3, " 0 0 crypto/asy"..., 131072) = 4079 <0.000081>
>> read(3, " 1728 9 block/blk-"..., 131072) = 4034 <0.000084>
>> read(3, " 0 0 io_uring/r"..., 131072) = 4044 <0.000085>
>> read(3, " 0 0 lib/crypto"..., 131072) = 4056 <0.000363>
>> read(3, " 0 0 lib/pldmfw"..., 131072) = 4079 <0.000091>
>> read(3, " 384 6 drivers/pc"..., 131072) = 4091 <0.000098>
>> read(3, " 0 0 drivers/vi"..., 131072) = 4068 <0.000093>
>> read(3, " 4096 1 drivers/ac"..., 131072) = 4086 <0.000089>
>> read(3, " 11040 230 drivers/ac"..., 131072) = 4001 <0.000090>
>> read(3, " 0 0 drivers/ac"..., 131072) = 4030 <0.000094>
>> read(3, " 0 0 drivers/dm"..., 131072) = 4031 <0.000094>
>> read(3, " 0 0 drivers/xe"..., 131072) = 4085 <0.000097>
>> read(3, " 4224 33 drivers/tt"..., 131072) = 4086 <0.000097>
>> read(3, " 0 0 drivers/io"..., 131072) = 4038 <0.000097>
>> read(3, " 0 0 drivers/io"..., 131072) = 4034 <0.000100>
>> read(3, " 64 2 drivers/ba"..., 131072) = 4094 <0.000110>
>> read(3, " 0 0 drivers/ba"..., 131072) = 4025 <0.000100>
>> read(3, " 0 0 drivers/dm"..., 131072) = 4077 <0.000106>
>> read(3, " 0 0 drivers/i2"..., 131072) = 4049 <0.000103>
>> read(3, " 0 0 drivers/ed"..., 131072) = 4035 <0.000108>
>> read(3, " 0 0 drivers/fi"..., 131072) = 4052 <0.000107>
>> read(3, " 0 0 net/core/s"..., 131072) = 4057 <0.000119>
>> read(3, " 5248 82 net/core/r"..., 131072) = 4050 <0.000110>
>> read(3, " 0 0 net/core/d"..., 131072) = 4047 <0.000111>
>> read(3, " 0 0 net/ethtoo"..., 131072) = 4059 <0.000112>
>> read(3, " 32 1 net/ipv4/r"..., 131072) = 4052 <0.000116>
>> read(3, " 1920 10 net/ipv4/i"..., 131072) = 4040 <0.000116>
>> read(3, " 0 0 net/xfrm/x"..., 131072) = 4071 <0.000118>
>> read(3, " 0 0 net/ipv6/i"..., 131072) = 4053 <0.000120>
>> read(3, " 0 0 net/devlin"..., 131072) = 4064 <0.000120>
>> read(3, " 0 0 lib/decomp"..., 131072) = 3996 <0.000123>
>> read(3, " 0 0 drivers/vi"..., 131072) = 4043 <0.000121>
>> read(3, " 0 0 drivers/us"..., 131072) = 4045 <0.000123>
>> read(3, " 0 0 drivers/us"..., 131072) = 4020 <0.000126>
>> read(3, " 64 1 drivers/sc"..., 131072) = 4043 <0.000128>
>> read(3, " 0 0 drivers/ne"..., 131072) = 4094 <0.000130>
>> read(3, " 0 0 drivers/hi"..., 131072) = 4025 <0.000132>
>> read(3, " 672 6 fs/ext4/mb"..., 131072) = 4071 <0.000132>
>> read(3, " 0 0 fs/autofs/"..., 131072) = 4038 <0.000134>
>> read(3, " 0 0 fs/fuse/fi"..., 131072) = 4093 <0.000134>
>> read(3, " 0 0 drivers/gp"..., 131072) = 4035 <0.000133>
>> read(3, " 0 0 drivers/gp"..., 131072) = 4046 <0.000135>
>> read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
>> read(3, " 0 0 sound/core"..., 131072) = 4000 <0.000134>
>> read(3, " 2048 1 drivers/vi"..., 131072) = 4078 <0.000138>
>> read(3, " 3328 104 drivers/gp"..., 131072) = 4040 <0.000143>
>> read(3, " 0 0 fs/overlay"..., 131072) = 4075 <0.000141>
>> read(3, " 0 0 net/netfil"..., 131072) = 4061 <0.000139>
>> read(3, " 0 0 net/xfrm/x"..., 131072) = 2022 <0.000129>
>> read(3, "", 131072) = 0 <0.000114>
>>
>> And with this change:
>> $ sudo strace -T -e read cat /proc/allocinfo > /dev/null
>> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@>\2\0\0\0\0\0"..., 832) = 832 <0.000018>
>> read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000066>
>> read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
>> read(3, " 0 0 arch/x86/k"..., 131072) = 4029 <0.000062>
>> read(3, " 64 1 kernel/wor"..., 131072) = 4064 <0.000064>
>> read(3, " 0 0 kernel/sch"..., 131072) = 4041 <0.000046>
>> read(3, " 4992 39 kernel/irq"..., 131072) = 4061 <0.000053>
>> read(3, " 0 0 kernel/fut"..., 131072) = 4060 <0.000054>
>> read(3, " 0 0 kernel/aud"..., 131072) = 4025 <0.000053>
>> read(3, " 1536 3 kernel/tra"..., 131072) = 4014 <0.000051>
>> read(3, " 0 0 kernel/tra"..., 131072) = 4079 <0.000050>
>> read(3, " 0 0 kernel/tra"..., 131072) = 4035 <0.000051>
>> read(3, " 0 0 kernel/bpf"..., 131072) = 4044 <0.000053>
>> read(3, " 0 0 kernel/bpf"..., 131072) = 4061 <0.000053>
>> read(3, " 0 0 ./include/"..., 131072) = 4084 <0.000052>
>> read(3, " 0 0 kernel/eve"..., 131072) = 4084 <0.000072>
>> read(3, " 286720 70 mm/memory."..., 131072) = 4068 <0.000071>
>> read(3, " 0 0 mm/zswap.c"..., 131072) = 4091 <0.000056>
>> read(3, " 0 0 mm/balloon"..., 131072) = 4039 <0.000056>
>> read(3, " 0 0 fs/splice."..., 131072) = 4095 <0.000076>
>> read(3, " 0 0 fs/binfmt_"..., 131072) = 4039 <0.000075>
>> read(3, " 192 3 fs/proc/kc"..., 131072) = 4088 <0.000057>
>> read(3, " 0 0 ipc/mqueue"..., 131072) = 4017 <0.000055>
>> read(3, " 0 0 security/s"..., 131072) = 4075 <0.000046>
>> read(3, " 0 0 security/s"..., 131072) = 4029 <0.000045>
>> read(3, " 0 0 security/t"..., 131072) = 4053 <0.000050>
>> read(3, " 0 0 security/a"..., 131072) = 4054 <0.000060>
>> read(3, " 0 0 crypto/rsa"..., 131072) = 4046 <0.000181>
>> read(3, " 0 0 crypto/asy"..., 131072) = 4079 <0.000051>
>> read(3, " 1728 9 block/blk-"..., 131072) = 4034 <0.000070>
>> read(3, " 0 0 io_uring/r"..., 131072) = 4044 <0.000078>
>> read(3, " 0 0 lib/crypto"..., 131072) = 4066 <0.000067>
>> read(3, " 0 0 arch/x86/l"..., 131072) = 4085 <0.000066>
>> read(3, " 7168 7 drivers/pc"..., 131072) = 4090 <0.000066>
>> read(3, " 0 0 drivers/vi"..., 131072) = 4063 <0.000065>
>> read(3, " 1024 1 drivers/ac"..., 131072) = 4094 <0.000067>
>> read(3, " 0 0 drivers/ac"..., 131072) = 4023 <0.000063>
>> read(3, " 2048 2 drivers/ac"..., 131072) = 4020 <0.000066>
>> read(3, " 0 0 drivers/dm"..., 131072) = 4094 <0.000067>
>> read(3, " 0 0 drivers/xe"..., 131072) = 4087 <0.000068>
>> read(3, " 1024 1 drivers/tt"..., 131072) = 4093 <0.000056>
>> read(3, " 0 0 drivers/io"..., 131072) = 4037 <0.000044>
>> read(3, " 0 0 drivers/io"..., 131072) = 4095 <0.000061>
>> read(3, " 0 0 drivers/ba"..., 131072) = 4045 <0.000059>
>> read(3, " 0 0 drivers/ba"..., 131072) = 4011 <0.000057>
>> read(3, " 0 0 drivers/ma"..., 131072) = 4087 <0.000064>
>> read(3, " 0 0 drivers/i2"..., 131072) = 4028 <0.000059>
>> read(3, " 0 0 drivers/ed"..., 131072) = 4048 <0.000080>
>> read(3, " 0 0 drivers/fi"..., 131072) = 4082 <0.000054>
>> read(3, " 0 0 net/core/s"..., 131072) = 4091 <0.000068>
>> read(3, " 0 0 net/core/f"..., 131072) = 4041 <0.000053>
>> read(3, " 20 1 net/sched/"..., 131072) = 4054 <0.000051>
>> read(3, " 0 0 net/ethtoo"..., 131072) = 4045 <0.000051>
>> read(3, " 0 0 net/ipv4/r"..., 131072) = 4040 <0.000051>
>> read(3, " 2048 1 net/ipv4/f"..., 131072) = 4056 <0.000050>
>> read(3, " 0 0 net/xfrm/x"..., 131072) = 4053 <0.000056>
>> read(3, " 3072 6 net/ipv6/i"..., 131072) = 4095 <0.000060>
>> read(3, " 0 0 net/devlin"..., 131072) = 4094 <0.000064>
>> read(3, " 0 0 lib/decomp"..., 131072) = 4023 <0.000063>
>> read(3, " 2048 2 drivers/us"..., 131072) = 4053 <0.000062>
>> read(3, " 0 0 drivers/us"..., 131072) = 4011 <0.000068>
>> read(3, " 0 0 drivers/in"..., 131072) = 4092 <0.000059>
>> read(3, " 64 1 drivers/sc"..., 131072) = 4067 <0.000064>
>> read(3, " 0 0 drivers/ne"..., 131072) = 4059 <0.000067>
>> read(3, " 8 1 drivers/hi"..., 131072) = 4050 <0.000052>
>> read(3, " 256 1 fs/ext4/mb"..., 131072) = 4024 <0.000070>
>> read(3, " 0 0 fs/autofs/"..., 131072) = 4082 <0.000070>
>> read(3, " 0 0 fs/fuse/in"..., 131072) = 4074 <0.000054>
>> read(3, " 0 0 drivers/gp"..., 131072) = 4066 <0.000063>
>> read(3, " 0 0 drivers/gp"..., 131072) = 4091 <0.000052>
>> read(3, " 0 0 drivers/gp"..., 131072) = 4029 <0.000064>
>> read(3, " 0 0 drivers/gp"..., 131072) = 4094 <0.000062>
>> read(3, " 128 2 sound/core"..., 131072) = 4069 <0.000068>
>> read(3, " 64 1 sound/pci/"..., 131072) = 4074 <0.000069>
>> read(3, " 384 4 net/bridge"..., 131072) = 4021 <0.000054>
>> read(3, " 0 0 net/netfil"..., 131072) = 4084 <0.000050>
>> read(3, " 0 0 net/xfrm/x"..., 131072) = 1513 <0.000047>
>> read(3, "", 131072) = 0 <0.000030>
>>
>> Signed-off-by: David Wang <00107082@163.com>
>> ---
>> lib/alloc_tag.c | 30 +++++++++++-------------------
>> 1 file changed, 11 insertions(+), 19 deletions(-)
>>
>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
>> index 25ecc1334b67..ec83cf798209 100644
>> --- a/lib/alloc_tag.c
>> +++ b/lib/alloc_tag.c
>> @@ -45,21 +45,17 @@ struct allocinfo_private {
>> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
>> {
>> struct allocinfo_private *priv;
>> - struct codetag *ct;
>> loff_t node = *pos;
>>
>> - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> - m->private = priv;
>> - if (!priv)
>> - return NULL;
>> -
>> - priv->print_header = (node == 0);
>> + priv = (struct allocinfo_private *)m->private;
>> codetag_lock_module_list(alloc_tag_cttype, true);
>> - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
>> - node--;
>> -
>> - return ct ? priv : NULL;
>> + if (node == 0) {
>> + priv->print_header = true;
>> + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> + codetag_next_ct(&priv->iter);
>> + }
>> + // TODO: need to check iter is valid, or rewinding it.
>> + return priv->iter.ct ? priv : NULL;
>> }
>>
>> static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
>> @@ -76,12 +72,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
>>
>> static void allocinfo_stop(struct seq_file *m, void *arg)
>> {
>> - struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
>> -
>> - if (priv) {
>> - codetag_lock_module_list(alloc_tag_cttype, false);
>> - kfree(priv);
>> - }
>> + codetag_lock_module_list(alloc_tag_cttype, false);
>> }
>>
>> static void print_allocinfo_header(struct seq_buf *buf)
>> @@ -249,7 +240,8 @@ static void __init procfs_init(void)
>> if (!mem_profiling_support)
>> return;
>>
>> - if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
>> + if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
>> + sizeof(struct allocinfo_private), NULL)) {
>> pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
>> shutdown_mem_profiling(false);
>> }
>> --
>> 2.39.2
>>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 3:06 ` David Wang
@ 2025-05-08 3:31 ` Kent Overstreet
2025-05-08 3:35 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Kent Overstreet @ 2025-05-08 3:31 UTC (permalink / raw)
To: David Wang; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
On Thu, May 08, 2025 at 11:06:35AM +0800, David Wang wrote:
> Thanks for the feedback~
> I agree that memory allocation normally dose not take major part of a profiling report,
> even profiling a fio test, kmem_cache_alloc only takes ~1% perf samples.
>
> I don't know why I have this "the less memory allocation, the better' mindset, maybe
> I was worrying about memory fragmentation, or something else I learned on some "textbook",
> To be honest, I have never had real experience with those worries....
It's a common bias. "Memory allocations" take up a lot of conceptual
space in our heads, and generally for good reason - i.e. handling memory
allocation errors is often a major concern, and you do always want to be
aware of memory layout.
But this can turn into an aversion that's entirely disproportionate -
e.g. using linked lists and fixed size arrays in ways that are
entirely inappropriate, instead of vectors and other better data
structures; good data structures always require allocations.
Profile, profile, profile, and remember your basic CS (big O notation) -
90% of the time, simple code with good big O running time is all you
need.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 3:31 ` Kent Overstreet
@ 2025-05-08 3:35 ` David Wang
2025-05-08 4:07 ` Kent Overstreet
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-08 3:35 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
At 2025-05-08 11:31:12, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Thu, May 08, 2025 at 11:06:35AM +0800, David Wang wrote:
>> Thanks for the feedback~
>> I agree that memory allocation normally dose not take major part of a profiling report,
>> even profiling a fio test, kmem_cache_alloc only takes ~1% perf samples.
>>
>> I don't know why I have this "the less memory allocation, the better' mindset, maybe
>> I was worrying about memory fragmentation, or something else I learned on some "textbook",
>> To be honest, I have never had real experience with those worries....
>
>It's a common bias. "Memory allocations" take up a lot of conceptual
>space in our heads, and generally for good reason - i.e. handling memory
>allocation errors is often a major concern, and you do always want to be
>aware of memory layout.
>
>But this can turn into an aversion that's entirely disproportionate -
>e.g. using linked linked lists and fixed size arrays in ways that are
>entirely inappropriate, instead of vectors and other better data
>structures; good data structures always require allocations.
>
>Profile, profile, profile, and remember your basic CS (big O notation) -
>90% of the time, simple code with good big O running time is all you
>need.
copy that~!
Thanks
David
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 3:35 ` David Wang
@ 2025-05-08 4:07 ` Kent Overstreet
2025-05-08 5:51 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Kent Overstreet @ 2025-05-08 4:07 UTC (permalink / raw)
To: David Wang; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
On Thu, May 08, 2025 at 11:35:11AM +0800, David Wang wrote:
>
>
> At 2025-05-08 11:31:12, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >On Thu, May 08, 2025 at 11:06:35AM +0800, David Wang wrote:
> >> Thanks for the feedback~
> >> I agree that memory allocation normally dose not take major part of a profiling report,
> >> even profiling a fio test, kmem_cache_alloc only takes ~1% perf samples.
> >>
> >> I don't know why I have this "the less memory allocation, the better' mindset, maybe
> >> I was worrying about memory fragmentation, or something else I learned on some "textbook",
> >> To be honest, I have never had real experience with those worries....
> >
> >It's a common bias. "Memory allocations" take up a lot of conceptual
> >space in our heads, and generally for good reason - i.e. handling memory
> >allocation errors is often a major concern, and you do always want to be
> >aware of memory layout.
> >
> >But this can turn into an aversion that's entirely disproportionate -
> >e.g. using linked linked lists and fixed size arrays in ways that are
> >entirely inappropriate, instead of vectors and other better data
> >structures; good data structures always require allocations.
> >
> >Profile, profile, profile, and remember your basic CS (big O notation) -
> >90% of the time, simple code with good big O running time is all you
> >need.
>
> copy that~!
Another thing to note is that memory layout - avoiding pointer chasing -
is hugely important, but it'll almost never show up as allocator calls.
To give you some examples, mempools and biosets used to be separately
allocated. This was mainly to make error paths in outer object
constructors/destructors easier and safer: instead of keeping track of
what's initialized and what's not, if you've got a pointer to a
mempool/bioset you call *_free() on it.
(People hadn't yet clued in that you can just kzalloc() the entire outer
object, and then if the inner object is zeroed it wasn't initialized).
But that means you're adding a pointer chase to every mempool_alloc()
call, and since a bioset itself has mempools, allocating bios had _two_
unnecessary pointer derefs. That's death for performance when you're
running cache cold, but since everyone benchmarks cache-hot...
(I was the one who fixed that).
Another big one was generic_file_buffered_read(). Main buffered read
path, everyone wants it to be as fast as possible.
But the core is (was) a loop that walks the pagecache radix tree to get
the page, then copies 4k of data out to userspace (there goes l1), then
repeats all that pointer chasing for the next 4k. Pre large folios, it
was horrific.
Solution - vectorize it. Look up all the pages we're copying from all at
once, stuff them in a (dynamically allocated! for each read!) vector,
and then do the copying out to userspace all at once. Massive
performance gain.
Of course, to do that I first had to clean up a tangled 250+ line
monstrosity of half baked, poorly thought out "optimizations" (the worst
spaghetti of gotos you'd ever seen) and turn it into something
manageable...
So - keep things simple, don't overthink the little stuff, so you can
spot and tackle the big algorithmic wins :)
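To make the mempool/bioset point a bit more concrete, here's a minimal
userspace sketch of the pattern (made-up struct names, calloc() standing in
for kzalloc(), nothing to do with the real block layer code): embed the inner
object in the outer one and zero-allocate the whole thing, so a zeroed inner
object means "never initialized" and the hot path never has to chase a
pointer to reach it.
#include <stdlib.h>
struct inner {                  /* stands in for a mempool/bioset */
        void *buf;              /* non-NULL once initialized */
        size_t size;
};
struct outer {                  /* stands in for the object that owns it */
        struct inner pool;      /* embedded: no pointer chase to reach it */
        int other_state;
};
static int inner_init(struct inner *in, size_t size)
{
        in->buf = malloc(size);
        if (!in->buf)
                return -1;
        in->size = size;
        return 0;
}
static void inner_exit(struct inner *in)
{
        /* safe even if inner_init() never ran: buf is still NULL */
        free(in->buf);
        in->buf = NULL;
}
static struct outer *outer_create(void)
{
        struct outer *o = calloc(1, sizeof(*o));    /* kzalloc() analogue */
        if (!o)
                return NULL;
        if (inner_init(&o->pool, 4096)) {
                /* error path stays simple: zeroed fields need no teardown */
                free(o);
                return NULL;
        }
        return o;
}
static void outer_destroy(struct outer *o)
{
        if (!o)
                return;
        inner_exit(&o->pool);
        free(o);
}
int main(void)
{
        struct outer *o = outer_create();
        outer_destroy(o);
        return 0;
}
The error path only needs to know that "zeroed means untouched", which is the
trick described above - and the embedded layout is what removes the extra
dereference on every allocation.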
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 4:07 ` Kent Overstreet
@ 2025-05-08 5:51 ` David Wang
2025-05-08 13:33 ` Kent Overstreet
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-08 5:51 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
At 2025-05-08 12:07:40, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Thu, May 08, 2025 at 11:35:11AM +0800, David Wang wrote:
>>
>>
>> At 2025-05-08 11:31:12, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>> >On Thu, May 08, 2025 at 11:06:35AM +0800, David Wang wrote:
>> >> Thanks for the feedback~
>> >> I agree that memory allocation normally dose not take major part of a profiling report,
>> >> even profiling a fio test, kmem_cache_alloc only takes ~1% perf samples.
>> >>
>> >> I don't know why I have this "the less memory allocation, the better' mindset, maybe
>> >> I was worrying about memory fragmentation, or something else I learned on some "textbook",
>> >> To be honest, I have never had real experience with those worries....
>> >
>> >It's a common bias. "Memory allocations" take up a lot of conceptual
>> >space in our heads, and generally for good reason - i.e. handling memory
>> >allocation errors is often a major concern, and you do always want to be
>> >aware of memory layout.
>> >
>> >But this can turn into an aversion that's entirely disproportionate -
>> >e.g. using linked linked lists and fixed size arrays in ways that are
>> >entirely inappropriate, instead of vectors and other better data
>> >structures; good data structures always require allocations.
>> >
>> >Profile, profile, profile, and remember your basic CS (big O notation) -
>> >90% of the time, simple code with good big O running time is all you
>> >need.
>>
>> copy that~!
>
>Another thing to note is that memory layout - avoiding pointer chasing -
>is hugely important, but it'll almost never show up as allocator calls.
>
>To give you some examples, mempools and biosets used to be separately
>allocated. This was mainly to make error paths in outer object
>constructors/destructors easier and safer: instead of keeping track of
>what's initialized and what's not, if you've got a pointer to a
>mempool/bioset you call *_free() on it.
>
>(People hadn't yet clued that you can just kzalloc() the entire outer
>object, and then if the inner object is zeroed it wasn't initialized).
>
>But that means you're adding a pointer chase to every mempool_alloc()
>call, and since bioset itself has mempools allocating bios had _two_
>unnecessary pointer derefs. That's death for performance when you're
>running cache cold, but since everyone benchmarks cache-hot...
>
>(I was the one who fixed that).
>
>Another big one was generic_file_buffered_read(). Main buffered read
>path, everyone wants it to be as fast as possible.
>
>But the core is (was) a loop that walks the pagecache radix tree to get
>the page, then copies 4k of data out to userspace (there goes l1), then
>repeats all that pointer chasing for the next 4k. Pre large folios, it
>was horrific.
>
>Solution - vectorize it. Look up all the pages we're copying from all at
>once, stuff them in a (dynamically allocated! for each read!) vector,
>and then do the copying out to userspace all at once. Massive
>performance gain.
>
>Of course, to do that I first had to clean up a tangled 250+ line
>monstrosity of half baked, poorly thought out "optimizations" (the worst
>spaghetti of gotos you'd ever seen) and turn it into something
>manageable...
>
>So - keep things simple, don't overthink the little stuff, so you can
>spot and tackle the big algorithmic wins :)
I will keep this in mind~! :)
And thanks for the enlightening notes~!!
Though I could not quite follow the first one, I think I got the point:
avoid unnecessary pointer chasing and keep the pointer chains as short (and balanced) as possible~
The second one, about copying 4k by 4k, seems quite similar to seq_file, at least the "4k" part, literally.
seq_file read() defaults to allocating a 4k buffer, reads data until EOF or the 4k buffer is full, and starts over
again for the next read().
One solution could be to change seq_file so it does not stop until the user buffer is full on each read,
kind of similar to your second note, in a sequential style, I think.
If the user reads with a 128K buffer and seq_file fills it 4k by 4k, it would only need ~3 read calls for allocinfo.
(I did post a patch for seq_file to fill the user buffer, but start/stop still happens at the 4k boundary, so it does not help with
the iterator rewinding when reading /proc/allocinfo yet.
https://lore.kernel.org/lkml/20241220140819.9887-1-00107082@163.com/ )
The solution in this patch is to keep the iterator alive and valid across read boundaries, which also avoids the cost of
each restart.
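For illustration, a minimal, self-contained sketch of the general technique as
a toy proc file (the demo_* names and "allocinfo_demo" are made up; this is
not the actual allocinfo code): the cursor lives in per-open state allocated
by proc_create_seq_private(), and start() only resets it when *pos == 0, so
sequential read() calls continue where the previous one stopped. A real
implementation still has to handle seeks and, for allocinfo, modules coming
and going.
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
static const int demo_items[] = { 10, 20, 30, 40, 50 };
struct demo_state {
        size_t cursor;                  /* persists across read() syscalls */
};
static void *demo_start(struct seq_file *m, loff_t *pos)
{
        struct demo_state *st = m->private;
        if (*pos == 0)                  /* fresh read from the beginning */
                st->cursor = 0;
        /* otherwise keep the saved cursor: no rewinding */
        return st->cursor < ARRAY_SIZE(demo_items) ? st : NULL;
}
static void *demo_next(struct seq_file *m, void *v, loff_t *pos)
{
        struct demo_state *st = v;
        (*pos)++;
        st->cursor++;
        return st->cursor < ARRAY_SIZE(demo_items) ? st : NULL;
}
static void demo_stop(struct seq_file *m, void *v)
{
}
static int demo_show(struct seq_file *m, void *v)
{
        struct demo_state *st = v;
        seq_printf(m, "%d\n", demo_items[st->cursor]);
        return 0;
}
static const struct seq_operations demo_seq_ops = {
        .start  = demo_start,
        .next   = demo_next,
        .stop   = demo_stop,
        .show   = demo_show,
};
static int __init demo_init(void)
{
        if (!proc_create_seq_private("allocinfo_demo", 0444, NULL,
                                     &demo_seq_ops, sizeof(struct demo_state),
                                     NULL))
                return -ENOMEM;
        return 0;
}
static void __exit demo_exit(void)
{
        remove_proc_entry("allocinfo_demo", NULL);
}
module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
With this pattern, repeated read() calls on the same open file never re-walk
the part of the array that was already emitted.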
David
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 5:51 ` David Wang
@ 2025-05-08 13:33 ` Kent Overstreet
2025-05-08 16:24 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Kent Overstreet @ 2025-05-08 13:33 UTC (permalink / raw)
To: David Wang; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
On Thu, May 08, 2025 at 01:51:48PM +0800, David Wang wrote:
> At 2025-05-08 12:07:40, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >Another thing to note is that memory layout - avoiding pointer chasing -
> >is hugely important, but it'll almost never show up as allocator calls.
> >
> >To give you some examples, mempools and biosets used to be separately
> >allocated. This was mainly to make error paths in outer object
> >constructors/destructors easier and safer: instead of keeping track of
> >what's initialized and what's not, if you've got a pointer to a
> >mempool/bioset you call *_free() on it.
> >
> >(People hadn't yet clued that you can just kzalloc() the entire outer
> >object, and then if the inner object is zeroed it wasn't initialized).
> >
> >But that means you're adding a pointer chase to every mempool_alloc()
> >call, and since bioset itself has mempools allocating bios had _two_
> >unnecessary pointer derefs. That's death for performance when you're
> >running cache cold, but since everyone benchmarks cache-hot...
> >
> >(I was the one who fixed that).
> >
> >Another big one was generic_file_buffered_read(). Main buffered read
> >path, everyone wants it to be as fast as possible.
> >
> >But the core is (was) a loop that walks the pagecache radix tree to get
> >the page, then copies 4k of data out to userspace (there goes l1), then
> >repeats all that pointer chasing for the next 4k. Pre large folios, it
> >was horrific.
> >
> >Solution - vectorize it. Look up all the pages we're copying from all at
> >once, stuff them in a (dynamically allocated! for each read!) vector,
> >and then do the copying out to userspace all at once. Massive
> >performance gain.
> >
> >Of course, to do that I first had to clean up a tangled 250+ line
> >monstrosity of half baked, poorly thought out "optimizations" (the worst
> >spaghetti of gotos you'd ever seen) and turn it into something
> >manageable...
> >
> >So - keep things simple, don't overthink the little stuff, so you can
> >spot and tackle the big algorithmic wins :)
> I will keep this in mind~! :)
>
> And thanks for the enlightening notes~!!
>
> Though I could not quite catch up with the first one, I think I got
> the point: avoid unnecessary pointer chasing and keep the pointer
> chasing as short(balanced) as possible~
To illustrate - DRAM latency is 30-70ns.
At 4GHz, that's 120-280 cycles, and a properly fed CPU can do multiple
instructions per clock - so a cache miss all the way to DRAM can cost
you hundreds of instructions.
> The second one, about copy 4k by 4k, seems quite similar to seq_file,
> at least the "4k" part, literally. seq_file read() defaults to alloc
> 4k buffer, and read data until EOF or the 4k buffer is full, and
> start over again for the next read().
>
> One solution could be make changes to seq_file, do not stop until user
> buffer is full for each read. kind of similar to your second note, in
> a sequential style, I think.
>
> If user read with 128K buffer, and seq_file fill the buffer 4k by
> 4k, it would only need ~3 read calls for allocinfo. (I did post a
> patch for seq_file to fill user buffer, but start/stop still happens
> at 4k boundary , so no help for
> the iterator rewinding when read /proc/allocinfo yet.
> https://lore.kernel.org/lkml/20241220140819.9887-1-00107082@163.com/ )
> The solution in this patch is keeping the iterator alive and valid
> cross read boundary, this can also avoid the cost for each start
> over.
The first question is - does it matter? If the optimization is just for
/proc/allocinfo, who's reading it at a high enough rate that we care?
If it's only being used interactively, it doesn't matter. If it's being
read at a high rate by some sort of profiling program, we'd want to skip
the text interface entirely and add an ioctl to read the data out in a
binary format.
The idea of changing seq_file to continue until the user buffer is full
- that'd be a good one, if you're making changes that benefit all
seq_file users.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-07 23:36 ` Suren Baghdasaryan
2025-05-08 3:10 ` David Wang
@ 2025-05-08 15:32 ` David Wang
2025-05-08 21:41 ` Suren Baghdasaryan
1 sibling, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-08 15:32 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: kent.overstreet, akpm, linux-mm, linux-kernel
At 2025-05-08 07:36:14, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Wed, May 7, 2025 at 5:55 PM David Wang <00107082@163.com> wrote:
>>
>> ---
>> The patch is not complete, just want to get feedbacks whether this
>> worth carrying on.
>
>In such cases it's customary to mark the patch as RFC which saves you
>time on explaining your motivation :)
>
>> ---
>> When reading /proc/allocinfo, for each read syscall, seq_file would
>> invoke start/stop callbacks. In start callback, a memory is alloced
>> to store iterator and the iterator would restart from beginning to
>> walk to its previous position.
>> Each seq_file read() takes at most 4096 bytes, even read with a larger
>> user space buffer, meaning read out /proc/allocinfo, tens of read
>> syscalls are needed. For example, a 306036 bytes allocinfo files need
>> 76 reads:
>>
>> $ sudo cat /proc/allocinfo | wc
>> 3964 16678 306036
>>
>> For those n=3964 lines, each read takes about m=3964/76=52 lines,
>> the iter would be rewinding:
>> m steps on 1st read,
>> 2*m steps on 2nd read
>> 3*m steps on 3rd read
>> ...
>> n steps on the last read
>> totally, the iterator would be iterated O(n*n/m) times.
>> (Each read would take more time than previous one.)
>>
>> To use a private data alloced when /proc/allocinfo is opened,
>> the n/m memory alloction could be avoid, and there is no need
>> to restart the iterator from very beginning everytime.
>> So only 1 memory allocation and n steps for iterating are needed.
>> (Only when module changed, the iterator should be invalidated and
>> restart.)
>
>Yeah, your change makes sense and looks like a good optimization. From
>a quick look at the code, codetag_next_ct() should handle the case
>when a module gets removed from under us while we are not holding
>cttype->mod_lock. I'll need to take another closer look at it once you
>post an official patch.
>Thanks!
>
The module tag container is designed to be more "compact" than I imagined. It seems that no
extra iterator validation is needed for most situations, but I get anxious about the following
possibility:
In between read() calls, module A is removed and then module B is inserted, and by accident A
and B get the same IDR id (id reuse) and the same "struct module" address (kmalloc happens
to pick the cmod address kfreed by module A).
If this happens, the `if (cmod != iter->cmod)` check in codetag_next_ct may not be
completely safe....
What about adding a clock/timestamp/expiration to cttype/module/iterator:
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index d14dbd26b370..fc9f430090ae 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -54,6 +54,7 @@ struct codetag_iterator {
struct codetag_module *cmod;
unsigned long mod_id;
struct codetag *ct;
+ unsigned long expiration;
};
#ifdef MODULE
diff --git a/lib/codetag.c b/lib/codetag.c
index 42aadd6c1454..a795b152ce92 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -13,6 +13,8 @@ struct codetag_type {
struct idr mod_idr;
struct rw_semaphore mod_lock; /* protects mod_idr */
struct codetag_type_desc desc;
+ /* timestamping iterator expiration */
+ unsigned long clock;
};
struct codetag_range {
@@ -23,6 +25,8 @@ struct codetag_range {
struct codetag_module {
struct module *mod;
struct codetag_range range;
+ /* creation timestamp */
+ unsigned long timestamp;
};
static DEFINE_MUTEX(codetag_lock);
@@ -48,6 +52,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
.cmod = NULL,
.mod_id = 0,
.ct = NULL,
+ .expiration = 0,
};
return iter;
@@ -93,6 +98,11 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
if (cmod != iter->cmod) {
iter->cmod = cmod;
+ iter->expiration = cmod->timestamp;
+ ct = get_first_module_ct(cmod);
+ } else if (cmod->timestamp != iter->expiration) {
+ pr_warn("Same IDR id and module address, but different module!");
+ iter->expiration = cmod->timestamp;
ct = get_first_module_ct(cmod);
} else
ct = get_next_module_ct(iter);
@@ -101,6 +111,7 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
break;
iter->mod_id++;
+ iter->cmod = NULL;
}
iter->ct = ct;
@@ -169,6 +180,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
struct codetag_module *cmod;
int err;
+ cttype->clock++;
range = get_section_range(mod, cttype->desc.section);
if (!range.start || !range.stop) {
pr_warn("Failed to load code tags of type %s from the module %s\n",
@@ -188,6 +200,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
cmod->mod = mod;
cmod->range = range;
+ cmod->timestamp = cttype->clock;
down_write(&cttype->mod_lock);
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
And I noticed another issue: there are duplicate keys (file:line + module + func) in allocinfo even without this patch:
On my workstation:
=======================
1400832 114 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 840764
0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 1
0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 2
0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 758
0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 62951
0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 1
0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 325450
12288 1 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 1
=======================
81920 20 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node 20
1441792 352 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node 352
=======================
112 7 kernel/sched/sched.h:2587 func:alloc_user_cpus_ptr 1591
48 3 kernel/sched/sched.h:2587 func:alloc_user_cpus_ptr 12
=======================
48 1 mm/mm_slot.h:28 func:mm_slot_alloc 4
2160 54 mm/mm_slot.h:28 func:mm_slot_alloc 10705
Most duplicate keys come from "*.h" files.
On a KVM guest, duplication also happens for some "*.c" files:
=======================
0 0 ./include/crypto/kpp.h:185 func:kpp_request_alloc
0 0 ./include/crypto/kpp.h:185 func:kpp_request_alloc
=======================
0 0 ./include/net/tcp.h:2548 func:tcp_v4_save_options
0 0 ./include/net/tcp.h:2548 func:tcp_v4_save_options
=======================
0 0 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node
0 0 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node
0 0 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node
=======================
0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
=======================
0 0 fs/binfmt_elf.c:1395 func:load_elf_library
0 0 fs/binfmt_elf.c:1395 func:load_elf_library
=======================
0 0 fs/binfmt_elf.c:1653 func:fill_files_note
0 0 fs/binfmt_elf.c:1653 func:fill_files_note
=======================
0 0 fs/binfmt_elf.c:1851 func:fill_note_info
0 0 fs/binfmt_elf.c:1851 func:fill_note_info
=======================
0 0 fs/binfmt_elf.c:1891 func:fill_note_info
0 0 fs/binfmt_elf.c:1891 func:fill_note_info
=======================
0 0 fs/binfmt_elf.c:1899 func:fill_note_info
0 0 fs/binfmt_elf.c:1899 func:fill_note_info
=======================
0 0 fs/binfmt_elf.c:2057 func:elf_core_dump
0 0 fs/binfmt_elf.c:2057 func:elf_core_dump
=======================
0 0 fs/binfmt_elf.c:2072 func:elf_core_dump
0 0 fs/binfmt_elf.c:2072 func:elf_core_dump
=======================
0 0 fs/binfmt_elf.c:532 func:load_elf_phdrs
0 0 fs/binfmt_elf.c:532 func:load_elf_phdrs
=======================
0 0 fs/binfmt_elf.c:885 func:load_elf_binary
0 0 fs/binfmt_elf.c:885 func:load_elf_binary
=======================
0 0 fs/binfmt_elf.c:910 func:load_elf_binary
0 0 fs/binfmt_elf.c:910 func:load_elf_binary
=======================
0 0 fs/fuse/fuse_i.h:1082 [fuse] func:fuse_folios_alloc
0 0 fs/fuse/fuse_i.h:1082 [fuse] func:fuse_folios_alloc
=======================
0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
=======================
0 0 kernel/sched/sched.h:2587 func:alloc_user_cpus_ptr
0 0 kernel/sched/sched.h:2587 func:alloc_user_cpus_ptr
=======================
0 0 mm/mm_slot.h:28 func:mm_slot_alloc
160 4 mm/mm_slot.h:28 func:mm_slot_alloc
=======================
0 0 security/apparmor/domain.c:1136 func:change_hat
0 0 security/apparmor/domain.c:1136 func:change_hat
=======================
0 0 security/apparmor/domain.c:1455 func:aa_change_profile
0 0 security/apparmor/domain.c:1455 func:aa_change_profile
=======================
0 0 security/apparmor/domain.c:835 func:handle_onexec
0 0 security/apparmor/domain.c:835 func:handle_onexec
=======================
0 0 security/apparmor/domain.c:909 func:apparmor_bprm_creds_for_exec
0 0 security/apparmor/domain.c:909 func:apparmor_bprm_creds_for_exec
=======================
0 0 security/apparmor/mount.c:738 func:aa_pivotroot
0 0 security/apparmor/mount.c:738 func:aa_pivotroot
My workstation has my own changes based on 6.15-rc5, but I didn't touch any code related to tags...
The KVM guest runs 6.15-rc5.
The script used for checking:
#!/bin/env python3
def fetch():
    r = {}
    # group allocinfo lines by their 3rd field (file:line key)
    with open("/proc/allocinfo") as f:
        for line in f:
            key = line.strip().split()[2]
            r.setdefault(key, []).append(line)
    # print only the keys that appear more than once
    for key in sorted(k for k, lines in r.items() if len(lines) > 1):
        print("=======================")
        for line in r[key]:
            print(line, end="")

fetch()
David
>>
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 13:33 ` Kent Overstreet
@ 2025-05-08 16:24 ` David Wang
2025-05-08 16:34 ` Kent Overstreet
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-08 16:24 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
At 2025-05-08 21:33:50, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Thu, May 08, 2025 at 01:51:48PM +0800, David Wang wrote:
>> At 2025-05-08 12:07:40, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>> >Another thing to note is that memory layout - avoiding pointer chasing -
>> >is hugely important, but it'll almost never show up as allocator calls.
>> >
>> >To give you some examples, mempools and biosets used to be separately
>> >allocated. This was mainly to make error paths in outer object
>> >constructors/destructors easier and safer: instead of keeping track of
>> >what's initialized and what's not, if you've got a pointer to a
>> >mempool/bioset you call *_free() on it.
>> >
>> >(People hadn't yet clued that you can just kzalloc() the entire outer
>> >object, and then if the inner object is zeroed it wasn't initialized).
>> >
>> >But that means you're adding a pointer chase to every mempool_alloc()
>> >call, and since bioset itself has mempools allocating bios had _two_
>> >unnecessary pointer derefs. That's death for performance when you're
>> >running cache cold, but since everyone benchmarks cache-hot...
>> >
>> >(I was the one who fixed that).
>> >
>> >Another big one was generic_file_buffered_read(). Main buffered read
>> >path, everyone wants it to be as fast as possible.
>> >
>> >But the core is (was) a loop that walks the pagecache radix tree to get
>> >the page, then copies 4k of data out to userspace (there goes l1), then
>> >repeats all that pointer chasing for the next 4k. Pre large folios, it
>> >was horrific.
>> >
>> >Solution - vectorize it. Look up all the pages we're copying from all at
>> >once, stuff them in a (dynamically allocated! for each read!) vector,
>> >and then do the copying out to userspace all at once. Massive
>> >performance gain.
>> >
>> >Of course, to do that I first had to clean up a tangled 250+ line
>> >monstrosity of half baked, poorly thought out "optimizations" (the worst
>> >spaghetti of gotos you'd ever seen) and turn it into something
>> >manageable...
>> >
>> >So - keep things simple, don't overthink the little stuff, so you can
>> >spot and tackle the big algorithmic wins :)
>> I will keep this in mind~! :)
>>
>> And thanks for the enlightening notes~!!
>>
>> Though I could not quite catch up with the first one, I think I got
>> the point: avoid unnecessary pointer chasing and keep the pointer
>> chasing as short(balanced) as possible~
>
>To illustrate - DRAM latency is 30-70n.
>
>At 4GHz, that's 120-280 cycles, and a properly fed CPU can do multiple
>instructions per clock - so a cache miss all the way to DRAM can cost
>you hundreds of instructions.
Oh, I understand cache misses are bad; it is the "mempools and biosets" example
that I have a hard time connecting the dots on, due to lack of knowledge.....
>
>> The second one, about copy 4k by 4k, seems quite similar to seq_file,
>> at least the "4k" part, literally. seq_file read() defaults to alloc
>> 4k buffer, and read data until EOF or the 4k buffer is full, and
>> start over again for the next read().
>>
>> One solution could be make changes to seq_file, do not stop until user
>> buffer is full for each read. kind of similar to your second note, in
>> a sequential style, I think.
>>
>> If user read with 128K buffer, and seq_file fill the buffer 4k by
>> 4k, it would only need ~3 read calls for allocinfo. (I did post a
>> patch for seq_file to fill user buffer, but start/stop still happens
>> at 4k boundary , so no help for
>> the iterator rewinding when read /proc/allocinfo yet.
>> https://lore.kernel.org/lkml/20241220140819.9887-1-00107082@163.com/ )
>> The solution in this patch is keeping the iterator alive and valid
>> cross read boundary, this can also avoid the cost for each start
>> over.
>
>The first question is - does it matter? If the optimization is just for
>/proc/allocinfo, who's reading it at a high enough rate that we care?
>
>If it's only being used interactively, it doesn't matter. If it's being
>read at a high rate by some sort of profiling program, we'd want to skip
>the text interface entirely and add an ioctl to read the data out in a
>binary format.
...^_^, Actually, I have been running tools that parse /proc/allocinfo every 5 seconds
and feed the data to a prometheus server for quite a long while...
5 seconds does not seem that frequent, but I also have all the other proc files to read,
and I would like optimization for all the proc files......
Ioctl or other binary interfaces are indeed more efficient, but most are
not well documented, while most proc files are self-documenting. If proc files
are efficient enough, I think I would stay with proc files even with a binary
interface alternative that is tens of times faster.
>
>The idea of changing seq_file to continue until the user buffer is full
>- that'd be a good one, if you're making changes that benefit all
>seq_file users.
I did post that patch; I think I am still waiting for feedback......
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 16:24 ` David Wang
@ 2025-05-08 16:34 ` Kent Overstreet
2025-05-08 16:58 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Kent Overstreet @ 2025-05-08 16:34 UTC (permalink / raw)
To: David Wang; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
On Fri, May 09, 2025 at 12:24:56AM +0800, David Wang wrote:
> At 2025-05-08 21:33:50, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >The first question is - does it matter? If the optimization is just for
> >/proc/allocinfo, who's reading it at a high enough rate that we care?
> >
> >If it's only being used interactively, it doesn't matter. If it's being
> >read at a high rate by some sort of profiling program, we'd want to skip
> >the text interface entirely and add an ioctl to read the data out in a
> >binary format.
> ...^_^, Actually, I have been running tools parsing /proc/allocinfo every 5 seconds
> ,and feeding data to a prometheus server for a quite long while...
> 5 seconds seems not that frequent, but I also have all other proc files to read,
> I would like optimization for all the proc files......
>
> Ioctl or other binary interfaces are indeed more efficient, but most are
> not well documented, while most proc files are self-documented. If proc files
> are efficient enough, I think I would stay with proc files even with a binary
> interface alternate tens of fold faster.
This would be a perfect place for a binary interface, you just want to
return an array of
struct allocated_by_ip {
u64 ip;
u64 bytes;
};
Printing it in text form requires symbol table lookup, what you're
optimizing is noise compared to that and vsnprintf().
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 16:34 ` Kent Overstreet
@ 2025-05-08 16:58 ` David Wang
2025-05-08 17:17 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-08 16:58 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
At 2025-05-09 00:34:27, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Fri, May 09, 2025 at 12:24:56AM +0800, David Wang wrote:
>> At 2025-05-08 21:33:50, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>> >The first question is - does it matter? If the optimization is just for
>> >/proc/allocinfo, who's reading it at a high enough rate that we care?
>> >
>> >If it's only being used interactively, it doesn't matter. If it's being
>> >read at a high rate by some sort of profiling program, we'd want to skip
>> >the text interface entirely and add an ioctl to read the data out in a
>> >binary format.
>> ...^_^, Actually, I have been running tools parsing /proc/allocinfo every 5 seconds
>> ,and feeding data to a prometheus server for a quite long while...
>> 5 seconds seems not that frequent, but I also have all other proc files to read,
>> I would like optimization for all the proc files......
>>
>> Ioctl or other binary interfaces are indeed more efficient, but most are
>> not well documented, while most proc files are self-documented. If proc files
>> are efficient enough, I think I would stay with proc files even with a binary
>> interface alternate tens of fold faster.
>
>This would be a perfect place for a binary interface, you just want to
>return an array of
>
>struct allocated_by_ip {
> u64 ip;
> u64 bytes;
>};
>
>Printing it in text form requires symbol table lookup, what you're
>optimizing is noise compared to that and vsnprintf().
Oh, no, this optimization is mostly achieved by avoiding iterator rewinding; I think
I talked about the extra memory allocation "too much"....
These lines of code:
- while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
- node--;
have accumulated way too much work.
Think of it this way: advancing the iterator n times takes 1%, reasonable noise
compared to symbol lookup and printf(). The problem is that seq_file would
restart about 80 times to read out all the content of /proc/allocinfo, accumulating
to a total of about 40*n iterator advancements, so 1% becomes 40*1% and the noise becomes significant.
My test result shows an improvement from 7ms to 4ms:
Timings before:
$ time cat /proc/allocinfo > /dev/null
real 0m0.007s
user 0m0.000s
sys 0m0.007s
read-syscalls get slower and slower:
read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000062>
...
read(3, " 0 0 drivers/gp"..., 131072) = 4046 <0.000135>
read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
...
and with the change:
$ time cat /proc/allocinfo > /dev/null
real 0m0.004s
user 0m0.000s
sys 0m0.003s
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 16:58 ` David Wang
@ 2025-05-08 17:17 ` David Wang
2025-05-08 17:26 ` Kent Overstreet
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-08 17:17 UTC (permalink / raw)
To: Kent Overstreet; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
The impact of the accumulated iterator rewinding can be observed via:
sudo strace -T -e read cat /proc/allocinfo > /dev/null
Tens of reads should be observed, and each read should become slower and slower.
At 2025-05-09 00:58:25, "David Wang" <00107082@163.com> wrote:
>
>At 2025-05-09 00:34:27, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>>On Fri, May 09, 2025 at 12:24:56AM +0800, David Wang wrote:
>>> At 2025-05-08 21:33:50, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>>> >The first question is - does it matter? If the optimization is just for
>>> >/proc/allocinfo, who's reading it at a high enough rate that we care?
>>> >
>>> >If it's only being used interactively, it doesn't matter. If it's being
>>> >read at a high rate by some sort of profiling program, we'd want to skip
>>> >the text interface entirely and add an ioctl to read the data out in a
>>> >binary format.
>>> ...^_^, Actually, I have been running tools parsing /proc/allocinfo every 5 seconds
>>> ,and feeding data to a prometheus server for a quite long while...
>>> 5 seconds seems not that frequent, but I also have all other proc files to read,
>>> I would like optimization for all the proc files......
>>>
>>> Ioctl or other binary interfaces are indeed more efficient, but most are
>>> not well documented, while most proc files are self-documented. If proc files
>>> are efficient enough, I think I would stay with proc files even with a binary
>>> interface alternate tens of fold faster.
>>
>>This would be a perfect place for a binary interface, you just want to
>>return an array of
>>
>>struct allocated_by_ip {
>> u64 ip;
>> u64 bytes;
>>};
>
>
>
>>
>>Printing it in text form requires symbol table lookup, what you're
>>optimizing is noise compared to that and vsnprintf().
>
>Oh, no, this optimization is mostly achieved by avoiding iter rewinding, I think
>I talk about the extra memory allocation "too much"....
>These lines of code:
>- while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
>- node--;
>have accumulated way too much.
>Think it this way, advancing iterator n times takes 1%, reasonable noise
>compared to symbol lookup and printf(). The problem is seq_file() would
>restart about 80 times to read out all content of /proc/allocinfo, accumulated
>to a total 40*n iterator advancement, hence 1% become 40*1%, noise become significant.
>
>My test result shows an improvement from 7ms to 4ms:
>
>Timings before:
> $ time cat /proc/allocinfo > /dev/null
>
> real 0m0.007s
> user 0m0.000s
> sys 0m0.007s
>read-syscalls get slower and slower:
> read(3, "allocinfo - version: 1.0\n# <"..., 131072) = 4085 <0.000062>
> ...
> read(3, " 0 0 drivers/gp"..., 131072) = 4046 <0.000135>
> read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
> ...
>
>and with the change:
> $ time cat /proc/allocinfo > /dev/null
>
> real 0m0.004s
> user 0m0.000s
> sys 0m0.003s
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 17:17 ` David Wang
@ 2025-05-08 17:26 ` Kent Overstreet
0 siblings, 0 replies; 44+ messages in thread
From: Kent Overstreet @ 2025-05-08 17:26 UTC (permalink / raw)
To: David Wang; +Cc: Suren Baghdasaryan, akpm, linux-mm, linux-kernel
On Fri, May 09, 2025 at 01:17:46AM +0800, David Wang wrote:
>
> The impact of accumulated iterator rewinding could be observed via:
>
> sudo strace -T -e read cat /proc/allocinfo > /dev/null
Well, that sounds like it's worth fixing then :)
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo
2025-05-08 15:32 ` David Wang
@ 2025-05-08 21:41 ` Suren Baghdasaryan
0 siblings, 0 replies; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-08 21:41 UTC (permalink / raw)
To: David Wang; +Cc: kent.overstreet, akpm, linux-mm, linux-kernel
On Thu, May 8, 2025 at 8:32 AM David Wang <00107082@163.com> wrote:
>
>
>
> At 2025-05-08 07:36:14, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >On Wed, May 7, 2025 at 5:55 PM David Wang <00107082@163.com> wrote:
> >>
> >> ---
> >> The patch is not complete, just want to get feedbacks whether this
> >> worth carrying on.
> >
> >In such cases it's customary to mark the patch as RFC which saves you
> >time on explaining your motivation :)
> >
> >> ---
> >> When reading /proc/allocinfo, for each read syscall, seq_file would
> >> invoke start/stop callbacks. In start callback, a memory is alloced
> >> to store iterator and the iterator would restart from beginning to
> >> walk to its previous position.
> >> Each seq_file read() takes at most 4096 bytes, even read with a larger
> >> user space buffer, meaning read out /proc/allocinfo, tens of read
> >> syscalls are needed. For example, a 306036 bytes allocinfo files need
> >> 76 reads:
> >>
> >> $ sudo cat /proc/allocinfo | wc
> >> 3964 16678 306036
> >>
> >> For those n=3964 lines, each read takes about m=3964/76=52 lines,
> >> the iter would be rewinding:
> >> m steps on 1st read,
> >> 2*m steps on 2nd read
> >> 3*m steps on 3rd read
> >> ...
> >> n steps on the last read
> >> totally, the iterator would be iterated O(n*n/m) times.
> >> (Each read would take more time than previous one.)
> >>
> >> To use a private data alloced when /proc/allocinfo is opened,
> >> the n/m memory alloction could be avoid, and there is no need
> >> to restart the iterator from very beginning everytime.
> >> So only 1 memory allocation and n steps for iterating are needed.
> >> (Only when module changed, the iterator should be invalidated and
> >> restart.)
> >
> >Yeah, your change makes sense and looks like a good optimization. From
> >a quick look at the code, codetag_next_ct() should handle the case
> >when a module gets removed from under us while we are not holding
> >cttype->mod_lock. I'll need to take another closer look at it once you
> >post an official patch.
> >Thanks!
> >
> The module tag container designed more "compact" than I imaged. It seems that no
> extra iterator validation needed for most situations, but I get anxious about the following
> possibility:
>
> In between read() calls, module A removed and then module B inserted, accidentally A
> and B have same IDR id (id reused) and same "struct module" address (kmalloc happened
> to pick the cmod address kfree by module A).
> If this happened, the `if (cmod != iter->cmod)` check in codetag_next_ct may not be
> solid safe....
>
> What about adding a clock/timestamp/expiration to cttype/module/iterator:
I see there was a followup discussion but I don't think your question
was answered. Instead of expiration I would suggest adding a timestamp
in the struct codetag_module that would store the time the module was
loaded (basically the time when struct codetag_module gets created)
and also add a timestamp in the struct codetag_iterator. Whenever
iter->cmod gets assigned a new module during the walk (see
https://elixir.bootlin.com/linux/v6.14.5/source/lib/codetag.c#L95) we
update the iterator's timestamp (iter->timestamp = cmod->timestamp) and
then we can validate that the module was not replaced from under us by
comparing iter->timestamp and cmod->timestamp. If the module was
replaced from under us, the timestamps will not be equal, so we can
reset the iterator.
>
> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> index d14dbd26b370..fc9f430090ae 100644
> --- a/include/linux/codetag.h
> +++ b/include/linux/codetag.h
> @@ -54,6 +54,7 @@ struct codetag_iterator {
> struct codetag_module *cmod;
> unsigned long mod_id;
> struct codetag *ct;
> + unsigned long expiration;
> };
>
> #ifdef MODULE
> diff --git a/lib/codetag.c b/lib/codetag.c
> index 42aadd6c1454..a795b152ce92 100644
> --- a/lib/codetag.c
> +++ b/lib/codetag.c
> @@ -13,6 +13,8 @@ struct codetag_type {
> struct idr mod_idr;
> struct rw_semaphore mod_lock; /* protects mod_idr */
> struct codetag_type_desc desc;
> + /* timestamping iterator expiration */
> + unsigned long clock;
> };
>
> struct codetag_range {
> @@ -23,6 +25,8 @@ struct codetag_range {
> struct codetag_module {
> struct module *mod;
> struct codetag_range range;
> + /* creation timestamp */
> + unsigned long timestamp;
> };
>
> static DEFINE_MUTEX(codetag_lock);
> @@ -48,6 +52,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> .cmod = NULL,
> .mod_id = 0,
> .ct = NULL,
> + .expiration = 0,
> };
>
> return iter;
> @@ -93,6 +98,11 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
>
> if (cmod != iter->cmod) {
> iter->cmod = cmod;
> + iter->expiration = cmod->timestamp;
> + ct = get_first_module_ct(cmod);
> + } else if (cmod->timestamp != iter->expiration) {
> + pr_warn("Same IDR id and module address, but different module!");
> + iter->expiration = cmod->timestamp;
> ct = get_first_module_ct(cmod);
> } else
> ct = get_next_module_ct(iter);
> @@ -101,6 +111,7 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
> break;
>
> iter->mod_id++;
> + iter->cmod = NULL;
> }
>
> iter->ct = ct;
> @@ -169,6 +180,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> struct codetag_module *cmod;
> int err;
>
> + cttype->clock++;
> range = get_section_range(mod, cttype->desc.section);
> if (!range.start || !range.stop) {
> pr_warn("Failed to load code tags of type %s from the module %s\n",
> @@ -188,6 +200,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
>
> cmod->mod = mod;
> cmod->range = range;
> + cmod->timestamp = cttype->clock;
>
> down_write(&cttype->mod_lock);
> err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
>
>
>
>
>
> And I notice another issue: there are duplicating keys(file:line+module+func) in allocinfo even without this patch:
> On my workstation :
> =======================
> 1400832 114 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 840764
> 0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 1
> 0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 2
> 0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 758
> 0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 62951
> 0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 1
> 0 0 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 325450
> 12288 1 ././common/inc/nv-linux.h:1117 [nvidia] func:nv_kmem_cache_alloc_stack 1
> =======================
> 81920 20 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node 20
> 1441792 352 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node 352
> =======================
> 112 7 kernel/sched/sched.h:2587 func:alloc_user_cpus_ptr 1591
> 48 3 kernel/sched/sched.h:2587 func:alloc_user_cpus_ptr 12
> =======================
> 48 1 mm/mm_slot.h:28 func:mm_slot_alloc 4
> 2160 54 mm/mm_slot.h:28 func:mm_slot_alloc 10705
>
> Most duplicating keys are from "*.h" files
> On a KVM, duplication also happens to some "*.c" files:
> =======================
> 0 0 ./include/crypto/kpp.h:185 func:kpp_request_alloc
> 0 0 ./include/crypto/kpp.h:185 func:kpp_request_alloc
> =======================
> 0 0 ./include/net/tcp.h:2548 func:tcp_v4_save_options
> 0 0 ./include/net/tcp.h:2548 func:tcp_v4_save_options
> =======================
> 0 0 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node
> 0 0 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node
> 0 0 drivers/iommu/amd/../iommu-pages.h:94 func:iommu_alloc_pages_node
> =======================
> 0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
> 0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
> 0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
> 0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
> 0 0 drivers/iommu/intel/../iommu-pages.h:94 func:iommu_alloc_pages_node
> =======================
> 0 0 fs/binfmt_elf.c:1395 func:load_elf_library
> 0 0 fs/binfmt_elf.c:1395 func:load_elf_library
> =======================
> 0 0 fs/binfmt_elf.c:1653 func:fill_files_note
> 0 0 fs/binfmt_elf.c:1653 func:fill_files_note
> =======================
> 0 0 fs/binfmt_elf.c:1851 func:fill_note_info
> 0 0 fs/binfmt_elf.c:1851 func:fill_note_info
> =======================
> 0 0 fs/binfmt_elf.c:1891 func:fill_note_info
> 0 0 fs/binfmt_elf.c:1891 func:fill_note_info
> =======================
> 0 0 fs/binfmt_elf.c:1899 func:fill_note_info
> 0 0 fs/binfmt_elf.c:1899 func:fill_note_info
> =======================
> 0 0 fs/binfmt_elf.c:2057 func:elf_core_dump
> 0 0 fs/binfmt_elf.c:2057 func:elf_core_dump
> =======================
> 0 0 fs/binfmt_elf.c:2072 func:elf_core_dump
> 0 0 fs/binfmt_elf.c:2072 func:elf_core_dump
> =======================
> 0 0 fs/binfmt_elf.c:532 func:load_elf_phdrs
> 0 0 fs/binfmt_elf.c:532 func:load_elf_phdrs
> =======================
> 0 0 fs/binfmt_elf.c:885 func:load_elf_binary
> 0 0 fs/binfmt_elf.c:885 func:load_elf_binary
> =======================
> 0 0 fs/binfmt_elf.c:910 func:load_elf_binary
> 0 0 fs/binfmt_elf.c:910 func:load_elf_binary
> =======================
> 0 0 fs/fuse/fuse_i.h:1082 [fuse] func:fuse_folios_alloc
> 0 0 fs/fuse/fuse_i.h:1082 [fuse] func:fuse_folios_alloc
> =======================
> 0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
> 0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
> 0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
> 0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
> 0 0 io_uring/io_uring.h:253 func:io_uring_alloc_async_data
> =======================
> 0 0 kernel/sched/sched.h:2587 func:alloc_user_cpus_ptr
> 0 0 kernel/sched/sched.h:2587 func:alloc_user_cpus_ptr
> =======================
> 0 0 mm/mm_slot.h:28 func:mm_slot_alloc
> 160 4 mm/mm_slot.h:28 func:mm_slot_alloc
> =======================
> 0 0 security/apparmor/domain.c:1136 func:change_hat
> 0 0 security/apparmor/domain.c:1136 func:change_hat
> =======================
> 0 0 security/apparmor/domain.c:1455 func:aa_change_profile
> 0 0 security/apparmor/domain.c:1455 func:aa_change_profile
> =======================
> 0 0 security/apparmor/domain.c:835 func:handle_onexec
> 0 0 security/apparmor/domain.c:835 func:handle_onexec
> =======================
> 0 0 security/apparmor/domain.c:909 func:apparmor_bprm_creds_for_exec
> 0 0 security/apparmor/domain.c:909 func:apparmor_bprm_creds_for_exec
> =======================
> 0 0 security/apparmor/mount.c:738 func:aa_pivotroot
> 0 0 security/apparmor/mount.c:738 func:aa_pivotroot
>
>
>
> My workstation have my own changes based on 6.15-rc5, but I didn't touch any code about tags...
> The KVM runs 6.15-rc5.
>
> The script for checking:
>
> #!/bin/env python
> def fetch():
> r = {}
> with open("/proc/allocinfo") as f:
> for l in f:
> f = l.strip().split()[2]
> if f not in r: r[f]=[]
> r[f].append(l)
> keys = []
> for f, ls in r.items():
> if len(ls) > 1: keys.append(f)
> keys.sort()
> for f in keys:
> print "======================="
> for l in r[f]: print l,
>
> fetch()
>
>
>
>
>
> David
>
>
> >>
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 1/2] alloc_tag: add timestamp to codetag iterator
2025-05-07 17:55 [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo David Wang
2025-05-07 18:19 ` David Wang
2025-05-07 23:36 ` Suren Baghdasaryan
@ 2025-05-09 5:51 ` David Wang
2025-05-09 16:10 ` Suren Baghdasaryan
2025-05-09 5:53 ` [PATCH 2/2] alloc_tag: keep codetag iterator cross read() calls David Wang
` (2 subsequent siblings)
5 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-09 5:51 UTC (permalink / raw)
To: surenb, kent.overstreet, akpm; +Cc: linux-kernel, David Wang
The codetag iterator uses an <id, address> pair to guarantee
validity. But both the id and the address can be reused: there is a
theoretical possibility that, when a module is inserted right after
another module is removed, kmalloc returns an address kfreed by the
previous module and the IDR reuses the key that was just removed.
Add a timestamp to codetag_module and codetag_iterator; the
timestamp is generated by a clock which is strictly
incremented whenever a module is loaded. An iterator is
valid if and only if its timestamp matches the codetag_module's.
Signed-off-by: David Wang <00107082@163.com>
---
include/linux/codetag.h | 1 +
lib/codetag.c | 12 ++++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index d14dbd26b370..61d43c3fbd19 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -54,6 +54,7 @@ struct codetag_iterator {
struct codetag_module *cmod;
unsigned long mod_id;
struct codetag *ct;
+ unsigned long timestamp;
};
#ifdef MODULE
diff --git a/lib/codetag.c b/lib/codetag.c
index 42aadd6c1454..973bfa5dd5db 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -13,6 +13,8 @@ struct codetag_type {
struct idr mod_idr;
struct rw_semaphore mod_lock; /* protects mod_idr */
struct codetag_type_desc desc;
+ /* generates timestamp for module load */
+ unsigned long clock;
};
struct codetag_range {
@@ -23,6 +25,7 @@ struct codetag_range {
struct codetag_module {
struct module *mod;
struct codetag_range range;
+ unsigned long timestamp;
};
static DEFINE_MUTEX(codetag_lock);
@@ -48,6 +51,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
.cmod = NULL,
.mod_id = 0,
.ct = NULL,
+ .timestamp = 0,
};
return iter;
@@ -91,11 +95,13 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
if (!cmod)
break;
- if (cmod != iter->cmod) {
+ if (!iter->cmod || iter->timestamp != cmod->timestamp) {
iter->cmod = cmod;
+ iter->timestamp = cmod->timestamp;
ct = get_first_module_ct(cmod);
- } else
+ } else {
ct = get_next_module_ct(iter);
+ }
if (ct)
break;
@@ -190,6 +196,8 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
cmod->range = range;
down_write(&cttype->mod_lock);
+ cttype->clock++;
+ cmod->timestamp = cttype->clock;
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
if (err >= 0) {
cttype->count += range_size(cttype, &range);
--
2.39.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH 2/2] alloc_tag: keep codetag iterator cross read() calls
2025-05-07 17:55 [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo David Wang
` (2 preceding siblings ...)
2025-05-09 5:51 ` [PATCH 1/2] alloc_tag: add timestamp to codetag iterator David Wang
@ 2025-05-09 5:53 ` David Wang
2025-05-09 17:34 ` Suren Baghdasaryan
2025-05-09 17:38 ` [PATCH v2 1/2] alloc_tag: add sequence number for module and iterator David Wang
2025-05-09 17:39 ` [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls David Wang
5 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-09 5:53 UTC (permalink / raw)
To: surenb, kent.overstreet, akpm; +Cc: linux-kernel, linux-mm, David Wang
When reading /proc/allocinfo, each read() syscall makes seq_file
invoke the start/stop callbacks. In the start callback, memory is
allocated to store an iterator, and the iterator walks from the
beginning up to the current read position.
A seq_file read() returns at most 4096 bytes, even when called with a
larger user-space buffer, so reading out all of /proc/allocinfo takes
tens of read syscalls. For example, a 306036-byte allocinfo file needs
76 reads:
$ sudo cat /proc/allocinfo | wc
3964 16678 306036
$ sudo strace -T -e read cat /proc/allocinfo
...
read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
...
read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
...
For those n=3964 lines, each read covers about m=3964/76=52 lines.
Since the iterator restarts from the beginning on each read(),
it moves forward
m steps on the 1st read
2*m steps on the 2nd read
3*m steps on the 3rd read
...
n steps on the last read
As the reads progress, the growing number of iterator steps makes each
read() call slower than the last. Adding those up, the codetag iterator
moves about O(n*n/m) steps, so data structure traversal takes a
significant part of reading the whole file.
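Spelled out, that sum is roughly
m + 2*m + ... + n ~= n*(n+m)/(2*m) ~= 3964*4016/104 ~= 150,000
iterator steps for a single pass over the file, versus only n ~= 4,000
steps if the iterator were kept alive across read() calls.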
Profiling while stress-reading /proc/allocinfo confirms it:
vfs_read(99.959% 1677299/1677995)
proc_reg_read_iter(99.856% 1674881/1677299)
seq_read_iter(99.959% 1674191/1674881)
allocinfo_start(75.664% 1266755/1674191)
codetag_next_ct(79.217% 1003487/1266755) <---
srso_return_thunk(1.264% 16011/1266755)
__kmalloc_cache_noprof(0.102% 1296/1266755)
...
allocinfo_show(21.287% 356378/1674191)
allocinfo_next(1.530% 25621/1674191)
allocinfo_start() takes about 75% of seq_read().
Private data allocated at open() time can be used to keep the iterator
alive across read() calls and to avoid the memory allocation and
iterator reset on each read(). This way, only O(1) memory allocations
and O(n) iterator steps are needed, and `time` shows the read improving
from ~7ms to ~4ms.
Profiling with the change:
vfs_read(99.865% 1581073/1583214)
proc_reg_read_iter(99.485% 1572934/1581073)
seq_read_iter(99.846% 1570519/1572934)
allocinfo_show(87.428% 1373074/1570519)
seq_buf_printf(83.695% 1149196/1373074)
seq_buf_putc(1.917% 26321/1373074)
_find_next_bit(1.531% 21023/1373074)
...
codetag_to_text(0.490% 6727/1373074)
...
allocinfo_next(6.275% 98543/1570519)
...
allocinfo_start(0.369% 5790/1570519)
...
allocinfo_start() takes less than 1%.
Signed-off-by: David Wang <00107082@163.com>
---
lib/alloc_tag.c | 29 ++++++++++-------------------
1 file changed, 10 insertions(+), 19 deletions(-)
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 25ecc1334b67..fdd5887769a6 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -45,21 +45,16 @@ struct allocinfo_private {
static void *allocinfo_start(struct seq_file *m, loff_t *pos)
{
struct allocinfo_private *priv;
- struct codetag *ct;
loff_t node = *pos;
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- m->private = priv;
- if (!priv)
- return NULL;
-
- priv->print_header = (node == 0);
+ priv = (struct allocinfo_private *)m->private;
codetag_lock_module_list(alloc_tag_cttype, true);
- priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
- while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
- node--;
-
- return ct ? priv : NULL;
+ if (node == 0) {
+ priv->print_header = true;
+ priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
+ codetag_next_ct(&priv->iter);
+ }
+ return priv->iter.ct ? priv : NULL;
}
static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
@@ -76,12 +71,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
static void allocinfo_stop(struct seq_file *m, void *arg)
{
- struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
-
- if (priv) {
- codetag_lock_module_list(alloc_tag_cttype, false);
- kfree(priv);
- }
+ codetag_lock_module_list(alloc_tag_cttype, false);
}
static void print_allocinfo_header(struct seq_buf *buf)
@@ -249,7 +239,8 @@ static void __init procfs_init(void)
if (!mem_profiling_support)
return;
- if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
+ if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
+ sizeof(struct allocinfo_private), NULL)) {
pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
shutdown_mem_profiling(false);
}
--
2.39.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] alloc_tag: add timestamp to codetag iterator
2025-05-09 5:51 ` [PATCH 1/2] alloc_tag: add timestamp to codetag iterator David Wang
@ 2025-05-09 16:10 ` Suren Baghdasaryan
2025-05-09 16:16 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-09 16:10 UTC (permalink / raw)
To: David Wang; +Cc: kent.overstreet, akpm, linux-kernel
On Thu, May 8, 2025 at 10:52 PM David Wang <00107082@163.com> wrote:
>
> The codetag iterator uses an <id,address> pair to guarantee its
> validity. But both the id and the address can be reused: there is a
> theoretical possibility that, when a module is inserted right after
> another module is removed, kmalloc returns an address previously
> kfree'd by the removed module and the IDR hands out the key that was
> just released.
>
> Add a timestamp to codetag_module and codetag_iterator; the
> timestamp is generated by a clock which is strictly
> incremented whenever a module is loaded. An iterator is
> valid if and only if its timestamp matches the codetag_module's.
>
> Signed-off-by: David Wang <00107082@163.com>
> ---
> include/linux/codetag.h | 1 +
> lib/codetag.c | 12 ++++++++++--
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> index d14dbd26b370..61d43c3fbd19 100644
> --- a/include/linux/codetag.h
> +++ b/include/linux/codetag.h
> @@ -54,6 +54,7 @@ struct codetag_iterator {
> struct codetag_module *cmod;
> unsigned long mod_id;
> struct codetag *ct;
> + unsigned long timestamp;
The way you implement this, it is not really a timestamp, more like a
unique id or sequence number. I would suggest calling it mod_seq or
something similar. Maybe: cttype->next_mod_seq, iter->mod_seq and
cmod->mod_seq.
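A minimal sketch of what the suggested naming would look like (illustrative
only; the field placement and comments here are assumptions, not the final
patch):

        struct codetag_type {
                ...
                /* generates a unique sequence number for each module load */
                unsigned long next_mod_seq;
        };

        struct codetag_module {
                ...
                unsigned long mod_seq;  /* next_mod_seq value at load time */
        };

        struct codetag_iterator {
                ...
                unsigned long mod_seq;  /* must match cmod->mod_seq to stay valid */
        };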
> };
>
> #ifdef MODULE
> diff --git a/lib/codetag.c b/lib/codetag.c
> index 42aadd6c1454..973bfa5dd5db 100644
> --- a/lib/codetag.c
> +++ b/lib/codetag.c
> @@ -13,6 +13,8 @@ struct codetag_type {
> struct idr mod_idr;
> struct rw_semaphore mod_lock; /* protects mod_idr */
The comment above should be expanded to say that mod_lock also
protects accesses to cttype->clock, iter->timestamp and
cmod->timestamp.
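For reference, a sketch of how the expanded comment could read (the wording
is only a suggestion):

        /*
         * protects mod_idr, cttype->clock,
         * iter->timestamp and cmod->timestamp
         */
        struct rw_semaphore mod_lock;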
> struct codetag_type_desc desc;
> + /* generates timestamp for module load */
> + unsigned long clock;
> };
>
> struct codetag_range {
> @@ -23,6 +25,7 @@ struct codetag_range {
> struct codetag_module {
> struct module *mod;
> struct codetag_range range;
> + unsigned long timestamp;
> };
>
> static DEFINE_MUTEX(codetag_lock);
> @@ -48,6 +51,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> .cmod = NULL,
> .mod_id = 0,
> .ct = NULL,
> + .timestamp = 0,
> };
>
> return iter;
> @@ -91,11 +95,13 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
> if (!cmod)
> break;
>
> - if (cmod != iter->cmod) {
> + if (!iter->cmod || iter->timestamp != cmod->timestamp) {
> iter->cmod = cmod;
> + iter->timestamp = cmod->timestamp;
> ct = get_first_module_ct(cmod);
> - } else
> + } else {
> ct = get_next_module_ct(iter);
> + }
>
> if (ct)
> break;
> @@ -190,6 +196,8 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> cmod->range = range;
>
> down_write(&cttype->mod_lock);
> + cttype->clock++;
> + cmod->timestamp = cttype->clock;
> err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> if (err >= 0) {
> cttype->count += range_size(cttype, &range);
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] alloc_tag: add timestamp to codetag iterator
2025-05-09 16:10 ` Suren Baghdasaryan
@ 2025-05-09 16:16 ` David Wang
0 siblings, 0 replies; 44+ messages in thread
From: David Wang @ 2025-05-09 16:16 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: kent.overstreet, akpm, linux-kernel
At 2025-05-10 00:10:01, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Thu, May 8, 2025 at 10:52 PM David Wang <00107082@163.com> wrote:
>>
>> The codetag iterator uses an <id,address> pair to guarantee its
>> validity. But both the id and the address can be reused: there is a
>> theoretical possibility that, when a module is inserted right after
>> another module is removed, kmalloc returns an address previously
>> kfree'd by the removed module and the IDR hands out the key that was
>> just released.
>>
>> Add a timestamp to codetag_module and codetag_iterator; the
>> timestamp is generated by a clock which is strictly
>> incremented whenever a module is loaded. An iterator is
>> valid if and only if its timestamp matches the codetag_module's.
>>
>> Signed-off-by: David Wang <00107082@163.com>
>> ---
>> include/linux/codetag.h | 1 +
>> lib/codetag.c | 12 ++++++++++--
>> 2 files changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
>> index d14dbd26b370..61d43c3fbd19 100644
>> --- a/include/linux/codetag.h
>> +++ b/include/linux/codetag.h
>> @@ -54,6 +54,7 @@ struct codetag_iterator {
>> struct codetag_module *cmod;
>> unsigned long mod_id;
>> struct codetag *ct;
>> + unsigned long timestamp;
>
>The way you implement this, it is not really a timestamp, more like a
>unique id or sequence number. I would suggest calling it mod_seq or
>something similar. Maybe: cttype->next_mod_seq, iter->mod_seq and
>cmod->mod_seq.
Copy that~
>
>> };
>>
>> #ifdef MODULE
>> diff --git a/lib/codetag.c b/lib/codetag.c
>> index 42aadd6c1454..973bfa5dd5db 100644
>> --- a/lib/codetag.c
>> +++ b/lib/codetag.c
>> @@ -13,6 +13,8 @@ struct codetag_type {
>> struct idr mod_idr;
>> struct rw_semaphore mod_lock; /* protects mod_idr */
>
>The comment above should be expanded to say that mod_lock also
>protects accesses to cttype->clock, iter->timestamp and
>cmod->timestamp.
Good catch~!
Thanks
>
>> struct codetag_type_desc desc;
>> + /* generates timestamp for module load */
>> + unsigned long clock;
>> };
>>
>> struct codetag_range {
>> @@ -23,6 +25,7 @@ struct codetag_range {
>> struct codetag_module {
>> struct module *mod;
>> struct codetag_range range;
>> + unsigned long timestamp;
>> };
>>
>> static DEFINE_MUTEX(codetag_lock);
>> @@ -48,6 +51,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
>> .cmod = NULL,
>> .mod_id = 0,
>> .ct = NULL,
>> + .timestamp = 0,
>> };
>>
>> return iter;
>> @@ -91,11 +95,13 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
>> if (!cmod)
>> break;
>>
>> - if (cmod != iter->cmod) {
>> + if (!iter->cmod || iter->timestamp != cmod->timestamp) {
>> iter->cmod = cmod;
>> + iter->timestamp = cmod->timestamp;
>> ct = get_first_module_ct(cmod);
>> - } else
>> + } else {
>> ct = get_next_module_ct(iter);
>> + }
>>
>> if (ct)
>> break;
>> @@ -190,6 +196,8 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
>> cmod->range = range;
>>
>> down_write(&cttype->mod_lock);
>> + cttype->clock++;
>> + cmod->timestamp = cttype->clock;
>> err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
>> if (err >= 0) {
>> cttype->count += range_size(cttype, &range);
>> --
>> 2.39.2
>>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 2/2] alloc_tag: keep codetag iterator cross read() calls
2025-05-09 5:53 ` [PATCH 2/2] alloc_tag: keep codetag iterator cross read() calls David Wang
@ 2025-05-09 17:34 ` Suren Baghdasaryan
2025-05-09 17:45 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-09 17:34 UTC (permalink / raw)
To: David Wang; +Cc: kent.overstreet, akpm, linux-kernel, linux-mm
On Thu, May 8, 2025 at 10:53 PM David Wang <00107082@163.com> wrote:
>
> When reading /proc/allocinfo, each read() syscall makes seq_file
> invoke the start/stop callbacks. In the start callback, memory is
> allocated to store an iterator, and the iterator walks from the
> beginning up to the current read position.
>
> A seq_file read() returns at most 4096 bytes, even when called with a
> larger user-space buffer, so reading out all of /proc/allocinfo takes
> tens of read syscalls. For example, a 306036-byte allocinfo file needs
> 76 reads:
>
> $ sudo cat /proc/allocinfo | wc
> 3964 16678 306036
> $ sudo strace -T -e read cat /proc/allocinfo
> ...
> read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
> ...
> read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
> ...
> For those n=3964 lines, each read covers about m=3964/76=52 lines.
> Since the iterator restarts from the beginning on each read(),
> it moves forward
> m steps on the 1st read
> 2*m steps on the 2nd read
> 3*m steps on the 3rd read
> ...
> n steps on the last read
> As the reads progress, the growing number of iterator steps makes each
> read() call slower than the last. Adding those up, the codetag iterator
> moves about O(n*n/m) steps, so data structure traversal takes a
> significant part of reading the whole file.
> Profiling while stress-reading /proc/allocinfo confirms it:
>
> vfs_read(99.959% 1677299/1677995)
> proc_reg_read_iter(99.856% 1674881/1677299)
> seq_read_iter(99.959% 1674191/1674881)
> allocinfo_start(75.664% 1266755/1674191)
> codetag_next_ct(79.217% 1003487/1266755) <---
> srso_return_thunk(1.264% 16011/1266755)
> __kmalloc_cache_noprof(0.102% 1296/1266755)
> ...
> allocinfo_show(21.287% 356378/1674191)
> allocinfo_next(1.530% 25621/1674191)
> allocinfo_start() takes about 75% of seq_read().
>
> Private data allocated at open() time can be used to keep the iterator
> alive across read() calls and to avoid the memory allocation and
> iterator reset on each read(). This way, only O(1) memory allocations
> and O(n) iterator steps are needed, and `time` shows the read improving
> from ~7ms to ~4ms.
> Profiling with the change:
>
> vfs_read(99.865% 1581073/1583214)
> proc_reg_read_iter(99.485% 1572934/1581073)
> seq_read_iter(99.846% 1570519/1572934)
> allocinfo_show(87.428% 1373074/1570519)
> seq_buf_printf(83.695% 1149196/1373074)
> seq_buf_putc(1.917% 26321/1373074)
> _find_next_bit(1.531% 21023/1373074)
> ...
> codetag_to_text(0.490% 6727/1373074)
> ...
> allocinfo_next(6.275% 98543/1570519)
> ...
> allocinfo_start(0.369% 5790/1570519)
> ...
> allocinfo_start() takes less than 1%.
>
> Signed-off-by: David Wang <00107082@163.com>
I think you will be posting another version to address comments in the
first patch, but for this patch feel free to add:
Acked-by: Suren Baghdasaryan <surenb@google.com>
Thanks!
> ---
> lib/alloc_tag.c | 29 ++++++++++-------------------
> 1 file changed, 10 insertions(+), 19 deletions(-)
>
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 25ecc1334b67..fdd5887769a6 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -45,21 +45,16 @@ struct allocinfo_private {
> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> {
> struct allocinfo_private *priv;
> - struct codetag *ct;
> loff_t node = *pos;
>
> - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> - m->private = priv;
> - if (!priv)
> - return NULL;
> -
> - priv->print_header = (node == 0);
> + priv = (struct allocinfo_private *)m->private;
> codetag_lock_module_list(alloc_tag_cttype, true);
> - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> - node--;
> -
> - return ct ? priv : NULL;
> + if (node == 0) {
> + priv->print_header = true;
> + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> + codetag_next_ct(&priv->iter);
> + }
> + return priv->iter.ct ? priv : NULL;
> }
>
> static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
> @@ -76,12 +71,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
>
> static void allocinfo_stop(struct seq_file *m, void *arg)
> {
> - struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
> -
> - if (priv) {
> - codetag_lock_module_list(alloc_tag_cttype, false);
> - kfree(priv);
> - }
> + codetag_lock_module_list(alloc_tag_cttype, false);
> }
>
> static void print_allocinfo_header(struct seq_buf *buf)
> @@ -249,7 +239,8 @@ static void __init procfs_init(void)
> if (!mem_profiling_support)
> return;
>
> - if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
> + if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> + sizeof(struct allocinfo_private), NULL)) {
> pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> shutdown_mem_profiling(false);
> }
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v2 1/2] alloc_tag: add sequence number for module and iterator
2025-05-07 17:55 [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo David Wang
` (3 preceding siblings ...)
2025-05-09 5:53 ` [PATCH 2/2] alloc_tag: keep codetag iterator cross read() calls David Wang
@ 2025-05-09 17:38 ` David Wang
2025-05-09 19:25 ` Suren Baghdasaryan
2025-05-09 17:39 ` [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls David Wang
5 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-09 17:38 UTC (permalink / raw)
To: surenb, kent.overstreet, akpm; +Cc: linux-kernel, David Wang
The codetag iterator uses an <id,address> pair to guarantee its
validity. But both the id and the address can be reused: there is a
theoretical possibility that, when a module is inserted right after
another module is removed, kmalloc returns the same address that was
kfree'd by the previous module and the IDR reuses the key that was
just removed.
Add a sequence number to codetag_module and codetag_iterator;
the sequence number is strictly incremented whenever a module
is loaded. An iterator is valid if and only if its sequence
number matches the codetag_module's.
Signed-off-by: David Wang <00107082@163.com>
---
include/linux/codetag.h | 1 +
lib/codetag.c | 17 ++++++++++++++---
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index d14dbd26b370..90f707c3821f 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -54,6 +54,7 @@ struct codetag_iterator {
struct codetag_module *cmod;
unsigned long mod_id;
struct codetag *ct;
+ unsigned long mod_seq;
};
#ifdef MODULE
diff --git a/lib/codetag.c b/lib/codetag.c
index 42aadd6c1454..496cef7cdad3 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -11,8 +11,14 @@ struct codetag_type {
struct list_head link;
unsigned int count;
struct idr mod_idr;
- struct rw_semaphore mod_lock; /* protects mod_idr */
+ /*
+ * protects mod_idr, next_mod_seq,
+ * iter->mod_seq and cmod->mod_seq
+ */
+ struct rw_semaphore mod_lock;
struct codetag_type_desc desc;
+ /* generates unique sequence number for module load */
+ unsigned long next_mod_seq;
};
struct codetag_range {
@@ -23,6 +29,7 @@ struct codetag_range {
struct codetag_module {
struct module *mod;
struct codetag_range range;
+ unsigned long mod_seq;
};
static DEFINE_MUTEX(codetag_lock);
@@ -48,6 +55,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
.cmod = NULL,
.mod_id = 0,
.ct = NULL,
+ .mod_seq = 0,
};
return iter;
@@ -91,11 +99,13 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
if (!cmod)
break;
- if (cmod != iter->cmod) {
+ if (!iter->cmod || iter->mod_seq != cmod->mod_seq) {
iter->cmod = cmod;
+ iter->mod_seq = cmod->mod_seq;
ct = get_first_module_ct(cmod);
- } else
+ } else {
ct = get_next_module_ct(iter);
+ }
if (ct)
break;
@@ -190,6 +200,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
cmod->range = range;
down_write(&cttype->mod_lock);
+ cmod->mod_seq = ++cttype->next_mod_seq;
err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
if (err >= 0) {
cttype->count += range_size(cttype, &range);
--
2.39.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-07 17:55 [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo David Wang
` (4 preceding siblings ...)
2025-05-09 17:38 ` [PATCH v2 1/2] alloc_tag: add sequence number for module and iterator David Wang
@ 2025-05-09 17:39 ` David Wang
2025-05-09 18:33 ` Tim Chen
5 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-09 17:39 UTC (permalink / raw)
To: surenb, kent.overstreet, akpm; +Cc: linux-kernel, linux-mm, David Wang
When reading /proc/allocinfo, each read() syscall makes seq_file
invoke the start/stop callbacks. In the start callback, memory is
allocated to store an iterator, and the iterator walks linearly from
the beginning up to the current read position.
A seq_file read() returns at most 4096 bytes, even when called with a
larger user-space buffer, so reading out all of /proc/allocinfo takes
tens of read syscalls. For example, a 306036-byte allocinfo file needs
76 reads:
$ sudo cat /proc/allocinfo | wc
3964 16678 306036
$ sudo strace -T -e read cat /proc/allocinfo
...
read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
...
read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
...
For those n=3964 lines, each read covers about m=3964/76=52 lines.
Since the iterator restarts from the beginning on each read(),
it moves forward
m steps on the 1st read
2*m steps on the 2nd read
3*m steps on the 3rd read
...
n steps on the last read
As the reads progress, those linear seek steps make each read() call
slower than the last. Adding those up, the codetag iterator moves about
O(n*n/m) steps, so data structure traversal takes a significant part of
reading the whole file.
Profiling while stress-reading /proc/allocinfo confirms it:
vfs_read(99.959% 1677299/1677995)
proc_reg_read_iter(99.856% 1674881/1677299)
seq_read_iter(99.959% 1674191/1674881)
allocinfo_start(75.664% 1266755/1674191)
codetag_next_ct(79.217% 1003487/1266755) <---
srso_return_thunk(1.264% 16011/1266755)
__kmalloc_cache_noprof(0.102% 1296/1266755)
...
allocinfo_show(21.287% 356378/1674191)
allocinfo_next(1.530% 25621/1674191)
codetag_next_ct() takes the major part.
Private data allocated at open() time can be used to keep the iterator
alive across read() calls and to avoid the memory allocation and
iterator reset on each read(). This way, only O(1) memory allocations
and O(n) iterator steps are needed, and `time` shows the read improving
from ~7ms to ~4ms.
Profiling with the change:
vfs_read(99.865% 1581073/1583214)
proc_reg_read_iter(99.485% 1572934/1581073)
seq_read_iter(99.846% 1570519/1572934)
allocinfo_show(87.428% 1373074/1570519)
seq_buf_printf(83.695% 1149196/1373074)
seq_buf_putc(1.917% 26321/1373074)
_find_next_bit(1.531% 21023/1373074)
...
codetag_to_text(0.490% 6727/1373074)
...
allocinfo_next(6.275% 98543/1570519)
...
allocinfo_start(0.369% 5790/1570519)
...
Now seq_buf_printf() takes the major part.
Signed-off-by: David Wang <00107082@163.com>
---
lib/alloc_tag.c | 29 ++++++++++-------------------
1 file changed, 10 insertions(+), 19 deletions(-)
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 25ecc1334b67..fdd5887769a6 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -45,21 +45,16 @@ struct allocinfo_private {
static void *allocinfo_start(struct seq_file *m, loff_t *pos)
{
struct allocinfo_private *priv;
- struct codetag *ct;
loff_t node = *pos;
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- m->private = priv;
- if (!priv)
- return NULL;
-
- priv->print_header = (node == 0);
+ priv = (struct allocinfo_private *)m->private;
codetag_lock_module_list(alloc_tag_cttype, true);
- priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
- while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
- node--;
-
- return ct ? priv : NULL;
+ if (node == 0) {
+ priv->print_header = true;
+ priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
+ codetag_next_ct(&priv->iter);
+ }
+ return priv->iter.ct ? priv : NULL;
}
static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
@@ -76,12 +71,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
static void allocinfo_stop(struct seq_file *m, void *arg)
{
- struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
-
- if (priv) {
- codetag_lock_module_list(alloc_tag_cttype, false);
- kfree(priv);
- }
+ codetag_lock_module_list(alloc_tag_cttype, false);
}
static void print_allocinfo_header(struct seq_buf *buf)
@@ -249,7 +239,8 @@ static void __init procfs_init(void)
if (!mem_profiling_support)
return;
- if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
+ if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
+ sizeof(struct allocinfo_private), NULL)) {
pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
shutdown_mem_profiling(false);
}
--
2.39.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 2/2] alloc_tag: keep codetag iterator cross read() calls
2025-05-09 17:34 ` Suren Baghdasaryan
@ 2025-05-09 17:45 ` David Wang
0 siblings, 0 replies; 44+ messages in thread
From: David Wang @ 2025-05-09 17:45 UTC (permalink / raw)
To: Suren Baghdasaryan; +Cc: kent.overstreet, akpm, linux-kernel, linux-mm
At 2025-05-10 01:34:52, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Thu, May 8, 2025 at 10:53 PM David Wang <00107082@163.com> wrote:
>>
>> When reading /proc/allocinfo, each read() syscall makes seq_file
>> invoke the start/stop callbacks. In the start callback, memory is
>> allocated to store an iterator, and the iterator walks from the
>> beginning up to the current read position.
>>
>> A seq_file read() returns at most 4096 bytes, even when called with a
>> larger user-space buffer, so reading out all of /proc/allocinfo takes
>> tens of read syscalls. For example, a 306036-byte allocinfo file needs
>> 76 reads:
>>
>> $ sudo cat /proc/allocinfo | wc
>> 3964 16678 306036
>> $ sudo strace -T -e read cat /proc/allocinfo
>> ...
>> read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
>> ...
>> read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
>> ...
>> For those n=3964 lines, each read covers about m=3964/76=52 lines.
>> Since the iterator restarts from the beginning on each read(),
>> it moves forward
>> m steps on the 1st read
>> 2*m steps on the 2nd read
>> 3*m steps on the 3rd read
>> ...
>> n steps on the last read
>> As the reads progress, the growing number of iterator steps makes each
>> read() call slower than the last. Adding those up, the codetag iterator
>> moves about O(n*n/m) steps, so data structure traversal takes a
>> significant part of reading the whole file.
>> Profiling while stress-reading /proc/allocinfo confirms it:
>>
>> vfs_read(99.959% 1677299/1677995)
>> proc_reg_read_iter(99.856% 1674881/1677299)
>> seq_read_iter(99.959% 1674191/1674881)
>> allocinfo_start(75.664% 1266755/1674191)
>> codetag_next_ct(79.217% 1003487/1266755) <---
>> srso_return_thunk(1.264% 16011/1266755)
>> __kmalloc_cache_noprof(0.102% 1296/1266755)
>> ...
>> allocinfo_show(21.287% 356378/1674191)
>> allocinfo_next(1.530% 25621/1674191)
>> allocinfo_start() takes about 75% of seq_read().
>>
>> Private data allocated at open() time can be used to keep the iterator
>> alive across read() calls and to avoid the memory allocation and
>> iterator reset on each read(). This way, only O(1) memory allocations
>> and O(n) iterator steps are needed, and `time` shows the read improving
>> from ~7ms to ~4ms.
>> Profiling with the change:
>>
>> vfs_read(99.865% 1581073/1583214)
>> proc_reg_read_iter(99.485% 1572934/1581073)
>> seq_read_iter(99.846% 1570519/1572934)
>> allocinfo_show(87.428% 1373074/1570519)
>> seq_buf_printf(83.695% 1149196/1373074)
>> seq_buf_putc(1.917% 26321/1373074)
>> _find_next_bit(1.531% 21023/1373074)
>> ...
>> codetag_to_text(0.490% 6727/1373074)
>> ...
>> allocinfo_next(6.275% 98543/1570519)
>> ...
>> allocinfo_start(0.369% 5790/1570519)
>> ...
>> allocinfo_start() takes less than 1%.
>>
>> Signed-off-by: David Wang <00107082@163.com>
>
>I think you will be posting another version to address comments in the
>first patch, but for this patch feel free to add:
>
>Acked-by: Suren Baghdasaryan <surenb@google.com>
>
>Thanks!
>
Oops.... I just made some changes to the commit message.
(Found some typos and did some re-wording....)
>> ---
>> lib/alloc_tag.c | 29 ++++++++++-------------------
>> 1 file changed, 10 insertions(+), 19 deletions(-)
>>
>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
>> index 25ecc1334b67..fdd5887769a6 100644
>> --- a/lib/alloc_tag.c
>> +++ b/lib/alloc_tag.c
>> @@ -45,21 +45,16 @@ struct allocinfo_private {
>> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
>> {
>> struct allocinfo_private *priv;
>> - struct codetag *ct;
>> loff_t node = *pos;
>>
>> - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> - m->private = priv;
>> - if (!priv)
>> - return NULL;
>> -
>> - priv->print_header = (node == 0);
>> + priv = (struct allocinfo_private *)m->private;
>> codetag_lock_module_list(alloc_tag_cttype, true);
>> - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
>> - node--;
>> -
>> - return ct ? priv : NULL;
>> + if (node == 0) {
>> + priv->print_header = true;
>> + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> + codetag_next_ct(&priv->iter);
>> + }
>> + return priv->iter.ct ? priv : NULL;
>> }
>>
>> static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
>> @@ -76,12 +71,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
>>
>> static void allocinfo_stop(struct seq_file *m, void *arg)
>> {
>> - struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
>> -
>> - if (priv) {
>> - codetag_lock_module_list(alloc_tag_cttype, false);
>> - kfree(priv);
>> - }
>> + codetag_lock_module_list(alloc_tag_cttype, false);
>> }
>>
>> static void print_allocinfo_header(struct seq_buf *buf)
>> @@ -249,7 +239,8 @@ static void __init procfs_init(void)
>> if (!mem_profiling_support)
>> return;
>>
>> - if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
>> + if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
>> + sizeof(struct allocinfo_private), NULL)) {
>> pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
>> shutdown_mem_profiling(false);
>> }
>> --
>> 2.39.2
>>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-09 17:39 ` [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls David Wang
@ 2025-05-09 18:33 ` Tim Chen
2025-05-09 19:36 ` Suren Baghdasaryan
2025-05-10 3:25 ` David Wang
0 siblings, 2 replies; 44+ messages in thread
From: Tim Chen @ 2025-05-09 18:33 UTC (permalink / raw)
To: David Wang, surenb, kent.overstreet, akpm; +Cc: linux-kernel, linux-mm
On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
>
>
> Signed-off-by: David Wang <00107082@163.com>
> ---
> lib/alloc_tag.c | 29 ++++++++++-------------------
> 1 file changed, 10 insertions(+), 19 deletions(-)
>
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index 25ecc1334b67..fdd5887769a6 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -45,21 +45,16 @@ struct allocinfo_private {
> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> {
> struct allocinfo_private *priv;
> - struct codetag *ct;
> loff_t node = *pos;
>
> - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> - m->private = priv;
> - if (!priv)
> - return NULL;
> -
> - priv->print_header = (node == 0);
> + priv = (struct allocinfo_private *)m->private;
> codetag_lock_module_list(alloc_tag_cttype, true);
> - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> - node--;
> -
> - return ct ? priv : NULL;
> + if (node == 0) {
> + priv->print_header = true;
> + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> + codetag_next_ct(&priv->iter);
> + }
Do you need to skip printing the header when *pos != 0? I.e. add
} else {
priv->print_header = false;
}
Tim
> + return priv->iter.ct ? priv : NULL;
> }
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 1/2] alloc_tag: add sequence number for module and iterator
2025-05-09 17:38 ` [PATCH v2 1/2] alloc_tag: add sequence number for module and iterator David Wang
@ 2025-05-09 19:25 ` Suren Baghdasaryan
2025-06-09 6:42 ` [PATCH v3 " David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-09 19:25 UTC (permalink / raw)
To: David Wang; +Cc: kent.overstreet, akpm, linux-kernel
On Fri, May 9, 2025 at 10:39 AM David Wang <00107082@163.com> wrote:
>
> The codetag iterator uses an <id,address> pair to guarantee its
> validity. But both the id and the address can be reused: there is a
> theoretical possibility that, when a module is inserted right after
> another module is removed, kmalloc returns the same address that was
> kfree'd by the previous module and the IDR reuses the key that was
> just removed.
>
> Add a sequence number to codetag_module and codetag_iterator;
> the sequence number is strictly incremented whenever a module
> is loaded. An iterator is valid if and only if its sequence
> number matches the codetag_module's.
>
> Signed-off-by: David Wang <00107082@163.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
> ---
> include/linux/codetag.h | 1 +
> lib/codetag.c | 17 ++++++++++++++---
> 2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> index d14dbd26b370..90f707c3821f 100644
> --- a/include/linux/codetag.h
> +++ b/include/linux/codetag.h
> @@ -54,6 +54,7 @@ struct codetag_iterator {
> struct codetag_module *cmod;
> unsigned long mod_id;
> struct codetag *ct;
> + unsigned long mod_seq;
> };
>
> #ifdef MODULE
> diff --git a/lib/codetag.c b/lib/codetag.c
> index 42aadd6c1454..496cef7cdad3 100644
> --- a/lib/codetag.c
> +++ b/lib/codetag.c
> @@ -11,8 +11,14 @@ struct codetag_type {
> struct list_head link;
> unsigned int count;
> struct idr mod_idr;
> - struct rw_semaphore mod_lock; /* protects mod_idr */
> + /*
> + * protects mod_idr, next_mod_seq,
> + * iter->mod_seq and cmod->mod_seq
> + */
> + struct rw_semaphore mod_lock;
> struct codetag_type_desc desc;
> + /* generates unique sequence number for module load */
> + unsigned long next_mod_seq;
> };
>
> struct codetag_range {
> @@ -23,6 +29,7 @@ struct codetag_range {
> struct codetag_module {
> struct module *mod;
> struct codetag_range range;
> + unsigned long mod_seq;
> };
>
> static DEFINE_MUTEX(codetag_lock);
> @@ -48,6 +55,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> .cmod = NULL,
> .mod_id = 0,
> .ct = NULL,
> + .mod_seq = 0,
> };
>
> return iter;
> @@ -91,11 +99,13 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
> if (!cmod)
> break;
>
> - if (cmod != iter->cmod) {
> + if (!iter->cmod || iter->mod_seq != cmod->mod_seq) {
> iter->cmod = cmod;
> + iter->mod_seq = cmod->mod_seq;
> ct = get_first_module_ct(cmod);
> - } else
> + } else {
> ct = get_next_module_ct(iter);
> + }
>
> if (ct)
> break;
> @@ -190,6 +200,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> cmod->range = range;
>
> down_write(&cttype->mod_lock);
> + cmod->mod_seq = ++cttype->next_mod_seq;
> err = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> if (err >= 0) {
> cttype->count += range_size(cttype, &range);
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-09 18:33 ` Tim Chen
@ 2025-05-09 19:36 ` Suren Baghdasaryan
2025-05-09 19:46 ` Tim Chen
2025-05-10 3:25 ` David Wang
1 sibling, 1 reply; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-09 19:36 UTC (permalink / raw)
To: Tim Chen; +Cc: David Wang, kent.overstreet, akpm, linux-kernel, linux-mm
On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
> >
> >
> > Signed-off-by: David Wang <00107082@163.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> > lib/alloc_tag.c | 29 ++++++++++-------------------
> > 1 file changed, 10 insertions(+), 19 deletions(-)
> >
> > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > index 25ecc1334b67..fdd5887769a6 100644
> > --- a/lib/alloc_tag.c
> > +++ b/lib/alloc_tag.c
> > @@ -45,21 +45,16 @@ struct allocinfo_private {
> > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > {
> > struct allocinfo_private *priv;
> > - struct codetag *ct;
> > loff_t node = *pos;
> >
> > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > - m->private = priv;
> > - if (!priv)
> > - return NULL;
> > -
> > - priv->print_header = (node == 0);
> > + priv = (struct allocinfo_private *)m->private;
> > codetag_lock_module_list(alloc_tag_cttype, true);
> > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> > - node--;
> > -
> > - return ct ? priv : NULL;
> > + if (node == 0) {
> > + priv->print_header = true;
> > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> > + codetag_next_ct(&priv->iter);
> > + }
>
> Do you need to skip print header when *pos != 0? i.e add
Technically not needed since proc_create_seq_private() allocates
seq->private using kzalloc(), so the initial value of
priv->print_header is always false.
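Roughly, the allocation path looks like this (a simplified sketch with
illustrative names, not the verbatim fs/seq_file.c code):

        /* proc_create_seq_private(..., sizeof(struct allocinfo_private), NULL)
         * records the state size; on open() the seq_file core then allocates
         * the per-file private area along these lines:
         */
        priv = kzalloc(state_size, GFP_KERNEL); /* zero-filled ...              */
        seq->private = priv;                    /* ... so print_header == false */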
>
> } else {
> priv->print_header = false;
> }
>
> Tim
>
> > + return priv->iter.ct ? priv : NULL;
> > }
> >
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-09 19:36 ` Suren Baghdasaryan
@ 2025-05-09 19:46 ` Tim Chen
2025-05-09 20:46 ` Suren Baghdasaryan
2025-05-10 3:35 ` [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls David Wang
0 siblings, 2 replies; 44+ messages in thread
From: Tim Chen @ 2025-05-09 19:46 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: David Wang, kent.overstreet, akpm, linux-kernel, linux-mm
On Fri, 2025-05-09 at 12:36 -0700, Suren Baghdasaryan wrote:
> On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
> > >
> > >
> > > Signed-off-by: David Wang <00107082@163.com>
>
> Acked-by: Suren Baghdasaryan <surenb@google.com>
>
> > > ---
> > > lib/alloc_tag.c | 29 ++++++++++-------------------
> > > 1 file changed, 10 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > > index 25ecc1334b67..fdd5887769a6 100644
> > > --- a/lib/alloc_tag.c
> > > +++ b/lib/alloc_tag.c
> > > @@ -45,21 +45,16 @@ struct allocinfo_private {
> > > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > > {
> > > struct allocinfo_private *priv;
> > > - struct codetag *ct;
> > > loff_t node = *pos;
> > >
> > > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > > - m->private = priv;
> > > - if (!priv)
> > > - return NULL;
> > > -
> > > - priv->print_header = (node == 0);
> > > + priv = (struct allocinfo_private *)m->private;
> > > codetag_lock_module_list(alloc_tag_cttype, true);
> > > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> > > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> > > - node--;
> > > -
> > > - return ct ? priv : NULL;
> > > + if (node == 0) {
> > > + priv->print_header = true;
> > > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> > > + codetag_next_ct(&priv->iter);
> > > + }
> >
> > Do you need to skip print header when *pos != 0? i.e add
>
> Technically not needed since proc_create_seq_private() allocates
> seq->private using kzalloc(), so the initial value of
> priv->print_header is always false.
But we'll start with the first call to allocinfo_start() with *pos == 0,
so print_header will be initialized to true.
Will there be subsequent calls of allocinfo_start() with *pos != 0
where priv->print_header stays at 0?
Tim
>
> >
> > } else {
> > priv->print_header = false;
> > }
> >
> > Tim
> >
> > > + return priv->iter.ct ? priv : NULL;
> > > }
> > >
> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-09 19:46 ` Tim Chen
@ 2025-05-09 20:46 ` Suren Baghdasaryan
2025-05-09 21:15 ` Suren Baghdasaryan
2025-05-10 3:35 ` [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls David Wang
1 sibling, 1 reply; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-09 20:46 UTC (permalink / raw)
To: Tim Chen; +Cc: David Wang, kent.overstreet, akpm, linux-kernel, linux-mm
On Fri, May 9, 2025 at 12:46 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Fri, 2025-05-09 at 12:36 -0700, Suren Baghdasaryan wrote:
> > On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > >
> > > On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
> > > >
> > > >
> > > > Signed-off-by: David Wang <00107082@163.com>
> >
> > Acked-by: Suren Baghdasaryan <surenb@google.com>
> >
> > > > ---
> > > > lib/alloc_tag.c | 29 ++++++++++-------------------
> > > > 1 file changed, 10 insertions(+), 19 deletions(-)
> > > >
> > > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > > > index 25ecc1334b67..fdd5887769a6 100644
> > > > --- a/lib/alloc_tag.c
> > > > +++ b/lib/alloc_tag.c
> > > > @@ -45,21 +45,16 @@ struct allocinfo_private {
> > > > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > > > {
> > > > struct allocinfo_private *priv;
> > > > - struct codetag *ct;
> > > > loff_t node = *pos;
> > > >
> > > > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > > > - m->private = priv;
> > > > - if (!priv)
> > > > - return NULL;
> > > > -
> > > > - priv->print_header = (node == 0);
> > > > + priv = (struct allocinfo_private *)m->private;
> > > > codetag_lock_module_list(alloc_tag_cttype, true);
> > > > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> > > > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> > > > - node--;
> > > > -
> > > > - return ct ? priv : NULL;
> > > > + if (node == 0) {
> > > > + priv->print_header = true;
> > > > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> > > > + codetag_next_ct(&priv->iter);
> > > > + }
> > >
> > > Do you need to skip print header when *pos != 0? i.e add
> >
> > Technically not needed since proc_create_seq_private() allocates
> > seq->private using kzalloc(), so the initial value of
> > priv->print_header is always false.
>
> But we'll start with first call to allocinfo_start() with *pos == 0,
Usually but not always if we do lseek() to a non-zero position beforehand.
> then print_header will be initialized to true.
After the first call to allocinfo_show() print_header will be reset
back to false.
> Will there be subsequent calls of allocinfo_start() with *pos !=0,
> but priv->print_header stays at 0?
Yes, there will be subsequent calls to allocinfo_start() with *pos !=0
and priv->print_header=false, which is what we want, right? We want to
print the header only at the beginning of the file (node == 0).
>
> Tim
> >
> > >
> > > } else {
> > > priv->print_header = false;
> > > }
> > >
> > > Tim
> > >
> > > > + return priv->iter.ct ? priv : NULL;
> > > > }
> > > >
> > >
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-09 20:46 ` Suren Baghdasaryan
@ 2025-05-09 21:15 ` Suren Baghdasaryan
2025-05-10 3:10 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-09 21:15 UTC (permalink / raw)
To: Tim Chen; +Cc: David Wang, kent.overstreet, akpm, linux-kernel, linux-mm
On Fri, May 9, 2025 at 1:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Fri, May 9, 2025 at 12:46 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >
> > On Fri, 2025-05-09 at 12:36 -0700, Suren Baghdasaryan wrote:
> > > On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > > >
> > > > On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
> > > > >
> > > > >
> > > > > Signed-off-by: David Wang <00107082@163.com>
> > >
> > > Acked-by: Suren Baghdasaryan <surenb@google.com>
> > >
> > > > > ---
> > > > > lib/alloc_tag.c | 29 ++++++++++-------------------
> > > > > 1 file changed, 10 insertions(+), 19 deletions(-)
> > > > >
> > > > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> > > > > index 25ecc1334b67..fdd5887769a6 100644
> > > > > --- a/lib/alloc_tag.c
> > > > > +++ b/lib/alloc_tag.c
> > > > > @@ -45,21 +45,16 @@ struct allocinfo_private {
> > > > > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> > > > > {
> > > > > struct allocinfo_private *priv;
> > > > > - struct codetag *ct;
> > > > > loff_t node = *pos;
> > > > >
> > > > > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> > > > > - m->private = priv;
> > > > > - if (!priv)
> > > > > - return NULL;
> > > > > -
> > > > > - priv->print_header = (node == 0);
> > > > > + priv = (struct allocinfo_private *)m->private;
> > > > > codetag_lock_module_list(alloc_tag_cttype, true);
> > > > > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> > > > > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> > > > > - node--;
> > > > > -
> > > > > - return ct ? priv : NULL;
> > > > > + if (node == 0) {
> > > > > + priv->print_header = true;
> > > > > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> > > > > + codetag_next_ct(&priv->iter);
> > > > > + }
> > > >
> > > > Do you need to skip print header when *pos != 0? i.e add
> > >
> > > Technically not needed since proc_create_seq_private() allocates
> > > seq->private using kzalloc(), so the initial value of
> > > priv->print_header is always false.
> >
> > But we'll start with first call to allocinfo_start() with *pos == 0,
>
> Usually but not always if we do lseek() to a non-zero position beforehand.
Actually, this change will break the lseek() case. We can't always
assume that we start reading from *pos == 0. The current patch will fail
to initialize priv if we start reading with *pos != 0.
priv->iter should be tracking current position and allocinfo_start()
should detect a mismatch between *pos and iter->pos and re-walk the
tags if there was a position change.
>
> > then print_header will be initialized to true.
>
> After the first call to allocinfo_show() print_header will be reset
> back to false.
>
> > Will there be subsequent calls of allocinfo_start() with *pos !=0,
> > but priv->print_header stays at 0?
>
> Yes, there will be subsequent calls to allocinfo_start() with *pos !=0
> and priv->print_header=false, which is what we want, right? We want to
> print the header only at the beginning of the file (node == 0).
>
> >
> > Tim
> > >
> > > >
> > > > } else {
> > > > priv->print_header = false;
> > > > }
> > > >
> > > > Tim
> > > >
> > > > > + return priv->iter.ct ? priv : NULL;
> > > > > }
> > > > >
> > > >
> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-09 21:15 ` Suren Baghdasaryan
@ 2025-05-10 3:10 ` David Wang
2025-05-10 3:30 ` Suren Baghdasaryan
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-10 3:10 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Tim Chen, kent.overstreet, akpm, linux-kernel, linux-mm
At 2025-05-10 05:15:43, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Fri, May 9, 2025 at 1:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Fri, May 9, 2025 at 12:46 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> >
>> > On Fri, 2025-05-09 at 12:36 -0700, Suren Baghdasaryan wrote:
>> > > On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> > > >
>> > > > On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
>> > > > >
>> > > > >
>> > > > > Signed-off-by: David Wang <00107082@163.com>
>> > >
>> > > Acked-by: Suren Baghdasaryan <surenb@google.com>
>> > >
>> > > > > ---
>> > > > > lib/alloc_tag.c | 29 ++++++++++-------------------
>> > > > > 1 file changed, 10 insertions(+), 19 deletions(-)
>> > > > >
>> > > > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
>> > > > > index 25ecc1334b67..fdd5887769a6 100644
>> > > > > --- a/lib/alloc_tag.c
>> > > > > +++ b/lib/alloc_tag.c
>> > > > > @@ -45,21 +45,16 @@ struct allocinfo_private {
>> > > > > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
>> > > > > {
>> > > > > struct allocinfo_private *priv;
>> > > > > - struct codetag *ct;
>> > > > > loff_t node = *pos;
>> > > > >
>> > > > > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> > > > > - m->private = priv;
>> > > > > - if (!priv)
>> > > > > - return NULL;
>> > > > > -
>> > > > > - priv->print_header = (node == 0);
>> > > > > + priv = (struct allocinfo_private *)m->private;
>> > > > > codetag_lock_module_list(alloc_tag_cttype, true);
>> > > > > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> > > > > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
>> > > > > - node--;
>> > > > > -
>> > > > > - return ct ? priv : NULL;
>> > > > > + if (node == 0) {
>> > > > > + priv->print_header = true;
>> > > > > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> > > > > + codetag_next_ct(&priv->iter);
>> > > > > + }
>> > > >
>> > > > Do you need to skip print header when *pos != 0? i.e add
>> > >
>> > > Technically not needed since proc_create_seq_private() allocates
>> > > seq->private using kzalloc(), so the initial value of
>> > > priv->print_header is always false.
>> >
>> > But we'll start with first call to allocinfo_start() with *pos == 0,
>>
>> Usually but not always if we do lseek() to a non-zero position beforehand.
>
>Actually, this change will break the lseek() case. We can't always
>assume that we start reading from *pos == 0. Current patch will fail
>to initialize priv if we start reading with *pos != 0.
>priv->iter should be tracking current position and allocinfo_start()
>should detect a mismatch between *pos and iter->pos and re-walk the
>tags if there was a position change.
seq_file works line by line; I think even if it supports lseek, seq_file would still start from line #0,
since seq_file has no clue about the byte size of each line.
I will check the code, run some tests and update later.
>
>>
>> > then print_header will be initialized to true.
>>
>> After the first call to allocinfo_show() print_header will be reset
>> back to false.
>>
>> > Will there be subsequent calls of allocinfo_start() with *pos !=0,
>> > but priv->print_header stays at 0?
>>
>> Yes, there will be subsequent calls to allocinfo_start() with *pos !=0
>> and priv->print_header=false, which is what we want, right? We want to
>> print the header only at the beginning of the file (node == 0).
>>
>> >
>> > Tim
>> > >
>> > > >
>> > > > } else {
>> > > > priv->print_header = false;
>> > > > }
>> > > >
>> > > > Tim
>> > > >
>> > > > > + return priv->iter.ct ? priv : NULL;
>> > > > > }
>> > > > >
>> > > >
>> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-09 18:33 ` Tim Chen
2025-05-09 19:36 ` Suren Baghdasaryan
@ 2025-05-10 3:25 ` David Wang
1 sibling, 0 replies; 44+ messages in thread
From: David Wang @ 2025-05-10 3:25 UTC (permalink / raw)
To: Tim Chen; +Cc: surenb, kent.overstreet, akpm, linux-kernel, linux-mm
At 2025-05-10 02:33:48, "Tim Chen" <tim.c.chen@linux.intel.com> wrote:
>On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
>>
>>
>> Signed-off-by: David Wang <00107082@163.com>
>> ---
>> lib/alloc_tag.c | 29 ++++++++++-------------------
>> 1 file changed, 10 insertions(+), 19 deletions(-)
>>
>> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
>> index 25ecc1334b67..fdd5887769a6 100644
>> --- a/lib/alloc_tag.c
>> +++ b/lib/alloc_tag.c
>> @@ -45,21 +45,16 @@ struct allocinfo_private {
>> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
>> {
>> struct allocinfo_private *priv;
>> - struct codetag *ct;
>> loff_t node = *pos;
>>
>> - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> - m->private = priv;
>> - if (!priv)
>> - return NULL;
>> -
>> - priv->print_header = (node == 0);
>> + priv = (struct allocinfo_private *)m->private;
>> codetag_lock_module_list(alloc_tag_cttype, true);
>> - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
>> - node--;
>> -
>> - return ct ? priv : NULL;
>> + if (node == 0) {
>> + priv->print_header = true;
>> + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> + codetag_next_ct(&priv->iter);
>> + }
>
>Do you need to skip print header when *pos != 0? i.e add
>
> } else {
> priv->print_header = false;
> }
>
>Tim
The print_header flag is set back to false once the header is printed, in allocinfo_show():
114 if (priv->print_header) {
115 print_allocinfo_header(&buf);
116 priv->print_header = false;
117 }
Once priv->print_header is set to true, the header will be printed once and only once, until it is
set to true again.
Thanks
David
>
>> + return priv->iter.ct ? priv : NULL;
>> }
>>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-10 3:10 ` David Wang
@ 2025-05-10 3:30 ` Suren Baghdasaryan
2025-05-10 3:58 ` David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-10 3:30 UTC (permalink / raw)
To: David Wang; +Cc: Tim Chen, kent.overstreet, akpm, linux-kernel, linux-mm
On Fri, May 9, 2025 at 8:10 PM David Wang <00107082@163.com> wrote:
>
>
> At 2025-05-10 05:15:43, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >On Fri, May 9, 2025 at 1:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Fri, May 9, 2025 at 12:46 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >> >
> >> > On Fri, 2025-05-09 at 12:36 -0700, Suren Baghdasaryan wrote:
> >> > > On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >> > > >
> >> > > > On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
> >> > > > >
> >> > > > >
> >> > > > > Signed-off-by: David Wang <00107082@163.com>
> >> > >
> >> > > Acked-by: Suren Baghdasaryan <surenb@google.com>
> >> > >
> >> > > > > ---
> >> > > > > lib/alloc_tag.c | 29 ++++++++++-------------------
> >> > > > > 1 file changed, 10 insertions(+), 19 deletions(-)
> >> > > > >
> >> > > > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> >> > > > > index 25ecc1334b67..fdd5887769a6 100644
> >> > > > > --- a/lib/alloc_tag.c
> >> > > > > +++ b/lib/alloc_tag.c
> >> > > > > @@ -45,21 +45,16 @@ struct allocinfo_private {
> >> > > > > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> >> > > > > {
> >> > > > > struct allocinfo_private *priv;
> >> > > > > - struct codetag *ct;
> >> > > > > loff_t node = *pos;
> >> > > > >
> >> > > > > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> >> > > > > - m->private = priv;
> >> > > > > - if (!priv)
> >> > > > > - return NULL;
> >> > > > > -
> >> > > > > - priv->print_header = (node == 0);
> >> > > > > + priv = (struct allocinfo_private *)m->private;
> >> > > > > codetag_lock_module_list(alloc_tag_cttype, true);
> >> > > > > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> >> > > > > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> >> > > > > - node--;
> >> > > > > -
> >> > > > > - return ct ? priv : NULL;
> >> > > > > + if (node == 0) {
> >> > > > > + priv->print_header = true;
> >> > > > > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> >> > > > > + codetag_next_ct(&priv->iter);
> >> > > > > + }
> >> > > >
> >> > > > Do you need to skip print header when *pos != 0? i.e add
> >> > >
> >> > > Technically not needed since proc_create_seq_private() allocates
> >> > > seq->private using kzalloc(), so the initial value of
> >> > > priv->print_header is always false.
> >> >
> >> > But we'll start with first call to allocinfo_start() with *pos == 0,
> >>
> >> Usually but not always if we do lseek() to a non-zero position beforehand.
> >
> >Actually, this change will break the lseek() case. We can't always
> >assume that we start reading from *pos == 0. Current patch will fail
> >to initialize priv if we start reading with *pos != 0.
> >priv->iter should be tracking current position and allocinfo_start()
> >should detect a mismatch between *pos and iter->pos and re-walk the
> >tags if there was a position change.
>
> seq_file works line by line, I think even if it support lseek, seq_file would still start with line #0,
> since seq_file have on clue the byte size for each line.
>
> I will check the code, make some tests and update later.
Ah, yes. You are correct.
seq_lseek() will traverse restarting from 0:
https://elixir.bootlin.com/linux/v6.14.6/source/fs/seq_file.c#L323.
Position jumps are similarly handled with traversal from 0:
https://elixir.bootlin.com/linux/v6.14.6/source/fs/seq_file.c#L194.
>
>
> >
> >>
> >> > then print_header will be initialized to true.
> >>
> >> After the first call to allocinfo_show() print_header will be reset
> >> back to false.
> >>
> >> > Will there be subsequent calls of allocinfo_start() with *pos !=0,
> >> > but priv->print_header stays at 0?
> >>
> >> Yes, there will be subsequent calls to allocinfo_start() with *pos !=0
> >> and priv->print_header=false, which is what we want, right? We want to
> >> print the header only at the beginning of the file (node == 0).
> >>
> >> >
> >> > Tim
> >> > >
> >> > > >
> >> > > > } else {
> >> > > > priv->print_header = false;
> >> > > > }
> >> > > >
> >> > > > Tim
> >> > > >
> >> > > > > + return priv->iter.ct ? priv : NULL;
> >> > > > > }
> >> > > > >
> >> > > >
> >> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-09 19:46 ` Tim Chen
2025-05-09 20:46 ` Suren Baghdasaryan
@ 2025-05-10 3:35 ` David Wang
1 sibling, 0 replies; 44+ messages in thread
From: David Wang @ 2025-05-10 3:35 UTC (permalink / raw)
To: Tim Chen
Cc: Suren Baghdasaryan, kent.overstreet, akpm, linux-kernel, linux-mm
At 2025-05-10 03:46:06, "Tim Chen" <tim.c.chen@linux.intel.com> wrote:
>On Fri, 2025-05-09 at 12:36 -0700, Suren Baghdasaryan wrote:
>> On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> >
>> > On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
>> > >
>> > >
>> > > Signed-off-by: David Wang <00107082@163.com>
>>
>> Acked-by: Suren Baghdasaryan <surenb@google.com>
>>
>> > > ---
>> > > lib/alloc_tag.c | 29 ++++++++++-------------------
>> > > 1 file changed, 10 insertions(+), 19 deletions(-)
>> > >
>> > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
>> > > index 25ecc1334b67..fdd5887769a6 100644
>> > > --- a/lib/alloc_tag.c
>> > > +++ b/lib/alloc_tag.c
>> > > @@ -45,21 +45,16 @@ struct allocinfo_private {
>> > > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
>> > > {
>> > > struct allocinfo_private *priv;
>> > > - struct codetag *ct;
>> > > loff_t node = *pos;
>> > >
>> > > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> > > - m->private = priv;
>> > > - if (!priv)
>> > > - return NULL;
>> > > -
>> > > - priv->print_header = (node == 0);
>> > > + priv = (struct allocinfo_private *)m->private;
>> > > codetag_lock_module_list(alloc_tag_cttype, true);
>> > > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> > > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
>> > > - node--;
>> > > -
>> > > - return ct ? priv : NULL;
>> > > + if (node == 0) {
>> > > + priv->print_header = true;
>> > > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> > > + codetag_next_ct(&priv->iter);
>> > > + }
>> >
>> > Do you need to skip print header when *pos != 0? i.e add
>>
>> Technically not needed since proc_create_seq_private() allocates
>> seq->private using kzalloc(), so the initial value of
>> priv->print_header is always false.
>
>But we'll start with first call to allocinfo_start() with *pos == 0,
>then print_header will be initialized to true.
>Will there be subsequent calls of allocinfo_start() with *pos !=0,
>but priv->print_header stays at 0?
The header only needs to be printed at line #0. A normal seq_file
sequence looks like:
start(0)
show(0) <--- print header here
show(1)
stop()
start(2) <--- no need to print header
show(2)
show(3)
...
If start(0)/show(0) happened twice, two headers would be printed; and if
start(0) never happened, no header would be printed at all. If either of
those scenarios happens, I think it is a bug in seq_file.
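To make that concrete, here is a minimal, self-contained sketch (not the
alloc_tag code itself; all demo_* names are made up for illustration) of a
seq_file that keeps its cursor in private data allocated at open() time and
arms the header only when start() is entered with *pos == 0:

	#include <linux/module.h>
	#include <linux/proc_fs.h>
	#include <linux/seq_file.h>

	struct demo_private {
		bool print_header;
		loff_t cur;		/* current line, kept across read() calls */
	};

	static void *demo_start(struct seq_file *m, loff_t *pos)
	{
		struct demo_private *priv = m->private;

		if (*pos == 0) {	/* start(0): rewind and arm the header */
			priv->print_header = true;
			priv->cur = 0;
		}
		return priv->cur < 10 ? priv : NULL;
	}

	static void *demo_next(struct seq_file *m, void *v, loff_t *pos)
	{
		struct demo_private *priv = v;

		priv->cur++;
		(*pos)++;
		return priv->cur < 10 ? priv : NULL;
	}

	static void demo_stop(struct seq_file *m, void *v)
	{
	}

	static int demo_show(struct seq_file *m, void *v)
	{
		struct demo_private *priv = v;

		if (priv->print_header) {	/* only for the pass that saw *pos == 0 */
			seq_puts(m, "# header\n");
			priv->print_header = false;
		}
		seq_printf(m, "line %lld\n", (long long)priv->cur);
		return 0;
	}

	static const struct seq_operations demo_seq_ops = {
		.start	= demo_start,
		.next	= demo_next,
		.stop	= demo_stop,
		.show	= demo_show,
	};

	static int __init demo_init(void)
	{
		/* private state is kzalloc'ed once per open() */
		if (!proc_create_seq_private("demo_allocinfo", 0400, NULL,
					     &demo_seq_ops,
					     sizeof(struct demo_private), NULL))
			return -ENOMEM;
		return 0;
	}
	module_init(demo_init);

	static void __exit demo_exit(void)
	{
		remove_proc_entry("demo_allocinfo", NULL);
	}
	module_exit(demo_exit);
	MODULE_LICENSE("GPL");

With this shape, reading /proc/demo_allocinfo prints the header exactly once,
and a later start(*pos != 0) simply resumes from priv->cur.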
Thanks
David
>
>Tim
>>
>> >
>> > } else {
>> > priv->print_header = false;
>> > }
>> >
>> > Tim
>> >
>> > > + return priv->iter.ct ? priv : NULL;
>> > > }
>> > >
>> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-10 3:30 ` Suren Baghdasaryan
@ 2025-05-10 3:58 ` David Wang
2025-05-10 4:03 ` Suren Baghdasaryan
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-05-10 3:58 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: Tim Chen, kent.overstreet, akpm, linux-kernel, linux-mm
At 2025-05-10 11:30:50, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Fri, May 9, 2025 at 8:10 PM David Wang <00107082@163.com> wrote:
>>
>>
>> At 2025-05-10 05:15:43, "Suren Baghdasaryan" <surenb@google.com> wrote:
>> >On Fri, May 9, 2025 at 1:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
>> >>
>> >> On Fri, May 9, 2025 at 12:46 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> >> >
>> >> > On Fri, 2025-05-09 at 12:36 -0700, Suren Baghdasaryan wrote:
>> >> > > On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>> >> > > >
>> >> > > > On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
>> >> > > > >
>> >> > > > >
>> >> > > > > Signed-off-by: David Wang <00107082@163.com>
>> >> > >
>> >> > > Acked-by: Suren Baghdasaryan <surenb@google.com>
>> >> > >
>> >> > > > > ---
>> >> > > > > lib/alloc_tag.c | 29 ++++++++++-------------------
>> >> > > > > 1 file changed, 10 insertions(+), 19 deletions(-)
>> >> > > > >
>> >> > > > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
>> >> > > > > index 25ecc1334b67..fdd5887769a6 100644
>> >> > > > > --- a/lib/alloc_tag.c
>> >> > > > > +++ b/lib/alloc_tag.c
>> >> > > > > @@ -45,21 +45,16 @@ struct allocinfo_private {
>> >> > > > > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
>> >> > > > > {
>> >> > > > > struct allocinfo_private *priv;
>> >> > > > > - struct codetag *ct;
>> >> > > > > loff_t node = *pos;
>> >> > > > >
>> >> > > > > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
>> >> > > > > - m->private = priv;
>> >> > > > > - if (!priv)
>> >> > > > > - return NULL;
>> >> > > > > -
>> >> > > > > - priv->print_header = (node == 0);
>> >> > > > > + priv = (struct allocinfo_private *)m->private;
>> >> > > > > codetag_lock_module_list(alloc_tag_cttype, true);
>> >> > > > > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> >> > > > > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
>> >> > > > > - node--;
>> >> > > > > -
>> >> > > > > - return ct ? priv : NULL;
>> >> > > > > + if (node == 0) {
>> >> > > > > + priv->print_header = true;
>> >> > > > > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
>> >> > > > > + codetag_next_ct(&priv->iter);
>> >> > > > > + }
>> >> > > >
>> >> > > > Do you need to skip print header when *pos != 0? i.e add
>> >> > >
>> >> > > Technically not needed since proc_create_seq_private() allocates
>> >> > > seq->private using kzalloc(), so the initial value of
>> >> > > priv->print_header is always false.
>> >> >
>> >> > But we'll start with first call to allocinfo_start() with *pos == 0,
>> >>
>> >> Usually but not always if we do lseek() to a non-zero position beforehand.
>> >
>> >Actually, this change will break the lseek() case. We can't always
>> >assume that we start reading from *pos == 0. Current patch will fail
>> >to initialize priv if we start reading with *pos != 0.
>> >priv->iter should be tracking current position and allocinfo_start()
>> >should detect a mismatch between *pos and iter->pos and re-walk the
>> >tags if there was a position change.
>>
>> seq_file works line by line, I think even if it support lseek, seq_file would still start with line #0,
>> since seq_file have on clue the byte size for each line.
>>
>> I will check the code, make some tests and update later.
>
>Ah, yes. You are correct.
>seq_lseek() will traverse restarting from 0:
>https://elixir.bootlin.com/linux/v6.14.6/source/fs/seq_file.c#L323.
>Position jumps are similarly handled with traversal from 0:
>https://elixir.bootlin.com/linux/v6.14.6/source/fs/seq_file.c#L194.
>
Actually I was expecting EOPNOTSUPP when lseek'ing on seq files, surprised to see it works... :)
If seq_file somehow skipped start(0), then nothing would be displayed,
since priv->iter.ct would be NULL and `return priv->iter.ct ? priv : NULL;`
would return NULL. But I think that case would be a seq_file bug;
starting at 0 is a kind of protocol promised by seq_file.
>>
>>
>> >
>> >>
>> >> > then print_header will be initialized to true.
>> >>
>> >> After the first call to allocinfo_show() print_header will be reset
>> >> back to false.
>> >>
>> >> > Will there be subsequent calls of allocinfo_start() with *pos !=0,
>> >> > but priv->print_header stays at 0?
>> >>
>> >> Yes, there will be subsequent calls to allocinfo_start() with *pos !=0
>> >> and priv->print_header=false, which is what we want, right? We want to
>> >> print the header only at the beginning of the file (node == 0).
>> >>
>> >> >
>> >> > Tim
>> >> > >
>> >> > > >
>> >> > > > } else {
>> >> > > > priv->print_header = false;
>> >> > > > }
>> >> > > >
>> >> > > > Tim
>> >> > > >
>> >> > > > > + return priv->iter.ct ? priv : NULL;
>> >> > > > > }
>> >> > > > >
>> >> > > >
>> >> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls
2025-05-10 3:58 ` David Wang
@ 2025-05-10 4:03 ` Suren Baghdasaryan
2025-06-09 6:44 ` [PATCH v3 2/2] alloc_tag: keep codetag iterator active between read() David Wang
0 siblings, 1 reply; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-05-10 4:03 UTC (permalink / raw)
To: David Wang; +Cc: Tim Chen, kent.overstreet, akpm, linux-kernel, linux-mm
On Fri, May 9, 2025 at 8:58 PM David Wang <00107082@163.com> wrote:
>
>
>
> At 2025-05-10 11:30:50, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >On Fri, May 9, 2025 at 8:10 PM David Wang <00107082@163.com> wrote:
> >>
> >>
> >> At 2025-05-10 05:15:43, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >> >On Fri, May 9, 2025 at 1:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >> >>
> >> >> On Fri, May 9, 2025 at 12:46 PM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >> >> >
> >> >> > On Fri, 2025-05-09 at 12:36 -0700, Suren Baghdasaryan wrote:
> >> >> > > On Fri, May 9, 2025 at 11:33 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >> >> > > >
> >> >> > > > On Sat, 2025-05-10 at 01:39 +0800, David Wang wrote:
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > Signed-off-by: David Wang <00107082@163.com>
> >> >> > >
> >> >> > > Acked-by: Suren Baghdasaryan <surenb@google.com>
> >> >> > >
> >> >> > > > > ---
> >> >> > > > > lib/alloc_tag.c | 29 ++++++++++-------------------
> >> >> > > > > 1 file changed, 10 insertions(+), 19 deletions(-)
> >> >> > > > >
> >> >> > > > > diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> >> >> > > > > index 25ecc1334b67..fdd5887769a6 100644
> >> >> > > > > --- a/lib/alloc_tag.c
> >> >> > > > > +++ b/lib/alloc_tag.c
> >> >> > > > > @@ -45,21 +45,16 @@ struct allocinfo_private {
> >> >> > > > > static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> >> >> > > > > {
> >> >> > > > > struct allocinfo_private *priv;
> >> >> > > > > - struct codetag *ct;
> >> >> > > > > loff_t node = *pos;
> >> >> > > > >
> >> >> > > > > - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> >> >> > > > > - m->private = priv;
> >> >> > > > > - if (!priv)
> >> >> > > > > - return NULL;
> >> >> > > > > -
> >> >> > > > > - priv->print_header = (node == 0);
> >> >> > > > > + priv = (struct allocinfo_private *)m->private;
> >> >> > > > > codetag_lock_module_list(alloc_tag_cttype, true);
> >> >> > > > > - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> >> >> > > > > - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> >> >> > > > > - node--;
> >> >> > > > > -
> >> >> > > > > - return ct ? priv : NULL;
> >> >> > > > > + if (node == 0) {
> >> >> > > > > + priv->print_header = true;
> >> >> > > > > + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> >> >> > > > > + codetag_next_ct(&priv->iter);
> >> >> > > > > + }
> >> >> > > >
> >> >> > > > Do you need to skip print header when *pos != 0? i.e add
> >> >> > >
> >> >> > > Technically not needed since proc_create_seq_private() allocates
> >> >> > > seq->private using kzalloc(), so the initial value of
> >> >> > > priv->print_header is always false.
> >> >> >
> >> >> > But we'll start with first call to allocinfo_start() with *pos == 0,
> >> >>
> >> >> Usually but not always if we do lseek() to a non-zero position beforehand.
> >> >
> >> >Actually, this change will break the lseek() case. We can't always
> >> >assume that we start reading from *pos == 0. Current patch will fail
> >> >to initialize priv if we start reading with *pos != 0.
> >> >priv->iter should be tracking current position and allocinfo_start()
> >> >should detect a mismatch between *pos and iter->pos and re-walk the
> >> >tags if there was a position change.
> >>
> >> seq_file works line by line, I think even if it support lseek, seq_file would still start with line #0,
> >> since seq_file have on clue the byte size for each line.
> >>
> >> I will check the code, make some tests and update later.
> >
> >Ah, yes. You are correct.
> >seq_lseek() will traverse restarting from 0:
> >https://elixir.bootlin.com/linux/v6.14.6/source/fs/seq_file.c#L323.
> >Position jumps are similarly handled with traversal from 0:
> >https://elixir.bootlin.com/linux/v6.14.6/source/fs/seq_file.c#L194.
> >
>
> Actually I was expecting EOPNOTSUPP when lseek on seq files, surprised to see it works...... :)
>
> If seq_file somehow skips start(0), then nothing would be displayed since
> priv->iter.ct would be 0 and `return priv->iter.ct ? priv : NULL;` would return NULL;
> But I think that case would be seq_file's bug, starting with 0 is kind of protocol promised by seq_file.
Yes, I think that is the correct assumption. Even pread() with a jump
in position should work the same way, restarting at 0.
>
>
> >>
>
> >>
> >> >
> >> >>
> >> >> > then print_header will be initialized to true.
> >> >>
> >> >> After the first call to allocinfo_show() print_header will be reset
> >> >> back to false.
> >> >>
> >> >> > Will there be subsequent calls of allocinfo_start() with *pos !=0,
> >> >> > but priv->print_header stays at 0?
> >> >>
> >> >> Yes, there will be subsequent calls to allocinfo_start() with *pos !=0
> >> >> and priv->print_header=false, which is what we want, right? We want to
> >> >> print the header only at the beginning of the file (node == 0).
> >> >>
> >> >> >
> >> >> > Tim
> >> >> > >
> >> >> > > >
> >> >> > > > } else {
> >> >> > > > priv->print_header = false;
> >> >> > > > }
> >> >> > > >
> >> >> > > > Tim
> >> >> > > >
> >> >> > > > > + return priv->iter.ct ? priv : NULL;
> >> >> > > > > }
> >> >> > > > >
> >> >> > > >
> >> >> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH v3 1/2] alloc_tag: add sequence number for module and iterator
2025-05-09 19:25 ` Suren Baghdasaryan
@ 2025-06-09 6:42 ` David Wang
2025-06-09 16:41 ` Suren Baghdasaryan
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-06-09 6:42 UTC (permalink / raw)
To: surenb; +Cc: kent.overstreet, akpm, linux-kernel, David Wang
Codetag iterators use an <id, address> pair to guarantee validity. But
both the id and the address can be reused: there is a theoretical
possibility that, when a module is inserted right after another module
is removed, kmalloc returns the same address that was just kfreed by the
previous module and the IDR hands out the recently removed key again.
Add a sequence number to codetag_module and codetag_iterator; the
sequence number is strictly incremented whenever a module is loaded. An
iterator is valid if and only if its sequence number matches the
codetag_module's.
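The hazard and the check can be illustrated with a toy userspace sketch
(made-up names, not the kernel code): two loads may end up with the same
IDR id and the same kmalloc'ed address, but never the same sequence number.

	#include <stdio.h>
	#include <stdbool.h>

	struct mod_slot {
		int id;			/* stands in for the IDR key */
		void *addr;		/* stands in for the kmalloc'ed range */
		unsigned long seq;	/* bumped on every module load */
	};

	struct iter {
		int id;
		void *addr;
		unsigned long seq;
	};

	static bool iter_still_valid(const struct iter *it, const struct mod_slot *slot)
	{
		/* comparing only id/addr would wrongly accept a reused slot */
		return it->seq == slot->seq;
	}

	int main(void)
	{
		static char range[16];		/* reused "allocation" */
		unsigned long next_seq = 0;

		struct mod_slot a = { .id = 1, .addr = range, .seq = ++next_seq };
		struct iter it = { .id = a.id, .addr = a.addr, .seq = a.seq };

		/* module unloaded, a new one loaded: id and address both reused */
		struct mod_slot b = { .id = 1, .addr = range, .seq = ++next_seq };

		printf("id/addr match: %d, iterator still valid: %d\n",
		       it.id == b.id && it.addr == b.addr,
		       iter_still_valid(&it, &b));
		return 0;
	}

which prints "id/addr match: 1, iterator still valid: 0".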
Signed-off-by: David Wang <00107082@163.com>
---
Changes since v2:
Rebase to 6.16-rc1
---
include/linux/codetag.h | 1 +
lib/codetag.c | 17 ++++++++++++++---
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/include/linux/codetag.h b/include/linux/codetag.h
index 5f2b9a1f722c..457ed8fd3214 100644
--- a/include/linux/codetag.h
+++ b/include/linux/codetag.h
@@ -54,6 +54,7 @@ struct codetag_iterator {
struct codetag_module *cmod;
unsigned long mod_id;
struct codetag *ct;
+ unsigned long mod_seq;
};
#ifdef MODULE
diff --git a/lib/codetag.c b/lib/codetag.c
index 650d54d7e14d..545911cebd25 100644
--- a/lib/codetag.c
+++ b/lib/codetag.c
@@ -11,8 +11,14 @@ struct codetag_type {
struct list_head link;
unsigned int count;
struct idr mod_idr;
- struct rw_semaphore mod_lock; /* protects mod_idr */
+ /*
+ * protects mod_idr, next_mod_seq,
+ * iter->mod_seq and cmod->mod_seq
+ */
+ struct rw_semaphore mod_lock;
struct codetag_type_desc desc;
+ /* generates unique sequence number for module load */
+ unsigned long next_mod_seq;
};
struct codetag_range {
@@ -23,6 +29,7 @@ struct codetag_range {
struct codetag_module {
struct module *mod;
struct codetag_range range;
+ unsigned long mod_seq;
};
static DEFINE_MUTEX(codetag_lock);
@@ -48,6 +55,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
.cmod = NULL,
.mod_id = 0,
.ct = NULL,
+ .mod_seq = 0,
};
return iter;
@@ -91,11 +99,13 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
if (!cmod)
break;
- if (cmod != iter->cmod) {
+ if (!iter->cmod || iter->mod_seq != cmod->mod_seq) {
iter->cmod = cmod;
+ iter->mod_seq = cmod->mod_seq;
ct = get_first_module_ct(cmod);
- } else
+ } else {
ct = get_next_module_ct(iter);
+ }
if (ct)
break;
@@ -191,6 +201,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
cmod->range = range;
down_write(&cttype->mod_lock);
+ cmod->mod_seq = ++cttype->next_mod_seq;
mod_id = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
if (mod_id >= 0) {
if (cttype->desc.module_load) {
--
2.39.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v3 2/2] alloc_tag: keep codetag iterator active between read()
2025-05-10 4:03 ` Suren Baghdasaryan
@ 2025-06-09 6:44 ` David Wang
2025-06-10 16:22 ` Suren Baghdasaryan
0 siblings, 1 reply; 44+ messages in thread
From: David Wang @ 2025-06-09 6:44 UTC (permalink / raw)
To: surenb; +Cc: tim.c.chen, kent.overstreet, akpm, linux-kernel, David Wang
When reading /proc/allocinfo, seq_file invokes the start/stop callbacks
for each read syscall. In the start callback, memory is allocated to
store an iterator, and the iterator walks linearly from the beginning to
the current read position.
Each seq_file read() returns at most 4096 bytes, even when called with a
larger user-space buffer, so reading all of /proc/allocinfo takes tens of
read syscalls. For example, a 306036-byte allocinfo file needs 76 reads:
$ sudo cat /proc/allocinfo | wc
3964 16678 306036
$ sudo strace -T -e read cat /proc/allocinfo
...
read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
...
read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
...
For those n=3964 lines, each read covers about m=3964/76=52 lines.
Since the iterator restarts from the beginning on each read(), it moves
forward
m steps on 1st read
2*m steps on 2nd read
3*m steps on 3rd read
...
n steps on last read
As reading proceeds, those linear seek steps make each read() call slower
than the one before. Adding them up, the codetag iterator moves about
O(n*n/m) steps in total, so data-structure traversal takes a significant
part of the whole read.
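As a quick back-of-the-envelope check (a throwaway userspace sketch using
the numbers above, not kernel code):

	#include <stdio.h>

	int main(void)
	{
		long n = 3964, reads = 76, m = n / reads;	/* ~52 lines per read */
		long rewind_steps = 0;

		for (long k = 1; k <= reads; k++)
			rewind_steps += k * m;			/* k*m steps on the k-th read */

		printf("steps with per-read rewind:     %ld\n", rewind_steps);
		printf("steps with persistent iterator: %ld\n", n);
		return 0;
	}

which reports roughly 150000 iterator steps with the per-read rewind versus
3964 with a persistent iterator.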
Profiling while stress-reading /proc/allocinfo confirms it:
vfs_read(99.959% 1677299/1677995)
proc_reg_read_iter(99.856% 1674881/1677299)
seq_read_iter(99.959% 1674191/1674881)
allocinfo_start(75.664% 1266755/1674191)
codetag_next_ct(79.217% 1003487/1266755) <---
srso_return_thunk(1.264% 16011/1266755)
__kmalloc_cache_noprof(0.102% 1296/1266755)
...
allocinfo_show(21.287% 356378/1674191)
allocinfo_next(1.530% 25621/1674191)
codetag_next_ct() takes the major part.
Private data allocated at open() time can be used to keep the iterator
alive across read() calls, avoiding the memory allocation and iterator
reset on each read(). This way, only O(1) memory allocations and O(n)
iteration steps are needed, and `time` shows the read time improving
from ~7ms to ~4ms.
Profiling with the change:
vfs_read(99.865% 1581073/1583214)
proc_reg_read_iter(99.485% 1572934/1581073)
seq_read_iter(99.846% 1570519/1572934)
allocinfo_show(87.428% 1373074/1570519)
seq_buf_printf(83.695% 1149196/1373074)
seq_buf_putc(1.917% 26321/1373074)
_find_next_bit(1.531% 21023/1373074)
...
codetag_to_text(0.490% 6727/1373074)
...
allocinfo_next(6.275% 98543/1570519)
...
allocinfo_start(0.369% 5790/1570519)
...
Now seq_buf_printf() takes the major part.
Signed-off-by: David Wang <00107082@163.com>
---
Changes since v2:
Rebase to 6.16-rc1, resolve conflicts.
---
lib/alloc_tag.c | 29 ++++++++++-------------------
1 file changed, 10 insertions(+), 19 deletions(-)
diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index d48b80f3f007..ac7e50b41b6a 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -45,21 +45,16 @@ struct allocinfo_private {
static void *allocinfo_start(struct seq_file *m, loff_t *pos)
{
struct allocinfo_private *priv;
- struct codetag *ct;
loff_t node = *pos;
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
- m->private = priv;
- if (!priv)
- return NULL;
-
- priv->print_header = (node == 0);
+ priv = (struct allocinfo_private *)m->private;
codetag_lock_module_list(alloc_tag_cttype, true);
- priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
- while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
- node--;
-
- return ct ? priv : NULL;
+ if (node == 0) {
+ priv->print_header = true;
+ priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
+ codetag_next_ct(&priv->iter);
+ }
+ return priv->iter.ct ? priv : NULL;
}
static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
@@ -76,12 +71,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
static void allocinfo_stop(struct seq_file *m, void *arg)
{
- struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
-
- if (priv) {
- codetag_lock_module_list(alloc_tag_cttype, false);
- kfree(priv);
- }
+ codetag_lock_module_list(alloc_tag_cttype, false);
}
static void print_allocinfo_header(struct seq_buf *buf)
@@ -811,7 +801,8 @@ static int __init alloc_tag_init(void)
return 0;
}
- if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
+ if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
+ sizeof(struct allocinfo_private), NULL)) {
pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
shutdown_mem_profiling(false);
return -ENOMEM;
--
2.39.2
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH v3 1/2] alloc_tag: add sequence number for module and iterator
2025-06-09 6:42 ` [PATCH v3 " David Wang
@ 2025-06-09 16:41 ` Suren Baghdasaryan
0 siblings, 0 replies; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-06-09 16:41 UTC (permalink / raw)
To: David Wang; +Cc: kent.overstreet, akpm, linux-kernel
On Sun, Jun 8, 2025 at 11:42 PM David Wang <00107082@163.com> wrote:
>
> Codetag iterator use <id,address> pair to guarantee the
> validness. But both id and address can be reused, there is
> theoretical possibility when module inserted right after
> another module removed, kmalloc returns an address same as
> the address kfree by previous module and IDR key reuses
> the key recently removed.
>
> Add a sequence number to codetag_module and code_iterator,
> the sequence number is strickly incremented whenever a module
> is loaded. An iterator is valid if and only if its sequence
> number match codetag_module's.
>
> Signed-off-by: David Wang <00107082@163.com>
> ---
> Changes since v2:
> Rebase to 6.16-rc1
If all you did was rebase, then please add my Acked-by. Thanks!
> ---
> include/linux/codetag.h | 1 +
> lib/codetag.c | 17 ++++++++++++++---
> 2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/codetag.h b/include/linux/codetag.h
> index 5f2b9a1f722c..457ed8fd3214 100644
> --- a/include/linux/codetag.h
> +++ b/include/linux/codetag.h
> @@ -54,6 +54,7 @@ struct codetag_iterator {
> struct codetag_module *cmod;
> unsigned long mod_id;
> struct codetag *ct;
> + unsigned long mod_seq;
> };
>
> #ifdef MODULE
> diff --git a/lib/codetag.c b/lib/codetag.c
> index 650d54d7e14d..545911cebd25 100644
> --- a/lib/codetag.c
> +++ b/lib/codetag.c
> @@ -11,8 +11,14 @@ struct codetag_type {
> struct list_head link;
> unsigned int count;
> struct idr mod_idr;
> - struct rw_semaphore mod_lock; /* protects mod_idr */
> + /*
> + * protects mod_idr, next_mod_seq,
> + * iter->mod_seq and cmod->mod_seq
> + */
> + struct rw_semaphore mod_lock;
> struct codetag_type_desc desc;
> + /* generates unique sequence number for module load */
> + unsigned long next_mod_seq;
> };
>
> struct codetag_range {
> @@ -23,6 +29,7 @@ struct codetag_range {
> struct codetag_module {
> struct module *mod;
> struct codetag_range range;
> + unsigned long mod_seq;
> };
>
> static DEFINE_MUTEX(codetag_lock);
> @@ -48,6 +55,7 @@ struct codetag_iterator codetag_get_ct_iter(struct codetag_type *cttype)
> .cmod = NULL,
> .mod_id = 0,
> .ct = NULL,
> + .mod_seq = 0,
> };
>
> return iter;
> @@ -91,11 +99,13 @@ struct codetag *codetag_next_ct(struct codetag_iterator *iter)
> if (!cmod)
> break;
>
> - if (cmod != iter->cmod) {
> + if (!iter->cmod || iter->mod_seq != cmod->mod_seq) {
> iter->cmod = cmod;
> + iter->mod_seq = cmod->mod_seq;
> ct = get_first_module_ct(cmod);
> - } else
> + } else {
> ct = get_next_module_ct(iter);
> + }
>
> if (ct)
> break;
> @@ -191,6 +201,7 @@ static int codetag_module_init(struct codetag_type *cttype, struct module *mod)
> cmod->range = range;
>
> down_write(&cttype->mod_lock);
> + cmod->mod_seq = ++cttype->next_mod_seq;
> mod_id = idr_alloc(&cttype->mod_idr, cmod, 0, 0, GFP_KERNEL);
> if (mod_id >= 0) {
> if (cttype->desc.module_load) {
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v3 2/2] alloc_tag: keep codetag iterator active between read()
2025-06-09 6:44 ` [PATCH v3 2/2] alloc_tag: keep codetag iterator active between read() David Wang
@ 2025-06-10 16:22 ` Suren Baghdasaryan
0 siblings, 0 replies; 44+ messages in thread
From: Suren Baghdasaryan @ 2025-06-10 16:22 UTC (permalink / raw)
To: David Wang; +Cc: tim.c.chen, kent.overstreet, akpm, linux-kernel
On Sun, Jun 8, 2025 at 11:44 PM David Wang <00107082@163.com> wrote:
>
> When reading /proc/allocinfo, for each read syscall, seq_file would
> invoke start/stop callbacks. In start callback, a memory is alloced
> to store iterator and the iterator would start from beginning to
> walk linearly to current read position.
>
> seq_file read() takes at most 4096 bytes, even if read with a larger
> user space buffer, meaning read out all of /proc/allocinfo, tens of read
> syscalls are needed. For example, a 306036 bytes allocinfo files need
> 76 reads:
>
> $ sudo cat /proc/allocinfo | wc
> 3964 16678 306036
> $ sudo strace -T -e read cat /proc/allocinfo
> ...
> read(3, " 4096 1 arch/x86/k"..., 131072) = 4063 <0.000062>
> ...
> read(3, " 0 0 sound/core"..., 131072) = 4021 <0.000150>
> ...
> For those n=3964 lines, each read takes about m=3964/76=52 lines,
> since iterator restart from beginning for each read(),
> it would move forward
> m steps on 1st read
> 2*m steps on 2nd read
> 3*m steps on 3rd read
> ...
> n steps on last read
> As read() along, those linear seek steps make read() calls slower and
> slower. Adding those up, codetag iterator moves about O(n*n/m) steps,
> making data structure traversal take significant part of the whole reading.
> Profiling when stress reading /proc/allocinfo confirms it:
>
> vfs_read(99.959% 1677299/1677995)
> proc_reg_read_iter(99.856% 1674881/1677299)
> seq_read_iter(99.959% 1674191/1674881)
> allocinfo_start(75.664% 1266755/1674191)
> codetag_next_ct(79.217% 1003487/1266755) <---
> srso_return_thunk(1.264% 16011/1266755)
> __kmalloc_cache_noprof(0.102% 1296/1266755)
> ...
> allocinfo_show(21.287% 356378/1674191)
> allocinfo_next(1.530% 25621/1674191)
> codetag_next_ct() takes major part.
>
> A private data alloced at open() time can be used to carry iterator
> alive across read() calls, and avoid the memory allocation and
> iterator reset for each read(). This way, only O(1) memory allocation
> and O(n) steps iterating, and `time` shows performance improvement
> from ~7ms to ~4ms.
> Profiling with the change:
>
> vfs_read(99.865% 1581073/1583214)
> proc_reg_read_iter(99.485% 1572934/1581073)
> seq_read_iter(99.846% 1570519/1572934)
> allocinfo_show(87.428% 1373074/1570519)
> seq_buf_printf(83.695% 1149196/1373074)
> seq_buf_putc(1.917% 26321/1373074)
> _find_next_bit(1.531% 21023/1373074)
> ...
> codetag_to_text(0.490% 6727/1373074)
> ...
> allocinfo_next(6.275% 98543/1570519)
> ...
> allocinfo_start(0.369% 5790/1570519)
> ...
> Now seq_buf_printf() takes major part.
>
> Signed-off-by: David Wang <00107082@163.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
> ---
> Changes since v2:
> Rebase to 6.16-rc1, resolve conflicts.
> ---
> lib/alloc_tag.c | 29 ++++++++++-------------------
> 1 file changed, 10 insertions(+), 19 deletions(-)
>
> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> index d48b80f3f007..ac7e50b41b6a 100644
> --- a/lib/alloc_tag.c
> +++ b/lib/alloc_tag.c
> @@ -45,21 +45,16 @@ struct allocinfo_private {
> static void *allocinfo_start(struct seq_file *m, loff_t *pos)
> {
> struct allocinfo_private *priv;
> - struct codetag *ct;
> loff_t node = *pos;
>
> - priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> - m->private = priv;
> - if (!priv)
> - return NULL;
> -
> - priv->print_header = (node == 0);
> + priv = (struct allocinfo_private *)m->private;
> codetag_lock_module_list(alloc_tag_cttype, true);
> - priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> - while ((ct = codetag_next_ct(&priv->iter)) != NULL && node)
> - node--;
> -
> - return ct ? priv : NULL;
> + if (node == 0) {
> + priv->print_header = true;
> + priv->iter = codetag_get_ct_iter(alloc_tag_cttype);
> + codetag_next_ct(&priv->iter);
> + }
> + return priv->iter.ct ? priv : NULL;
> }
>
> static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
> @@ -76,12 +71,7 @@ static void *allocinfo_next(struct seq_file *m, void *arg, loff_t *pos)
>
> static void allocinfo_stop(struct seq_file *m, void *arg)
> {
> - struct allocinfo_private *priv = (struct allocinfo_private *)m->private;
> -
> - if (priv) {
> - codetag_lock_module_list(alloc_tag_cttype, false);
> - kfree(priv);
> - }
> + codetag_lock_module_list(alloc_tag_cttype, false);
> }
>
> static void print_allocinfo_header(struct seq_buf *buf)
> @@ -811,7 +801,8 @@ static int __init alloc_tag_init(void)
> return 0;
> }
>
> - if (!proc_create_seq(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op)) {
> + if (!proc_create_seq_private(ALLOCINFO_FILE_NAME, 0400, NULL, &allocinfo_seq_op,
> + sizeof(struct allocinfo_private), NULL)) {
> pr_err("Failed to create %s file\n", ALLOCINFO_FILE_NAME);
> shutdown_mem_profiling(false);
> return -ENOMEM;
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2025-06-10 16:22 UTC | newest]
Thread overview: 44+ messages
2025-05-07 17:55 [PATCH] alloc_tag: avoid mem alloc and iter reset when reading allocinfo David Wang
2025-05-07 18:19 ` David Wang
2025-05-07 23:42 ` [PATCH] " Suren Baghdasaryan
2025-05-08 0:01 ` Kent Overstreet
2025-05-08 3:06 ` David Wang
2025-05-08 3:31 ` Kent Overstreet
2025-05-08 3:35 ` David Wang
2025-05-08 4:07 ` Kent Overstreet
2025-05-08 5:51 ` David Wang
2025-05-08 13:33 ` Kent Overstreet
2025-05-08 16:24 ` David Wang
2025-05-08 16:34 ` Kent Overstreet
2025-05-08 16:58 ` David Wang
2025-05-08 17:17 ` David Wang
2025-05-08 17:26 ` Kent Overstreet
2025-05-08 2:24 ` David Wang
2025-05-07 23:36 ` Suren Baghdasaryan
2025-05-08 3:10 ` David Wang
2025-05-08 15:32 ` David Wang
2025-05-08 21:41 ` Suren Baghdasaryan
2025-05-09 5:51 ` [PATCH 1/2] alloc_tag: add timestamp to codetag iterator David Wang
2025-05-09 16:10 ` Suren Baghdasaryan
2025-05-09 16:16 ` David Wang
2025-05-09 5:53 ` [PATCH 2/2] alloc_tag: keep codetag iterator cross read() calls David Wang
2025-05-09 17:34 ` Suren Baghdasaryan
2025-05-09 17:45 ` David Wang
2025-05-09 17:38 ` [PATCH v2 1/2] alloc_tag: add sequence number for module and iterator David Wang
2025-05-09 19:25 ` Suren Baghdasaryan
2025-06-09 6:42 ` [PATCH v3 " David Wang
2025-06-09 16:41 ` Suren Baghdasaryan
2025-05-09 17:39 ` [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls David Wang
2025-05-09 18:33 ` Tim Chen
2025-05-09 19:36 ` Suren Baghdasaryan
2025-05-09 19:46 ` Tim Chen
2025-05-09 20:46 ` Suren Baghdasaryan
2025-05-09 21:15 ` Suren Baghdasaryan
2025-05-10 3:10 ` David Wang
2025-05-10 3:30 ` Suren Baghdasaryan
2025-05-10 3:58 ` David Wang
2025-05-10 4:03 ` Suren Baghdasaryan
2025-06-09 6:44 ` [PATCH v3 2/2] alloc_tag: keep codetag iterator active between read() David Wang
2025-06-10 16:22 ` Suren Baghdasaryan
2025-05-10 3:35 ` [PATCH v2 2/2] alloc_tag: keep codetag iterator active between read() calls David Wang
2025-05-10 3:25 ` David Wang