All of lore.kernel.org
 help / color / mirror / Atom feed
* [for-linus][PATCH 0/2] tracing/eventfs: Fixes for 6.8
@ 2024-01-23  3:08 Steven Rostedt
  2024-01-23  3:08 ` [for-linus][PATCH 1/2] tracing: Ensure visibility when inserting an element into tracing_map Steven Rostedt
  2024-01-23  3:08 ` [for-linus][PATCH 2/2] eventfs: Save directory inodes in the eventfs_inode structure Steven Rostedt
  0 siblings, 2 replies; 4+ messages in thread
From: Steven Rostedt @ 2024-01-23  3:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton

Tracing and eventfs fixes for 6.8:

- Fix histogram tracing_map insertion.
  The tracing_map_insert copies the value into the elt variable and
  then assigns the elt to the entry value. But it is possible that
  the entry value becomes visible on other CPUs before the elt is
  fully initialized. This is fixed by adding a wmb() between the
  initialization of the elt variable and assigning it.

- Have eventfs directory have unique inode numbers. Having them be
  all the same proved to be a failure as the find application will
  think that the directories are causing loops, as it checks for
  directory loops via their inodes. Have the evenfs dir entries
  get their inodes assigned when they are referenced and then save
  them in the eventfs_inode structure.

Petr Pavlu (1):
      tracing: Ensure visibility when inserting an element into tracing_map

Steven Rostedt (Google) (1):
      eventfs: Save directory inodes in the eventfs_inode structure

----
 fs/tracefs/event_inode.c   | 14 +++++++++++---
 fs/tracefs/internal.h      |  7 ++++---
 kernel/trace/tracing_map.c |  7 ++++++-
 3 files changed, 21 insertions(+), 7 deletions(-)

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [for-linus][PATCH 1/2] tracing: Ensure visibility when inserting an element into tracing_map
  2024-01-23  3:08 [for-linus][PATCH 0/2] tracing/eventfs: Fixes for 6.8 Steven Rostedt
@ 2024-01-23  3:08 ` Steven Rostedt
  2024-01-23  3:08 ` [for-linus][PATCH 2/2] eventfs: Save directory inodes in the eventfs_inode structure Steven Rostedt
  1 sibling, 0 replies; 4+ messages in thread
From: Steven Rostedt @ 2024-01-23  3:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Petr Pavlu, Tom Zanussi

From: Petr Pavlu <petr.pavlu@suse.com>

Running the following two commands in parallel on a multi-processor
AArch64 machine can sporadically produce an unexpected warning about
duplicate histogram entries:

 $ while true; do
     echo hist:key=id.syscall:val=hitcount > \
       /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
     cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
     sleep 0.001
   done
 $ stress-ng --sysbadaddr $(nproc)

The warning looks as follows:

[ 2911.172474] ------------[ cut here ]------------
[ 2911.173111] Duplicates detected: 1
[ 2911.173574] WARNING: CPU: 2 PID: 12247 at kernel/trace/tracing_map.c:983 tracing_map_sort_entries+0x3e0/0x408
[ 2911.174702] Modules linked in: iscsi_ibft(E) iscsi_boot_sysfs(E) rfkill(E) af_packet(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) ena(E) tiny_power_button(E) qemu_fw_cfg(E) button(E) fuse(E) efi_pstore(E) ip_tables(E) x_tables(E) xfs(E) libcrc32c(E) aes_ce_blk(E) aes_ce_cipher(E) crct10dif_ce(E) polyval_ce(E) polyval_generic(E) ghash_ce(E) gf128mul(E) sm4_ce_gcm(E) sm4_ce_ccm(E) sm4_ce(E) sm4_ce_cipher(E) sm4(E) sm3_ce(E) sm3(E) sha3_ce(E) sha512_ce(E) sha512_arm64(E) sha2_ce(E) sha256_arm64(E) nvme(E) sha1_ce(E) nvme_core(E) nvme_auth(E) t10_pi(E) sg(E) scsi_mod(E) scsi_common(E) efivarfs(E)
[ 2911.174738] Unloaded tainted modules: cppc_cpufreq(E):1
[ 2911.180985] CPU: 2 PID: 12247 Comm: cat Kdump: loaded Tainted: G            E      6.7.0-default #2 1b58bbb22c97e4399dc09f92d309344f69c44a01
[ 2911.182398] Hardware name: Amazon EC2 c7g.8xlarge/, BIOS 1.0 11/1/2018
[ 2911.183208] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 2911.184038] pc : tracing_map_sort_entries+0x3e0/0x408
[ 2911.184667] lr : tracing_map_sort_entries+0x3e0/0x408
[ 2911.185310] sp : ffff8000a1513900
[ 2911.185750] x29: ffff8000a1513900 x28: ffff0003f272fe80 x27: 0000000000000001
[ 2911.186600] x26: ffff0003f272fe80 x25: 0000000000000030 x24: 0000000000000008
[ 2911.187458] x23: ffff0003c5788000 x22: ffff0003c16710c8 x21: ffff80008017f180
[ 2911.188310] x20: ffff80008017f000 x19: ffff80008017f180 x18: ffffffffffffffff
[ 2911.189160] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8000a15134b8
[ 2911.190015] x14: 0000000000000000 x13: 205d373432323154 x12: 5b5d313131333731
[ 2911.190844] x11: 00000000fffeffff x10: 00000000fffeffff x9 : ffffd1b78274a13c
[ 2911.191716] x8 : 000000000017ffe8 x7 : c0000000fffeffff x6 : 000000000057ffa8
[ 2911.192554] x5 : ffff0012f6c24ec0 x4 : 0000000000000000 x3 : ffff2e5b72b5d000
[ 2911.193404] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0003ff254480
[ 2911.194259] Call trace:
[ 2911.194626]  tracing_map_sort_entries+0x3e0/0x408
[ 2911.195220]  hist_show+0x124/0x800
[ 2911.195692]  seq_read_iter+0x1d4/0x4e8
[ 2911.196193]  seq_read+0xe8/0x138
[ 2911.196638]  vfs_read+0xc8/0x300
[ 2911.197078]  ksys_read+0x70/0x108
[ 2911.197534]  __arm64_sys_read+0x24/0x38
[ 2911.198046]  invoke_syscall+0x78/0x108
[ 2911.198553]  el0_svc_common.constprop.0+0xd0/0xf8
[ 2911.199157]  do_el0_svc+0x28/0x40
[ 2911.199613]  el0_svc+0x40/0x178
[ 2911.200048]  el0t_64_sync_handler+0x13c/0x158
[ 2911.200621]  el0t_64_sync+0x1a8/0x1b0
[ 2911.201115] ---[ end trace 0000000000000000 ]---

The problem appears to be caused by CPU reordering of writes issued from
__tracing_map_insert().

The check for the presence of an element with a given key in this
function is:

 val = READ_ONCE(entry->val);
 if (val && keys_match(key, val->key, map->key_size)) ...

The write of a new entry is:

 elt = get_free_elt(map);
 memcpy(elt->key, key, map->key_size);
 entry->val = elt;

The "memcpy(elt->key, key, map->key_size);" and "entry->val = elt;"
stores may become visible in the reversed order on another CPU. This
second CPU might then incorrectly determine that a new key doesn't match
an already present val->key and subsequently insert a new element,
resulting in a duplicate.

Fix the problem by adding a write barrier between
"memcpy(elt->key, key, map->key_size);" and "entry->val = elt;", and for
good measure, also use WRITE_ONCE(entry->val, elt) for publishing the
element. The sequence pairs with the mentioned "READ_ONCE(entry->val);"
and the "val->key" check which has an address dependency.

The barrier is placed on a path executed when adding an element for
a new key. Subsequent updates targeting the same key remain unaffected.

From the user's perspective, the issue was introduced by commit
c193707dde77 ("tracing: Remove code which merges duplicates"), which
followed commit cbf4100efb8f ("tracing: Add support to detect and avoid
duplicates"). The previous code operated differently; it inherently
expected potential races which result in duplicates but merged them
later when they occurred.

Link: https://lore.kernel.org/linux-trace-kernel/20240122150928.27725-1-petr.pavlu@suse.com

Fixes: c193707dde77 ("tracing: Remove code which merges duplicates")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Tom Zanussi <tom.zanussi@linux.intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/tracing_map.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/tracing_map.c b/kernel/trace/tracing_map.c
index c774e560f2f9..a4dcf0f24352 100644
--- a/kernel/trace/tracing_map.c
+++ b/kernel/trace/tracing_map.c
@@ -574,7 +574,12 @@ __tracing_map_insert(struct tracing_map *map, void *key, bool lookup_only)
 				}
 
 				memcpy(elt->key, key, map->key_size);
-				entry->val = elt;
+				/*
+				 * Ensure the initialization is visible and
+				 * publish the elt.
+				 */
+				smp_wmb();
+				WRITE_ONCE(entry->val, elt);
 				atomic64_inc(&map->hits);
 
 				return entry->val;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [for-linus][PATCH 2/2] eventfs: Save directory inodes in the eventfs_inode structure
  2024-01-23  3:08 [for-linus][PATCH 0/2] tracing/eventfs: Fixes for 6.8 Steven Rostedt
  2024-01-23  3:08 ` [for-linus][PATCH 1/2] tracing: Ensure visibility when inserting an element into tracing_map Steven Rostedt
@ 2024-01-23  3:08 ` Steven Rostedt
  2024-01-23  9:16   ` Geert Uytterhoeven
  1 sibling, 1 reply; 4+ messages in thread
From: Steven Rostedt @ 2024-01-23  3:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
	Linus Torvalds, Geert Uytterhoeven, Kees Cook

From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

The eventfs inodes and directories are allocated when referenced. But this
leaves the issue of keeping consistent inode numbers and the number is
only saved in the inode structure itself. When the inode is no longer
referenced, it can be freed. When the file that the inode was representing
is referenced again, the inode is once again created, but the inode number
needs to be the same as it was before.

Just making the inode numbers the same for all files is fine, but that
does not work with directories. The find command will check for loops via
the inode number and having the same inode number for directories triggers:

  # find /sys/kernel/tracing
find: File system loop detected;
'/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as
'/sys/kernel/debug/tracing/events/initcall'.
[..]

Linus pointed out that the eventfs_inode structure ends with a single
32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
alignment. We can use this hole to store the inode number for the
eventfs_inode. All directories in eventfs are represented by an
eventfs_inode and that data structure can hold its inode number.

That last int was also purposely placed at the end of the structure to
prevent holes from within. Now that there's a 4 byte number to hold the
inode, both the inode number and the last integer can be moved up in the
structure for better cache locality, where the llist and rcu fields can be
moved to the end as they are only used when the eventfs_inode is being
deleted.

Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20240122152748.46897388@gandalf.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories all be the same")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/tracefs/event_inode.c | 14 +++++++++++---
 fs/tracefs/internal.h    |  7 ++++---
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6795fda2af19..6b211522a13e 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -34,7 +34,15 @@ static DEFINE_MUTEX(eventfs_mutex);
 
 /* Choose something "unique" ;-) */
 #define EVENTFS_FILE_INODE_INO		0x12c4e37
-#define EVENTFS_DIR_INODE_INO		0x134b2f5
+
+/* Just try to make something consistent and unique */
+static int eventfs_dir_ino(struct eventfs_inode *ei)
+{
+	if (!ei->ino)
+		ei->ino = get_next_ino();
+
+	return ei->ino;
+}
 
 /*
  * The eventfs_inode (ei) itself is protected by SRCU. It is released from
@@ -396,7 +404,7 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_fop = &eventfs_file_operations;
 
 	/* All directories will have the same inode number */
-	inode->i_ino = EVENTFS_DIR_INODE_INO;
+	inode->i_ino = eventfs_dir_ino(ei);
 
 	ti = get_tracefs(inode);
 	ti->flags |= TRACEFS_EVENT_INODE;
@@ -802,7 +810,7 @@ static int eventfs_iterate(struct file *file, struct dir_context *ctx)
 
 		name = ei_child->name;
 
-		ino = EVENTFS_DIR_INODE_INO;
+		ino = eventfs_dir_ino(ei_child);
 
 		if (!dir_emit(ctx, name, strlen(name), ino, DT_DIR))
 			goto out_dec;
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 12b7d0150ae9..45397df9bb65 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -55,6 +55,10 @@ struct eventfs_inode {
 	struct eventfs_attr		*entry_attrs;
 	struct eventfs_attr		attr;
 	void				*data;
+	unsigned int			is_freed:1;
+	unsigned int			is_events:1;
+	unsigned int			nr_entries:30;
+	unsigned int			ino;
 	/*
 	 * Union - used for deletion
 	 * @llist:	for calling dput() if needed after RCU
@@ -64,9 +68,6 @@ struct eventfs_inode {
 		struct llist_node	llist;
 		struct rcu_head		rcu;
 	};
-	unsigned int			is_freed:1;
-	unsigned int			is_events:1;
-	unsigned int			nr_entries:30;
 };
 
 static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [for-linus][PATCH 2/2] eventfs: Save directory inodes in the eventfs_inode structure
  2024-01-23  3:08 ` [for-linus][PATCH 2/2] eventfs: Save directory inodes in the eventfs_inode structure Steven Rostedt
@ 2024-01-23  9:16   ` Geert Uytterhoeven
  0 siblings, 0 replies; 4+ messages in thread
From: Geert Uytterhoeven @ 2024-01-23  9:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers,
	Andrew Morton, Linus Torvalds, Kees Cook

On Tue, Jan 23, 2024 at 4:13 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> The eventfs inodes and directories are allocated when referenced. But this
> leaves the issue of keeping consistent inode numbers and the number is
> only saved in the inode structure itself. When the inode is no longer
> referenced, it can be freed. When the file that the inode was representing
> is referenced again, the inode is once again created, but the inode number
> needs to be the same as it was before.
>
> Just making the inode numbers the same for all files is fine, but that
> does not work with directories. The find command will check for loops via
> the inode number and having the same inode number for directories triggers:
>
>   # find /sys/kernel/tracing
> find: File system loop detected;
> '/sys/kernel/debug/tracing/events/initcall/initcall_finish' is part of the same file system loop as
> '/sys/kernel/debug/tracing/events/initcall'.
> [..]
>
> Linus pointed out that the eventfs_inode structure ends with a single
> 32bit int, and on 64 bit machines, there's likely a 4 byte hole due to
> alignment. We can use this hole to store the inode number for the
> eventfs_inode. All directories in eventfs are represented by an
> eventfs_inode and that data structure can hold its inode number.
>
> That last int was also purposely placed at the end of the structure to
> prevent holes from within. Now that there's a 4 byte number to hold the
> inode, both the inode number and the last integer can be moved up in the
> structure for better cache locality, where the llist and rcu fields can be
> moved to the end as they are only used when the eventfs_inode is being
> deleted.
>
> Link: https://lore.kernel.org/all/CAMuHMdXKiorg-jiuKoZpfZyDJ3Ynrfb8=X+c7x0Eewxn-YRdCA@mail.gmail.com/
> Link: https://lore.kernel.org/linux-trace-kernel/20240122152748.46897388@gandalf.local.home
>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
> Fixes: 53c41052ba31 ("eventfs: Have the inodes all for files and directories all be the same")
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Reviewed-by: Kees Cook <keescook@chromium.org>

Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-01-23  9:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-23  3:08 [for-linus][PATCH 0/2] tracing/eventfs: Fixes for 6.8 Steven Rostedt
2024-01-23  3:08 ` [for-linus][PATCH 1/2] tracing: Ensure visibility when inserting an element into tracing_map Steven Rostedt
2024-01-23  3:08 ` [for-linus][PATCH 2/2] eventfs: Save directory inodes in the eventfs_inode structure Steven Rostedt
2024-01-23  9:16   ` Geert Uytterhoeven

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.