public inbox for linux-fsdevel@vger.kernel.org
* [RFC PATCH 0/3] ipc/mqueue: add fcntl(F_MQ_PEEK) for non-destructive message inspection
@ 2026-03-25 19:00 Shaurya Rane
  2026-03-25 19:00 ` [RFC PATCH 1/3] mqueue: uapi: add struct mq_peek_attr and F_MQ_PEEK Shaurya Rane
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Shaurya Rane @ 2026-03-25 19:00 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: manfred, viro, brauner, chuck.lever, jlayton, rstoyanov,
	ptikhomirov, Shaurya Rane

POSIX message queues currently have no way to inspect queued messages
without consuming them.  mq_receive() always dequeues the message it
returns.  This makes it impossible for checkpoint/restore tools such as
CRIU to save and replay message queue contents without first destroying
the queue state.

This series adds F_MQ_PEEK, a new fcntl command that reads a message
from a POSIX message queue by index (0 = highest priority, matching
mq_receive dequeue order) without removing it.

Background
----------
CRIU (Checkpoint/Restore In Userspace) supports live container migration
and process checkpoint/restore.  It handles SysV message queues, shared
memory, semaphores, pipes, and sockets, but POSIX mqueues remain
unsupported because there is no kernel interface to inspect queued
messages non-destructively.

The SysV analogue for this problem was solved in 2012: the MSG_COPY flag
was added to msgrcv() in commit 4a674f34ba04 ("ipc: introduce message
queue copy feature") specifically for CRIU.  This series does the same
for POSIX mqueues.

Why fcntl and not a new syscall
---------------------------------
POSIX mqueue file descriptors are first-class Linux file descriptors.
fcntl() is the established interface for per-fd operations beyond
read/write: F_GETPIPE_SZ queries pipe capacity, F_ADD_SEALS/F_GET_SEALS
control memfd sealing, and F_GETDELEG/F_SETDELEG manage file delegations.
F_MQ_PEEK follows the same pattern.

Adding a flag to mq_timedreceive() is not possible without ABI breakage:
the POSIX signature has no flags parameter.  A new syscall is unnecessary
when fcntl() covers the use case cleanly.

Series overview
---------------
Patch 1 adds struct mq_peek_attr and F_MQ_PEEK to the UAPI headers.

Patch 2 moves struct msg_msgseg and DATALEN_MSG/DATALEN_SEG from the
private ipc/msgutil.c to include/linux/msg.h.  This is a pure
refactoring (no functional change) that allows ipc/mqueue.c to walk
multi-segment message chains under a spinlock using only memcpy()
(copy_to_user() cannot be called under a spinlock).

Patch 3 implements do_mq_peek() in ipc/mqueue.c and dispatches it from
fs/fcntl.c.  The implementation holds info->lock for the tree traversal
and kernel-buffer copy, then releases the lock before copy_to_user().

A corresponding CRIU userspace patch will be sent to the CRIU mailing
list separately.  It uses F_MQ_PEEK when the kernel supports it,
falling back to a ptrace-safe intrusive drain/re-enqueue on older
kernels.

Link: https://github.com/checkpoint-restore/criu/issues/2285

Shaurya Rane (3):
  mqueue: uapi: add struct mq_peek_attr and F_MQ_PEEK
  msg: move struct msg_msgseg and DATALEN_* to include/linux/msg.h
  ipc/mqueue: implement fcntl(F_MQ_PEEK) for non-destructive message inspection

 fs/fcntl.c                  |   4 +
 include/linux/mqueue.h      |  19 ++++++
 include/linux/msg.h         |  13 ++++
 include/uapi/linux/fcntl.h  |   6 ++
 include/uapi/linux/mqueue.h |  21 ++++++
 ipc/mqueue.c                | 129 ++++++++++++++++++++++++++++++++++++
 ipc/msgutil.c               |   7 --
 7 files changed, 192 insertions(+), 7 deletions(-)

-- 
2.43.0


* [RFC PATCH 1/3] mqueue: uapi: add struct mq_peek_attr and F_MQ_PEEK
  2026-03-25 19:00 [RFC PATCH 0/3] ipc/mqueue: add fcntl(F_MQ_PEEK) for non-destructive message inspection Shaurya Rane
@ 2026-03-25 19:00 ` Shaurya Rane
  2026-03-25 19:00 ` [RFC PATCH 2/3] msg: move struct msg_msgseg and DATALEN_* to include/linux/msg.h Shaurya Rane
  2026-03-25 19:00 ` [RFC PATCH 3/3] ipc/mqueue: implement fcntl(F_MQ_PEEK) for non-destructive message inspection Shaurya Rane
  2 siblings, 0 replies; 4+ messages in thread
From: Shaurya Rane @ 2026-03-25 19:00 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: manfred, viro, brauner, chuck.lever, jlayton, rstoyanov,
	ptikhomirov, Shaurya Rane

Add the user-visible interface for non-destructive POSIX message queue
inspection via fcntl(2).

POSIX message queues have no way to inspect queued messages without
consuming them: mq_receive() always dequeues the message it returns.
This makes it impossible for checkpoint/restore tools such as CRIU to
save and replay message queue contents without destroying the queue
state in the process.

struct mq_peek_attr describes the request: the caller specifies an
index into the queue in receive order (0 = next message that
mq_receive() would return, i.e. highest priority, FIFO within same
priority) and a buffer to receive the payload.  On return, msg_prio is
filled with the message priority and the return value is the number of
bytes copied.

F_MQ_PEEK = F_LINUX_SPECIFIC_BASE + 17 is the new fcntl command that
accepts a pointer to struct mq_peek_attr.
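
For illustration only, a minimal userspace sketch of how a caller might
issue the request described above.  The struct layout and the command
value are copied from this RFC and may change; released headers do not
carry them, so they are redefined locally.  The helper name mq_peek()
is hypothetical, and the sketch assumes that the mqd_t returned by
mq_open() is a plain file descriptor, as it is on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Local copies of the proposed UAPI (assumption: taken from this RFC,
 * not from any released kernel header).
 */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE 1024
#endif
#ifndef F_MQ_PEEK
#define F_MQ_PEEK (F_LINUX_SPECIFIC_BASE + 17)
#endif

struct mq_peek_attr {
	int32_t   offset;    /* in:  index in receive order, 0 = next to dequeue */
	uint32_t  msg_prio;  /* out: priority of the message at that index */
	size_t    buf_len;   /* in:  size of the buffer at buf */
	char     *buf;       /* out: payload, truncated to buf_len */
};

/*
 * Hypothetical helper: peek at the message at `index` without removing it.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static long mq_peek(int mqfd, int index, char *buf, size_t len,
		    unsigned *prio)
{
	struct mq_peek_attr attr = {
		.offset  = index,
		.buf_len = len,
		.buf     = buf,
	};
	long n;

	assert(buf != NULL && len > 0);

	n = fcntl(mqfd, F_MQ_PEEK, &attr);
	if (n >= 0 && prio)
		*prio = attr.msg_prio;
	return n;
}
```

On a kernel without this series, the call should fail with EINVAL for a
valid descriptor, which is one way a caller could detect that the
command is unavailable.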

Link: https://github.com/checkpoint-restore/criu/issues/2285
Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
---
 include/uapi/linux/fcntl.h  |  6 ++++++
 include/uapi/linux/mqueue.h | 21 +++++++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index aadfbf6e0cb3..ea34f87de0fb 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -84,6 +84,12 @@
 #define F_GETDELEG		(F_LINUX_SPECIFIC_BASE + 15)
 #define F_SETDELEG		(F_LINUX_SPECIFIC_BASE + 16)
 
+/*
+ * Peek at a POSIX message queue message by index without consuming it.
+ * Argument is a pointer to struct mq_peek_attr (see <linux/mqueue.h>).
+ */
+#define F_MQ_PEEK		(F_LINUX_SPECIFIC_BASE + 17)
+
 /* Argument structure for F_GETDELEG and F_SETDELEG */
 struct delegation {
 	__u32	d_flags;	/* Must be 0 */
diff --git a/include/uapi/linux/mqueue.h b/include/uapi/linux/mqueue.h
index b516b66840ad..7133b84c70d1 100644
--- a/include/uapi/linux/mqueue.h
+++ b/include/uapi/linux/mqueue.h
@@ -53,4 +53,25 @@ struct mq_attr {
 
 #define NOTIFY_COOKIE_LEN	32
 
+/*
+ * Argument structure for fcntl(F_MQ_PEEK).
+ *
+ * Peek at a POSIX message queue message by index without removing it.
+ * @offset:   Index in receive order (0 = highest priority, next to dequeue).
+ *            FIFO ordering is preserved within the same priority level.
+ * @msg_prio: Output: priority of the message at @offset.
+ * @buf_len:  Size of the caller-provided buffer at @buf.
+ * @buf:      Output: message payload is written here; truncated to @buf_len
+ *            bytes if the message is larger.
+ *
+ * Returns the number of bytes copied on success, -ENOMSG if @offset is
+ * >= mq_curmsgs, or a negative error code on failure.
+ */
+struct mq_peek_attr {
+	__s32		 offset;
+	__u32		 msg_prio;
+	__kernel_size_t  buf_len;
+	char __user	*buf;
+};
+
 #endif
-- 
2.34.1



* [RFC PATCH 2/3] msg: move struct msg_msgseg and DATALEN_* to include/linux/msg.h
  2026-03-25 19:00 [RFC PATCH 0/3] ipc/mqueue: add fcntl(F_MQ_PEEK) for non-destructive message inspection Shaurya Rane
  2026-03-25 19:00 ` [RFC PATCH 1/3] mqueue: uapi: add struct mq_peek_attr and F_MQ_PEEK Shaurya Rane
@ 2026-03-25 19:00 ` Shaurya Rane
  2026-03-25 19:00 ` [RFC PATCH 3/3] ipc/mqueue: implement fcntl(F_MQ_PEEK) for non-destructive message inspection Shaurya Rane
  2 siblings, 0 replies; 4+ messages in thread
From: Shaurya Rane @ 2026-03-25 19:00 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: manfred, viro, brauner, chuck.lever, jlayton, rstoyanov,
	ptikhomirov, Shaurya Rane

struct msg_msgseg and the DATALEN_MSG / DATALEN_SEG macros are
currently private to ipc/msgutil.c.  struct msg_msg (already in the
public kernel header include/linux/msg.h) carries a pointer to
msg_msgseg, making it an incomplete type for all callers outside
msgutil.c.

Move the definition of struct msg_msgseg and the two DATALEN macros to
include/linux/msg.h so that other IPC code can safely copy
multi-segment message payloads into a kernel buffer under a spinlock,
without calling store_msg() which performs copy_to_user() and therefore
cannot be used under a spinlock.

ipc/msgutil.c already includes <linux/msg.h>, so it picks up the
definitions from the header with no functional change.
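
As a rough userspace model of the layout involved (not the kernel
code), the sketch below stores a payload in a head block followed by
chained overflow blocks and copies it back out with memcpy() only.
MODEL_PAGE, struct msg, struct seg and the helper names are invented
for the example; the real struct msg_msg has more fields and the real
block size is PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE 64	/* stand-in for PAGE_SIZE, small to force chaining */

struct seg {		/* models struct msg_msgseg */
	struct seg *next;
	/* payload bytes follow immediately */
};

struct msg {		/* models struct msg_msg, most fields omitted */
	size_t len;	/* models m_ts */
	struct seg *next;
	/* payload bytes follow immediately */
};

#define LEN_MSG ((size_t)MODEL_PAGE - sizeof(struct msg))
#define LEN_SEG ((size_t)MODEL_PAGE - sizeof(struct seg))

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

static void msg_free(struct msg *m)
{
	struct seg *s = m ? m->next : NULL;

	while (s) {
		struct seg *n = s->next;
		free(s);
		s = n;
	}
	free(m);
}

/* Split a flat buffer into a head block plus chained overflow blocks. */
static struct msg *msg_build(const void *src, size_t len)
{
	struct msg *m = malloc(MODEL_PAGE);
	struct seg **link;
	const char *p = src;
	size_t n;

	if (!m)
		return NULL;
	m->len = len;
	m->next = NULL;
	n = min_sz(len, LEN_MSG);
	memcpy(m + 1, p, n);
	p += n;
	len -= n;

	for (link = &m->next; len > 0; link = &(*link)->next) {
		struct seg *s = malloc(MODEL_PAGE);

		if (!s) {
			msg_free(m);
			return NULL;
		}
		s->next = NULL;
		n = min_sz(len, LEN_SEG);
		memcpy(s + 1, p, n);
		p += n;
		len -= n;
		*link = s;
	}
	return m;
}

/* Copy up to dst_len payload bytes into a flat buffer; memcpy() only. */
static size_t msg_flatten(const struct msg *m, void *dst, size_t dst_len)
{
	size_t left = min_sz(dst_len, m->len), done = 0, n;
	const struct seg *s;

	n = min_sz(left, LEN_MSG);
	memcpy(dst, m + 1, n);
	done += n;
	left -= n;

	for (s = m->next; s && left > 0; s = s->next) {
		n = min_sz(left, LEN_SEG);
		memcpy((char *)dst + done, s + 1, n);
		done += n;
		left -= n;
	}
	return done;
}
```

Walking the chain this way requires the complete definition of the
segment type and both length macros, which is why they have to be
visible outside msgutil.c.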

Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
---
 include/linux/msg.h | 13 +++++++++++++
 ipc/msgutil.c       |  7 -------
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/msg.h b/include/linux/msg.h
index 9a972a296b95..2d5353bace9a 100644
--- a/include/linux/msg.h
+++ b/include/linux/msg.h
@@ -5,6 +5,19 @@
 #include <linux/list.h>
 #include <uapi/linux/msg.h>
 
+/*
+ * Each message is stored in one or more page-sized segments.
+ * The first segment is embedded in struct msg_msg; overflow goes into
+ * chained struct msg_msgseg blocks.
+ */
+struct msg_msgseg {
+	struct msg_msgseg *next;
+	/* message data follows immediately */
+};
+
+#define DATALEN_MSG	((size_t)PAGE_SIZE - sizeof(struct msg_msg))
+#define DATALEN_SEG	((size_t)PAGE_SIZE - sizeof(struct msg_msgseg))
+
 /* one msg_msg structure for each message */
 struct msg_msg {
 	struct list_head m_list;
diff --git a/ipc/msgutil.c b/ipc/msgutil.c
index e28f0cecb2ec..9cd4b078d55c 100644
--- a/ipc/msgutil.c
+++ b/ipc/msgutil.c
@@ -31,13 +31,6 @@ struct ipc_namespace init_ipc_ns = {
 	.user_ns = &init_user_ns,
 };
 
-struct msg_msgseg {
-	struct msg_msgseg *next;
-	/* the next part of the message follows immediately */
-};
-
-#define DATALEN_MSG	((size_t)PAGE_SIZE-sizeof(struct msg_msg))
-#define DATALEN_SEG	((size_t)PAGE_SIZE-sizeof(struct msg_msgseg))
 
 static kmem_buckets *msg_buckets __ro_after_init;
 
-- 
2.34.1



* [RFC PATCH 3/3] ipc/mqueue: implement fcntl(F_MQ_PEEK) for non-destructive message inspection
  2026-03-25 19:00 [RFC PATCH 0/3] ipc/mqueue: add fcntl(F_MQ_PEEK) for non-destructive message inspection Shaurya Rane
  2026-03-25 19:00 ` [RFC PATCH 1/3] mqueue: uapi: add struct mq_peek_attr and F_MQ_PEEK Shaurya Rane
  2026-03-25 19:00 ` [RFC PATCH 2/3] msg: move struct msg_msgseg and DATALEN_* to include/linux/msg.h Shaurya Rane
@ 2026-03-25 19:00 ` Shaurya Rane
  2 siblings, 0 replies; 4+ messages in thread
From: Shaurya Rane @ 2026-03-25 19:00 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: manfred, viro, brauner, chuck.lever, jlayton, rstoyanov,
	ptikhomirov, Shaurya Rane

Add support for F_MQ_PEEK, a new fcntl command that reads a POSIX
message queue message by index without removing it from the queue.

Background:
CRIU (Checkpoint/Restore In Userspace) supports live container migration
and process checkpoint/restore.  POSIX message queues are a widely used
IPC mechanism, but CRIU cannot checkpoint processes that hold open mqueue
file descriptors: there is no kernel interface to inspect queued messages
non-destructively.  The SysV IPC analogue (MSG_COPY for msgrcv) was
introduced specifically for CRIU in commit 4a674f34ba04 ("ipc: introduce
message queue copy feature").  This patch provides the equivalent for
POSIX mqueues.

Implementation:
The queue stores messages in a red-black tree (info->msg_tree) keyed
by priority, with each tree node holding a FIFO list of messages at
that priority level.  mq_peek_at_offset() walks this structure in
receive order (highest priority first, FIFO within priority) to locate
the message at the requested index without modifying any state.
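
The walk can be modelled in userspace as follows.  This is a
simplified sketch, not the kernel code: a sorted array stands in for
the red-black tree, a singly linked list stands in for the per-priority
FIFO, and all type and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

struct model_msg {
	const char *text;
	struct model_msg *next;	/* next newer message at the same priority */
};

struct model_level {
	unsigned prio;
	struct model_msg *head;	/* oldest message at this priority */
};

/*
 * levels[] is sorted by ascending priority, like an in-order tree walk.
 * Return the message at position `index` in receive order (highest
 * priority first, oldest first within a priority), or NULL if `index`
 * is out of range.  Nothing is modified.
 */
static const struct model_msg *
peek_at(const struct model_level *levels, size_t nlevels, int index)
{
	const struct model_msg *m;
	int count = 0;
	size_t i;

	if (index < 0)
		return NULL;

	for (i = nlevels; i-- > 0; ) {		/* highest priority first */
		for (m = levels[i].head; m; m = m->next) {
			if (count == index)
				return m;
			count++;
		}
	}
	return NULL;
}
```

With priority 1 holding "a" then "b" and priority 5 holding "c", index
0 yields "c", index 1 yields "a", index 2 yields "b", and index 3 is
out of range.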

Message payload is copied into a kvmalloc'd kernel buffer under
info->lock using pure memcpy() (no page faults possible).  This
correctly handles multi-segment messages by walking the msg_msgseg
chain.  The lock is released before copy_to_user() transfers the
kernel buffer to userspace.
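
The general shape of this pattern, sketched in userspace under stated
assumptions: a C11 atomic_flag plays the role of info->lock, malloc()
plays the role of kvmalloc(), and a caller-supplied sink function
plays the role of copy_to_user().  None of these names come from the
patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

struct shared_slot {
	atomic_flag lock;	/* stands in for info->lock */
	size_t len;
	char data[64];
};

static void slot_lock(struct shared_slot *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock,
						 memory_order_acquire))
		;	/* spin */
}

static void slot_unlock(struct shared_slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Stands in for copy_to_user(): may be slow; returns 0 on success. */
typedef int (*sink_fn)(void *dst, const void *src, size_t n);

static int plain_sink(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* Returns bytes delivered, or -1 on error.  The slot is not modified. */
static long slot_peek(struct shared_slot *s, void *dst, size_t dst_len,
		      sink_fn sink)
{
	size_t cap = dst_len < sizeof(s->data) ? dst_len : sizeof(s->data);
	char *bounce;
	size_t n;
	long ret;

	if (!cap)
		return -1;

	bounce = malloc(cap);	/* allocate before taking the lock */
	if (!bounce)
		return -1;

	slot_lock(s);
	n = s->len < cap ? s->len : cap;
	memcpy(bounce, s->data, n);	/* only memcpy() while locked */
	slot_unlock(s);

	/* The potentially slow transfer happens with the lock dropped. */
	ret = sink(dst, bounce, n) ? -1 : (long)n;
	free(bounce);
	return ret;
}
```

The cost of this approach is one extra copy and one allocation per
peek; in exchange, the locked region contains nothing that can sleep
or fault.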

A new include/linux/mqueue.h kernel header is added to declare
do_mq_peek() for use from fs/fcntl.c, following the same pattern as
include/linux/memfd.h for memfd_fcntl().

Concurrency:
The snapshot is consistent within the spin_lock() critical section.
Between two F_MQ_PEEK calls the queue may change (messages may be sent
or received).  This is documented snapshot semantics, analogous to
/proc entries.  CRIU freezes the target process via ptrace before
dumping, so in practice the queue is stable for the entire checkpoint
sequence.
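
To make the consequence for a dumper concrete, here is a hedged sketch
of an index-by-index dump loop.  It is not CRIU code.  The peek step is
passed in as a function pointer so the loop can be exercised against a
fake queue; in real use it would wrap fcntl(F_MQ_PEEK).  All names are
hypothetical.  Because each iteration is a separate call, the result
is only a coherent image if nothing sends or receives in between.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Returns bytes copied, or -1 with errno set (ENOMSG past the end). */
typedef long (*peek_fn)(void *queue, int index, char *buf, size_t len,
			unsigned *prio);
/* Stores one message in the image; returns 0 on success. */
typedef int (*emit_fn)(void *image, unsigned prio, const char *buf,
		       size_t len);

/* Returns the number of messages dumped, or -1 on error. */
static int dump_queue(void *queue, peek_fn peek, void *image, emit_fn emit,
		      char *buf, size_t len)
{
	int index;

	for (index = 0; ; index++) {
		unsigned prio = 0;
		long n = peek(queue, index, buf, len, &prio);

		if (n < 0)
			return errno == ENOMSG ? index : -1;
		if (emit(image, prio, buf, (size_t)n))
			return -1;
	}
}

/* Fake queue and image used only to exercise the loop. */
struct fake_queue {
	const char *const *msgs;
	int count;
};

static long fake_peek(void *queue, int index, char *buf, size_t len,
		      unsigned *prio)
{
	struct fake_queue *fq = queue;
	size_t n;

	if (index >= fq->count) {
		errno = ENOMSG;
		return -1;
	}
	n = strlen(fq->msgs[index]);
	if (n > len)
		n = len;
	memcpy(buf, fq->msgs[index], n);
	*prio = 0;
	return (long)n;
}

static int count_emit(void *image, unsigned prio, const char *buf,
		      size_t len)
{
	(void)prio;
	(void)buf;
	*(size_t *)image += len;	/* the "image" is just a byte counter */
	return 0;
}
```

A real dumper would size the buffer from mq_msgsize (available through
mq_getattr()) so that no message is truncated.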

Link: https://github.com/checkpoint-restore/criu/issues/2285
Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in>
---
 fs/fcntl.c             |   4 ++
 include/linux/mqueue.h |  19 ++++++
 ipc/mqueue.c           | 129 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 152 insertions(+)
 create mode 100644 include/linux/mqueue.h

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f93dbca08435..32d0dcc8e544 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -24,6 +24,7 @@
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
 #include <linux/memfd.h>
+#include <linux/mqueue.h>
 #include <linux/compat.h>
 #include <linux/mount.h>
 #include <linux/rw_hint.h>
@@ -563,6 +564,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 			return -EFAULT;
 		err = fcntl_setdeleg(fd, filp, &deleg);
 		break;
+	case F_MQ_PEEK:
+		err = do_mq_peek(filp, argp);
+		break;
 	default:
 		break;
 	}
diff --git a/include/linux/mqueue.h b/include/linux/mqueue.h
new file mode 100644
index 000000000000..a725fcf90d39
--- /dev/null
+++ b/include/linux/mqueue.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_MQUEUE_H
+#define __LINUX_MQUEUE_H
+
+#include <uapi/linux/mqueue.h>
+
+struct file;
+
+#ifdef CONFIG_POSIX_MQUEUE
+long do_mq_peek(struct file *filp, struct mq_peek_attr __user *uattr);
+#else
+static inline long do_mq_peek(struct file *filp,
+			       struct mq_peek_attr __user *uattr)
+{
+	return -EBADF;
+}
+#endif /* CONFIG_POSIX_MQUEUE */
+
+#endif /* __LINUX_MQUEUE_H */
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index bb7c9e5d2b90..5e73864a9657 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -286,6 +286,135 @@ static inline struct msg_msg *msg_get(struct mqueue_inode_info *info)
 	return msg;
 }
 
+/*
+ * mq_peek_at_offset - locate a message by receive-order index.
+ *
+ * Walk the priority tree from highest to lowest priority, and within each
+ * priority level in FIFO order, returning the message at position @offset
+ * (0 = next message that mq_receive() would dequeue).
+ *
+ * Must be called with info->lock held.  Does not modify queue state.
+ * Returns NULL if @offset >= mq_curmsgs.
+ */
+static struct msg_msg *mq_peek_at_offset(struct mqueue_inode_info *info,
+					 int offset)
+{
+	struct posix_msg_tree_node *leaf;
+	struct rb_node *node;
+	struct msg_msg *msg;
+	int count = 0;
+
+	for (node = info->msg_tree_rightmost; node; node = rb_prev(node)) {
+		leaf = rb_entry(node, struct posix_msg_tree_node, rb_node);
+		list_for_each_entry(msg, &leaf->msg_list, m_list) {
+			if (count == offset)
+				return msg;
+			count++;
+		}
+	}
+	return NULL;
+}
+
+/*
+ * mq_msg_copy_to_buf - copy message payload into a flat kernel buffer.
+ *
+ * Handles multi-segment messages by walking the msg_msgseg chain.
+ * Uses only memcpy() so it is safe to call under info->lock.
+ * Returns the number of bytes copied.
+ */
+static size_t mq_msg_copy_to_buf(struct msg_msg *msg, void *buf, size_t buf_len)
+{
+	size_t alen, to_copy, copied = 0;
+	struct msg_msgseg *seg;
+
+	to_copy = min(buf_len, msg->m_ts);
+
+	alen = min(to_copy, DATALEN_MSG);
+	memcpy(buf, msg + 1, alen);
+	copied += alen;
+	to_copy -= alen;
+
+	for (seg = msg->next; seg && to_copy > 0; seg = seg->next) {
+		alen = min(to_copy, DATALEN_SEG);
+		memcpy((char *)buf + copied, seg + 1, alen);
+		copied += alen;
+		to_copy -= alen;
+	}
+	return copied;
+}
+
+/*
+ * do_mq_peek - implement fcntl(F_MQ_PEEK).
+ *
+ * Read the message at position @attr.offset in receive order from the
+ * queue without removing it.  Position 0 is the message that the next
+ * mq_receive() would return (highest priority, FIFO within priority).
+ *
+ * The snapshot is consistent within the spin_lock() critical section.
+ * Between two F_MQ_PEEK calls the queue may change; this is documented
+ * snapshot semantics analogous to /proc entries.
+ *
+ * Returns bytes copied on success, -ENOMSG if offset >= mq_curmsgs.
+ */
+long do_mq_peek(struct file *filp, struct mq_peek_attr __user *uattr)
+{
+	struct mqueue_inode_info *info;
+	struct mq_peek_attr attr;
+	struct msg_msg *msg;
+	void *kbuf;
+	long ret;
+
+	if (filp->f_op != &mqueue_file_operations)
+		return -EBADF;
+
+	if (!(filp->f_mode & FMODE_READ))
+		return -EBADF;
+
+	if (copy_from_user(&attr, uattr, sizeof(attr)))
+		return -EFAULT;
+
+	if (attr.offset < 0 || !attr.buf_len || !attr.buf)
+		return -EINVAL;
+
+	info = MQUEUE_I(file_inode(filp));
+
+	/*
+	 * Allocate the kernel copy buffer before taking the spinlock.
+	 * Cap at mq_msgsize: no message can exceed it.
+	 */
+	kbuf = kvmalloc(min_t(size_t, attr.buf_len, info->attr.mq_msgsize),
+			GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	spin_lock(&info->lock);
+
+	msg = mq_peek_at_offset(info, attr.offset);
+	if (!msg) {
+		spin_unlock(&info->lock);
+		kvfree(kbuf);
+		return -ENOMSG;
+	}
+
+	/*
+	 * Copy the payload under the lock using pure memcpy() (no page
+	 * faults), then transfer to userspace after releasing the lock.
+	 */
+	ret = mq_msg_copy_to_buf(msg, kbuf,
+				 min_t(size_t, attr.buf_len,
+				       info->attr.mq_msgsize));
+	attr.msg_prio = msg->m_type;
+
+	spin_unlock(&info->lock);
+
+	if (copy_to_user(attr.buf, kbuf, ret) ||
+	    copy_to_user(uattr, &attr, sizeof(attr)))
+		ret = -EFAULT;
+
+	kvfree(kbuf);
+	return ret;
+}
+
 static struct inode *mqueue_get_inode(struct super_block *sb,
 		struct ipc_namespace *ipc_ns, umode_t mode,
 		struct mq_attr *attr)
-- 
2.34.1


