* [RFC v7 1/2] epoll: Implement eventpoll_replace_file()
@ 2023-05-24 6:39 aloktiagi
2023-05-24 6:39 ` [RFC v7 2/2] seccomp: replace existing file in the epoll interface by a new file injected by the syscall supervisor aloktiagi
2023-05-24 17:26 ` [RFC v7 1/2] epoll: Implement eventpoll_replace_file() Christian Brauner
0 siblings, 2 replies; 4+ messages in thread
From: aloktiagi @ 2023-05-24 6:39 UTC (permalink / raw)
To: viro, willy, brauner, David.Laight, linux-fsdevel, linux-kernel
Cc: keescook, hch, tycho, aloktiagi
Introduce a mechanism to replace a file linked in the epoll interface with a new
file.
eventpoll_replace() finds all instances of the file to be replaced and replaces
them with the new file and the interested events.
Signed-off-by: aloktiagi <aloktiagi@gmail.com>
---
Changes in v7:
- address review comments on incorrect use of spin_lock.
- cleanup comments and simplify them.
Changes in v6:
- incorporate latest changes that get rid of the global epmutex lock.
Changes in v5:
- address review comments and move the call to replace old file in each
subsystem (epoll, io_uring, etc.) outside the fdtable helpers like
replace_fd().
Changes in v4:
- address review comment to remove the redundant eventpoll_replace() function.
- removed an extra empty line introduced in include/linux/file.h
Changes in v3:
- address review comment and iterate over the file table while holding the
spin_lock(&files->file_lock).
- address review comment and call filp_close() outside the
spin_lock(&files->file_lock).
---
fs/eventpoll.c | 75 +++++++++++++++++++++++++++++++++++++++
include/linux/eventpoll.h | 2 ++
2 files changed, 77 insertions(+)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 980483455cc0..60c14b549918 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -973,6 +973,81 @@ void eventpoll_release_file(struct file *file)
spin_unlock(&file->f_lock);
}
+static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
+ struct file *tfile, int fd, int full_check);
+
+/*
+ * Replace a linked file in the epoll interface with a new file
+ */
+int eventpoll_replace_file(struct file *toreplace, struct file *file, int tfd)
+{
+ struct file *to_remove = toreplace;
+ struct epoll_event event;
+ struct hlist_node *next;
+ struct eventpoll *ep;
+ struct epitem *epi;
+ int error = 0;
+ bool dispose;
+ int fd;
+
+ if (!file_can_poll(file))
+ return 0;
+
+ spin_lock(&toreplace->f_lock);
+ hlist_for_each_entry_safe(epi, next, toreplace->f_ep, fllink) {
+ fd = epi->ffd.fd;
+ event = epi->event;
+ if (fd != tfd) {
+ spin_unlock(&toreplace->f_lock);
+ goto install;
+ }
+ ep = epi->ep;
+ ep_get(ep);
+ spin_unlock(&toreplace->f_lock);
+
+ mutex_lock(&ep->mtx);
+ error = ep_insert(ep, &event, file, fd, 1);
+ mutex_unlock(&ep->mtx);
+ if (error != 0)
+ goto error;
+ WARN_ON_ONCE(ep_refcount_dec_and_test(ep));
+install:
+ spin_lock(&toreplace->f_lock);
+ }
+ spin_unlock(&toreplace->f_lock);
+error:
+ /*
+ * In case of an error remove all instances of the new file in the epoll
+ * interface. If no error, remove all instances of the original file.
+ */
+ if (error != 0)
+ to_remove = file;
+
+remove:
+ spin_lock(&to_remove->f_lock);
+ if (to_remove->f_ep && to_remove->f_ep->first) {
+ epi = hlist_entry(to_remove->f_ep->first, struct epitem, fllink);
+ fd = epi->ffd.fd;
+ if (fd != tfd) {
+ spin_unlock(&to_remove->f_lock);
+ goto remove;
+ }
+ epi->dying = true;
+ spin_unlock(&to_remove->f_lock);
+
+ ep = epi->ep;
+ mutex_lock(&ep->mtx);
+ dispose = __ep_remove(ep, epi, true);
+ mutex_unlock(&ep->mtx);
+
+ if (dispose)
+ ep_free(ep);
+ goto remove;
+ }
+ spin_unlock(&to_remove->f_lock);
+ return error;
+}
+
static int ep_alloc(struct eventpoll **pep)
{
int error;
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 3337745d81bd..f8d52c45a37a 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -25,6 +25,8 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, unsigned long t
/* Used to release the epoll bits inside the "struct file" */
void eventpoll_release_file(struct file *file);
+int eventpoll_replace_file(struct file *toreplace, struct file *file, int tfd);
+
/*
* This is called from inside fs/file_table.c:__fput() to unlink files
* from the eventpoll interface. We need to have this facility to cleanup
--
2.34.1
^ permalink raw reply related [flat|nested] 4+ messages in thread* [RFC v7 2/2] seccomp: replace existing file in the epoll interface by a new file injected by the syscall supervisor. 2023-05-24 6:39 [RFC v7 1/2] epoll: Implement eventpoll_replace_file() aloktiagi @ 2023-05-24 6:39 ` aloktiagi 2023-05-24 17:26 ` [RFC v7 1/2] epoll: Implement eventpoll_replace_file() Christian Brauner 1 sibling, 0 replies; 4+ messages in thread From: aloktiagi @ 2023-05-24 6:39 UTC (permalink / raw) To: viro, willy, brauner, David.Laight, linux-fsdevel, linux-kernel Cc: keescook, hch, tycho, aloktiagi Introduce a mechanism to replace a file linked in the epoll interface by a new file injected by the syscall supervisor by using the epoll provided eventpoll_replace_file() api. Also introduce a new addfd flag SECCOMP_ADDFD_FLAG_REPLACE_REF to allow the supervisor to indicate that it is interested in getting the original file replaced by the new injected file. We have a use case where multiple IPv6 only network namespaces can use a single IPv4 network namespace for IPv4 only egress connectivity by switching their sockets from IPv6 to IPv4 network namespace. This allows for migration of systems to IPv6 only while keeping their connectivity to IPv4 only destinations intact. Today, we achieve this by setting up seccomp filter to intercept network system calls like connect() from a container in a syscall supervisor which runs in an IPv4 only network namespace. The syscall supervisor creates a new IPv4 connection and injects the new file descriptor through SECCOMP_NOTIFY_IOCTL_ADDFD replacing the original file descriptor from the connect() call. This does not work for cases where the original file descriptor is handed off to a system like epoll before the connect() call. After a new file descriptor is injected the original file descriptor being referenced by the epoll fd is not longer valid leading to failures. As a workaround the syscall supervisor when intercepting connect() loops through all open socket file descriptors to check if they are referencing the socket attempting the connect() and replace the reference with the to be injected file descriptor. This workaround is cumbersome and makes the solution prone to similar yet to be discovered issues. The above change will enable us remove the workaround in the syscall supervisor and let the kernel handle the replacement correctly. Signed-off-by: aloktiagi <aloktiagi@gmail.com> --- include/uapi/linux/seccomp.h | 1 + kernel/seccomp.c | 35 +++++- tools/testing/selftests/seccomp/seccomp_bpf.c | 102 ++++++++++++++++++ 3 files changed, 136 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 0fdc6ef02b94..0a74dc5d967f 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -118,6 +118,7 @@ struct seccomp_notif_resp { /* valid flags for seccomp_notif_addfd */ #define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */ #define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */ +#define SECCOMP_ADDFD_FLAG_REPLACE_REF (1UL << 2) /* Update replace references */ /** * struct seccomp_notif_addfd diff --git a/kernel/seccomp.c b/kernel/seccomp.c index d3e584065c7f..e4784b70b9e5 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -19,6 +19,7 @@ #include <linux/audit.h> #include <linux/compat.h> #include <linux/coredump.h> +#include <linux/eventpoll.h> #include <linux/kmemleak.h> #include <linux/nospec.h> #include <linux/prctl.h> @@ -1056,6 +1057,7 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter) static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_knotif *n) { int fd; + struct file *old_file = NULL; /* * Remove the notification, and reset the list pointers, indicating @@ -1064,8 +1066,30 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn list_del_init(&addfd->list); if (!addfd->setfd) fd = receive_fd(addfd->file, addfd->flags); - else + else { + int ret = 0; + if (addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_REPLACE_REF) { + old_file = fget(addfd->fd); + if (!old_file) { + fd = -EBADF; + goto error; + } + ret = eventpoll_replace_file(old_file, addfd->file, addfd->fd); + if (ret < 0) { + fd = ret; + goto error; + } + } fd = receive_fd_replace(addfd->fd, addfd->file, addfd->flags); + /* In case of error restore all references */ + if (fd < 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_REPLACE_REF) { + ret = eventpoll_replace_file(addfd->file, old_file, addfd->fd); + if (ret < 0) { + fd = ret; + } + } + } +error: addfd->ret = fd; if (addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_SEND) { @@ -1080,6 +1104,9 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn } } + if (old_file) + fput(old_file); + /* * Mark the notification as completed. From this point, addfd mem * might be invalidated and we can't safely read it anymore. @@ -1613,12 +1640,16 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter, if (addfd.newfd_flags & ~O_CLOEXEC) return -EINVAL; - if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD | SECCOMP_ADDFD_FLAG_SEND)) + if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD | SECCOMP_ADDFD_FLAG_SEND | + SECCOMP_ADDFD_FLAG_REPLACE_REF)) return -EINVAL; if (addfd.newfd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD)) return -EINVAL; + if (!addfd.newfd && (addfd.flags & SECCOMP_ADDFD_FLAG_REPLACE_REF)) + return -EINVAL; + kaddfd.file = fget(addfd.srcfd); if (!kaddfd.file) return -EBADF; diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c index 43ec36b179dc..0ec8e4f9dff6 100644 --- a/tools/testing/selftests/seccomp/seccomp_bpf.c +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c @@ -47,6 +47,7 @@ #include <linux/kcmp.h> #include <sys/resource.h> #include <sys/capability.h> +#include <sys/epoll.h> #include <unistd.h> #include <sys/syscall.h> @@ -4185,6 +4186,107 @@ TEST(user_notification_addfd) close(memfd); } +TEST(user_notification_addfd_with_epoll_replace) +{ + char c; + pid_t pid; + long ret; + int optval; + socklen_t optlen = sizeof(optval); + int status, listener, fd; + int efd, sfd[4]; + struct epoll_event e; + struct seccomp_notif_addfd addfd = {}; + struct seccomp_notif req = {}; + struct seccomp_notif_resp resp = {}; + + ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); + ASSERT_EQ(0, ret) { + TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!"); + } + + listener = user_notif_syscall(__NR_getsockopt, + SECCOMP_FILTER_FLAG_NEW_LISTENER); + + /* Create two socket pairs sfd[0] <-> sfd[1] and sfd[2] <-> sfd[3] */ + ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[2]), 0); + + pid = fork(); + ASSERT_GE(pid, 0); + + if (pid == 0) { + if (socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[0]) != 0) + exit(1); + + efd = epoll_create(1); + if (efd == -1) + exit(1); + + e.events = EPOLLIN; + if (epoll_ctl(efd, EPOLL_CTL_ADD, sfd[0], &e) != 0) + exit(1); + + /* + * fd will be added here to replace an existing one linked + * in the epoll interface. + */ + if (getsockopt(sfd[0], SOL_SOCKET, SO_DOMAIN, &optval, + &optlen) != USER_NOTIF_MAGIC) + exit(1); + + /* + * Write data to the sfd[3] connected to sfd[2], but due to + * the swap, we should see data on sfd[0] + */ + if (write(sfd[3], "w", 1) != 1) + exit(1); + + if (epoll_wait(efd, &e, 1, 0) != 1) + exit(1); + + if (read(sfd[0], &c, 1) != 1) + exit(1); + + if ('w' != c) + exit(1); + + if (epoll_ctl(efd, EPOLL_CTL_DEL, sfd[0], &e) != 0) + exit(1); + + close(efd); + close(sfd[0]); + close(sfd[1]); + close(sfd[2]); + close(sfd[3]); + exit(0); + } + + ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0); + + addfd.srcfd = sfd[2]; + addfd.newfd = req.data.args[0]; + addfd.id = req.id; + addfd.flags = SECCOMP_ADDFD_FLAG_SETFD | SECCOMP_ADDFD_FLAG_REPLACE_REF; + addfd.newfd_flags = O_CLOEXEC; + + /* + * Verfiy we can install and replace a file that is linked in the + * epoll interface. Replace the socket sfd[0] with sfd[2] + */ + fd = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd); + EXPECT_EQ(fd, req.data.args[0]); + + resp.id = req.id; + resp.error = 0; + resp.val = USER_NOTIF_MAGIC; + EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0); + + /* Wait for child to finish. */ + EXPECT_EQ(waitpid(pid, &status, 0), pid); + EXPECT_EQ(true, WIFEXITED(status)); + EXPECT_EQ(0, WEXITSTATUS(status)); +} + TEST(user_notification_addfd_rlimit) { pid_t pid; -- 2.34.1 ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [RFC v7 1/2] epoll: Implement eventpoll_replace_file() 2023-05-24 6:39 [RFC v7 1/2] epoll: Implement eventpoll_replace_file() aloktiagi 2023-05-24 6:39 ` [RFC v7 2/2] seccomp: replace existing file in the epoll interface by a new file injected by the syscall supervisor aloktiagi @ 2023-05-24 17:26 ` Christian Brauner 2023-05-25 21:55 ` Alok Tiagi 1 sibling, 1 reply; 4+ messages in thread From: Christian Brauner @ 2023-05-24 17:26 UTC (permalink / raw) To: aloktiagi Cc: viro, willy, David.Laight, linux-fsdevel, linux-kernel, keescook, hch, tycho On Wed, May 24, 2023 at 06:39:32AM +0000, aloktiagi wrote: > Introduce a mechanism to replace a file linked in the epoll interface with a new > file. > > eventpoll_replace() finds all instances of the file to be replaced and replaces > them with the new file and the interested events.h I've spent a bit more time on this and I have a few more questions/thoughts. * What if the seccomp notifier replaces a pollable file with a non-pollable file? Right now you check that the file is pollable and if it isn't you're not updating the file references in the epoll instance for the file descriptor you updated. Why these semantics and not e.g., removing all instances of that file referring to the updated fd? * What if the seccomp notifier replaces the file of a file descriptor with an epoll file descriptor? If the fd and original file are present in an epoll instance does that mean you add the epoll file into all epoll instances? That also means you create nested epoll instances which are supported but are subject to various limitations. What's the plan? * What if you have two threads in the same threadgroup that each have a seccomp listener profile attached to them. Both have the same fd open. Now both replace the same fd concurrently. Both threads concurrently update references in the epoll instances now since the spinlock and mutex are acquired and reacquired again. Afaict, you can end up with some instances of the fd temporarily generating events for file1 and other instances generating events for file2 while the replace is in progress. Thus generating spurious events and userspace might be acting on a file descriptor that doesn't yet refer to the new file? That's possibly dangerous. Maybe I'm mistaken but if so I'd like to hear the details why that can't happen. Thinking about it what if the same file is registered via multiple fds at the same time? Can't you end up in a scenario where you have the same fd referring to different files in one or multiple epoll instance? I mean, you can get into that situation via dup2() where you change the file descriptor to refer to a different file but the fd might still be registered in the epoll instance referring to the old file provided there's another fd open holding the old file alive. The difference though is that userspace must've been dumb enough to actually do that whereas now this can just happen behind their back misleading them. Honestly, the kernel can't give you any atomicity in replacing these references and if so it would require new possibly invasive locking that would very likely not be acceptable upstream just for the sake of this feature. I still have a very hard time seeing any of this happening. * I haven't looked at the codepath that tries to restore the old file on failure. That might introduce even more weirdness. ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC v7 1/2] epoll: Implement eventpoll_replace_file() 2023-05-24 17:26 ` [RFC v7 1/2] epoll: Implement eventpoll_replace_file() Christian Brauner @ 2023-05-25 21:55 ` Alok Tiagi 0 siblings, 0 replies; 4+ messages in thread From: Alok Tiagi @ 2023-05-25 21:55 UTC (permalink / raw) To: Christian Brauner Cc: viro, willy, David.Laight, linux-fsdevel, linux-kernel, keescook, hch, tycho On Wed, May 24, 2023 at 07:26:24PM +0200, Christian Brauner wrote: > On Wed, May 24, 2023 at 06:39:32AM +0000, aloktiagi wrote: > > Introduce a mechanism to replace a file linked in the epoll interface with a new > > file. > > > > eventpoll_replace() finds all instances of the file to be replaced and replaces > > them with the new file and the interested events.h > > I've spent a bit more time on this and I have a few more > questions/thoughts. > > * What if the seccomp notifier replaces a pollable file with a > non-pollable file? Right now you check that the file is pollable and > if it isn't you're not updating the file references in the epoll > instance for the file descriptor you updated. Why these semantics and > not e.g., removing all instances of that file referring to the updated > fd? > good question. the current implementation relies on __fput() calling eventpoll_release() to ultimately release the file. eventpoll_replace_file() only removes the file if it can successfully install the new file in epoll. > * What if the seccomp notifier replaces the file of a file descriptor > with an epoll file descriptor? If the fd and original file are present > in an epoll instance does that mean you add the epoll file into all > epoll instances? That also means you create nested epoll instances > which are supported but are subject to various limitations. What's the > plan? > My plan was to allow these cases since there is support for nested epoll instances. But thinking more on this, since seccomp subsystem is the only caller of eventpoll_replace_file(), I am not sure whether there is a valid use case where seccomp is used to intercept a system call that uses an epoll fd as a parameter. Maybe its ok to not do the replacement for such cases. Thoughts? > * What if you have two threads in the same threadgroup that each have a > seccomp listener profile attached to them. Both have the same fd open. > > Now both replace the same fd concurrently. Both threads concurrently > update references in the epoll instances now since the spinlock and > mutex are acquired and reacquired again. Afaict, you can end up with > some instances of the fd temporarily generating events for file1 and > other instances generating events for file2 while the replace is in > progress. Thus generating spurious events and userspace might be > acting on a file descriptor that doesn't yet refer to the new file? > That's possibly dangerous. > > Maybe I'm mistaken but if so I'd like to hear the details why that > can't happen. > Considering file1 is the original file and file2 is the new file. First the eventpoll_replace_file() is called before receive_fd_replace(), so the file1 is still active and file2 would not receive any events. Within receive_fd_replace() the install phase first installs file2 alongside file1 without removing file1. So during this phase the userspace can continue to receive events on file1. In the remove phase within eventpoll_replace_file() the epi for file1 is set to dying and replaced with file2. At this point the fd should see no new events, since receive_fd_replace() is yet to be called. > Thinking about it what if the same file is registered via multiple fds > at the same time? Can't you end up in a scenario where you have the > same fd referring to different files in one or multiple epoll > instance? > > I mean, you can get into that situation via dup2() where you change > the file descriptor to refer to a different file but the fd might > still be registered in the epoll instance referring to the old file > provided there's another fd open holding the old file alive. > The current implementation scopes the replacement to the fd being replaced in the call to receive_fd_replace() since thats what the userspace intends to do. In case there are multiple fds pointing to the same file, and receive_fd_replace() replaces only one of them, we would end up updating the file for only one of the fds. The other fd will see the same result as seen today without this patch, where the replaced file doesn't exist in epoll since it got cleared due to __fput(). > The difference though is that userspace must've been dumb enough to > actually do that whereas now this can just happen behind their back > misleading them. > Since this is mainly serving the seccomp add fd usecase today, do you think its something that can be documented as a limitation? I am not aware of the interesting ways users are using seccomp add fd to think of all the possible scenarios, so I am open to suggestions. > Honestly, the kernel can't give you any atomicity in replacing these > references and if so it would require new possibly invasive locking > that would very likely not be acceptable upstream just for the sake of > this feature. I still have a very hard time seeing any of this > happening. > > * I haven't looked at the codepath that tries to restore the old file on > failure. That might introduce even more weirdness. ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2023-05-25 21:55 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-05-24 6:39 [RFC v7 1/2] epoll: Implement eventpoll_replace_file() aloktiagi 2023-05-24 6:39 ` [RFC v7 2/2] seccomp: replace existing file in the epoll interface by a new file injected by the syscall supervisor aloktiagi 2023-05-24 17:26 ` [RFC v7 1/2] epoll: Implement eventpoll_replace_file() Christian Brauner 2023-05-25 21:55 ` Alok Tiagi
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).