* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection [not found] ` <CAHk-=wjahcAqLYm0ijcAVcPcQAz-UUuJ3Ubx4GzP_SJAupf=qQ@mail.gmail.com> @ 2023-05-25 7:01 ` Christian Brauner [not found] ` <CAHk-=wgKu=tJf1bm_dtme4Hde4zTB=_7EdgR8avsDRK4_jD+uA@mail.gmail.com> 1 sibling, 0 replies; 6+ messages in thread From: Christian Brauner @ 2023-05-25 7:01 UTC (permalink / raw) To: Linus Torvalds, Luis Chamberlain Cc: david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song, lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, linux-fsdevel, Alexander Viro On Wed, May 24, 2023 at 02:52:50PM -0700, Linus Torvalds wrote: > On Wed, May 24, 2023 at 2:36 PM Luis Chamberlain <mcgrof@kernel.org> wrote: > > > > Add support for a new call which allows to detect duplicate requests > > for each inode passed. > > No. > > This is all disgusting. > > Stop adding horrific code for some made-up load that isn't real. Also, this series adds non-trivial amount of code to fs/ without ever having Cced fsdevel. As I told the bpf folks already if fs/ code is touched fsdevel should absolutely be Cced, please. It's literally also the first thing that get_maintainers.pl spews out. ^ permalink raw reply [flat|nested] 6+ messages in thread
[parent not found: <CAHk-=wgKu=tJf1bm_dtme4Hde4zTB=_7EdgR8avsDRK4_jD+uA@mail.gmail.com>]
* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection [not found] ` <CAHk-=wgKu=tJf1bm_dtme4Hde4zTB=_7EdgR8avsDRK4_jD+uA@mail.gmail.com> @ 2023-05-25 18:08 ` Luis Chamberlain 2023-05-25 18:35 ` Luis Chamberlain 2023-05-25 18:50 ` Linus Torvalds 0 siblings, 2 replies; 6+ messages in thread From: Luis Chamberlain @ 2023-05-25 18:08 UTC (permalink / raw) To: Linus Torvalds, Linux FS Devel, hch, brauner, david Cc: tglx, patches, linux-modules, linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song, lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu + fsdevel please review, On Wed, May 24, 2023 at 09:00:02PM -0700, Linus Torvalds wrote: > On Wed, May 24, 2023 at 2:52 PM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > This is all disgusting. > > Bringing back the original thread, because I just sent an alternate > patch to Luis to test. > > That one is also disgusting, but for different reasons: it needs some > polish if it works. It's a very simple patch, in that it just extends > our existing i_writecount and ETXTBSY logic to also have a "exclusive" > mode, and says that we do the module file reading in that exclusive > mode (so if/when udev in its incompetence tries to load the same > module X number of times at the same time, only one will read at a > time). Indeed, this is the sort of gem I was hoping we could acomplish. > The disgusting part is mainly the hacky test for "id == > READING_MODULE", and it would probably be better with some kind of > "exclusive flag" field for general use, but right now READING_MODULE > is basically that one user. > > Luis having explained _why_ we'd want this (and honestly, it took a > couple of tries), I can only say that udev is horribly broken, and > this most definitely should be fixed in user mode. udev randomly > loading the same module multiple times just because it gets confused > is not ok. At this point it would be good for for someone on the udev camp to at least to *try*. If the problem is the fork on udev, maybe the shmem could be used to share the module context to prevent duplicates. > Any udev developer that goes "we can't fix it in user space" should be > ashamed of themselves. Really? Just randomly doing the same thing in > parallel and expecting the kernel to sort out your mess? What a crock. > > But this *might* mitigate that udev horror. And not introduce any new > kernel-side horror, just a slight extension of our existing writer > exclusion logic to allow "full exclusive access". Yes, that expresses what is needed well and is simple enough. > (Note: it's not actually excluding other purely regular readers - but > it *is* excluding not just writers, but also other "special readers" > that want to exclude writers) > > I'd like to point out that this patch really is completely untested. > It built for me, but that's all the testing it has gotten. It's > _small_. Tiny, even. But that "id == READING_MODULE" thing really is > pretty disgusting and I feel this needs more thought. > fs/kernel_read_file.c | 6 +++++- > include/linux/fs.h | 6 ++++++ > 2 files changed, 11 insertions(+), 1 deletion(-) > > diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c > index 5d826274570c..ff3e894f8cd4 100644 > --- a/fs/kernel_read_file.c > +++ b/fs/kernel_read_file.c > @@ -48,7 +48,11 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf, > if (!S_ISREG(file_inode(file)->i_mode)) > return -EINVAL; > > - ret = deny_write_access(file); > + /* Module reading wants *exclusive* access to the file */ > + if (id == READING_MODULE) > + ret = exclusive_deny_write_access(file); > + else > + ret = deny_write_access(file); > if (ret) > return ret; > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 21a981680856..722b42a77d51 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2566,6 +2566,12 @@ static inline int deny_write_access(struct file *file) > struct inode *inode = file_inode(file); > return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY; > } > +static inline int exclusive_deny_write_access(struct file *file) > +{ > + int old = 0; > + struct inode *inode = file_inode(file); > + return atomic_try_cmpxchg(&inode->i_writecount, &old, -1) ? 0 : -ETXTBSY; > +} > static inline void put_write_access(struct inode * inode) > { > atomic_dec(&inode->i_writecount); Certainly on the track where I wish we could go. Now this goes tested. On 255 cores: Before: vagrant@kmod ~ $ sudo systemd-analyze Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s graphical.target reached after 44.178s in userspace. root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats Virtual mem wasted bytes 1949006968 ; 1949006968/1024/1024/1024 ~1.81515418738126754761 So ~1.8 GiB... of vmalloc space wasted during boot. After: systemd-analyze Startup finished in 24.438s (kernel) + 41.278s (userspace) = 1min 5.717s graphical.target reached after 41.154s in userspace. root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats Virtual mem wasted bytes 354413398 So still 337.99 MiB of vmalloc space wasted during boot due to duplicates. The reason is the exclusive_deny_write_access() must be kept during the life of the module otherwise as soon as it is done others can still race to load after and fail later after it sees the module is already loaded. It sounds crazy to think that such races exist because that means userspace didn't see the module loaded yet and still tried finit_module() but the stats reveal that's the case. So with two other hunks added (2nd and 4th), this now matches parity with my patch, not suggesting this is right, just demonstrating how this could be resolved with this. We could also just have a helper which lets the module code allow_write_access() at the end of its use of the fd (failure to load or module is removed). diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c index 5d826274570c..ff5b338a288b 100644 --- a/fs/kernel_read_file.c +++ b/fs/kernel_read_file.c @@ -48,7 +48,11 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf, if (!S_ISREG(file_inode(file)->i_mode)) return -EINVAL; - ret = deny_write_access(file); + /* Module reading wants *exclusive* access to the file */ + if (id == READING_MODULE) + ret = exclusive_deny_write_access(file); + else + ret = deny_write_access(file); if (ret) return ret; @@ -119,7 +123,8 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf, } out: - allow_write_access(file); + if (id != READING_MODULE) + allow_write_access(file); return ret == 0 ? copied : ret; } EXPORT_SYMBOL_GPL(kernel_read_file); diff --git a/include/linux/fs.h b/include/linux/fs.h index 21a981680856..722b42a77d51 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2566,6 +2566,12 @@ static inline int deny_write_access(struct file *file) struct inode *inode = file_inode(file); return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY; } +static inline int exclusive_deny_write_access(struct file *file) +{ + int old = 0; + struct inode *inode = file_inode(file); + return atomic_try_cmpxchg(&inode->i_writecount, &old, -1) ? 0 : -ETXTBSY; +} static inline void put_write_access(struct inode * inode) { atomic_dec(&inode->i_writecount); diff --git a/kernel/module/main.c b/kernel/module/main.c index 044aa2c9e3cb..88aaada929b1 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -3078,8 +3079,10 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags) len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL, READING_MODULE); if (len < 0) { - mod_stat_inc(&failed_kreads); - mod_stat_add_long(len, &invalid_kread_bytes); + if (len != -ETXTBSY) { + mod_stat_inc(&failed_kreads); + mod_stat_add_long(len, &invalid_kread_bytes); + } return len; } ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection 2023-05-25 18:08 ` Luis Chamberlain @ 2023-05-25 18:35 ` Luis Chamberlain 2023-05-25 18:50 ` Linus Torvalds 1 sibling, 0 replies; 6+ messages in thread From: Luis Chamberlain @ 2023-05-25 18:35 UTC (permalink / raw) To: Linus Torvalds, Linux FS Devel, hch, brauner, david Cc: tglx, patches, linux-modules, linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song, lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu On Thu, May 25, 2023 at 11:08:13AM -0700, Luis Chamberlain wrote: > + fsdevel please review, > So with two other hunks added (2nd and 4th), this now matches parity with > my patch, not suggesting this is right, just demonstrating how this > could be resolved with this. We could also just have a helper which lets > the module code allow_write_access() at the end of its use of the fd > (failure to load or module is removed). This even fixes the pathological case with stress-ng for finit_module: ./stress-ng --module 8192 --module-name xfs (stress-ng assumes you have all dependencies already loaded and the module is not loaded, it uses finit_module() directly) Luis ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection 2023-05-25 18:08 ` Luis Chamberlain 2023-05-25 18:35 ` Luis Chamberlain @ 2023-05-25 18:50 ` Linus Torvalds 2023-05-25 19:32 ` Luis Chamberlain 1 sibling, 1 reply; 6+ messages in thread From: Linus Torvalds @ 2023-05-25 18:50 UTC (permalink / raw) To: Luis Chamberlain Cc: Linux FS Devel, hch, brauner, david, tglx, patches, linux-modules, linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song, lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu On Thu, May 25, 2023 at 11:08 AM Luis Chamberlain <mcgrof@kernel.org> wrote: > > Certainly on the track where I wish we could go. Now this goes tested. > On 255 cores: > > Before: > > vagrant@kmod ~ $ sudo systemd-analyze > Startup finished in 41.653s (kernel) + 44.305s (userspace) = 1min 25.958s > graphical.target reached after 44.178s in userspace. > > root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats > Virtual mem wasted bytes 1949006968 > > > ; 1949006968/1024/1024/1024 > ~1.81515418738126754761 > > So ~1.8 GiB... of vmalloc space wasted during boot. > > After: > > systemd-analyze > Startup finished in 24.438s (kernel) + 41.278s (userspace) = 1min 5.717s > graphical.target reached after 41.154s in userspace. > > root@kmod ~ # grep "Virtual mem wasted bytes" /sys/kernel/debug/modules/stats > Virtual mem wasted bytes 354413398 > > So still 337.99 MiB of vmalloc space wasted during boot due to > duplicates. Ok. I think this will count as 'good enough for mitigation purposes' > The reason is the exclusive_deny_write_access() must be > kept during the life of the module otherwise as soon as it is done > others can still race to load Yes. The exclusion only applies while the file is actively being read. > So with two other hunks added (2nd and 4th), this now matches parity with > my patch, not suggesting this is right, Yeah, we can't do that, because user space may quite validly want to write the file afterwards. Or, in fact, unload the module and re-load it. So the "exclusion" really needs to be purely temporary. That said, I considered moving the exclusion to module/main.c itself, rather than the reading part. That wouild get rid of the hacky "id == READING_MODULE", and put the exclusion in the place that actually wants it. And that would allow us to at least extend that temporary exlusion a bit - we could keep it until the module has actually been loaded and inited. So it would probably improve on those numbers a bit more, but you'd still have the fundamental race where *serial* duplicates end up always wasting CPU effort and temporary vmalloc space. Linus ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection 2023-05-25 18:50 ` Linus Torvalds @ 2023-05-25 19:32 ` Luis Chamberlain 0 siblings, 0 replies; 6+ messages in thread From: Luis Chamberlain @ 2023-05-25 19:32 UTC (permalink / raw) To: Linus Torvalds Cc: Linux FS Devel, hch, brauner, david, tglx, patches, linux-modules, linux-mm, linux-kernel, pmladek, petr.pavlu, prarit, lennart, gregkh, rafael, song, lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu On Thu, May 25, 2023 at 11:50 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > So it would probably improve on those numbers a bit more, but you'd > still have the fundamental race where *serial* duplicates end up > always wasting CPU effort and temporary vmalloc space. The known failed boots are with KASAN with a large number of CPUs, so the value in the mitigation would be to help those boot until userspace fixes it and we have enough time for propagation. But since it is not a full proof solution, it may seem like an odd thing to have in place later and this being lost as odd tribal knowledge. I'd be in favor of only applying the mitigation if we really are chasing userspace to fix this, and we'd be OK in later removing it after userspace gets this fixed / propagated. If we're going to have userspace fix this, who is volunteering? Luis ^ permalink raw reply [flat|nested] 6+ messages in thread
[parent not found: <20230524213620.3509138-3-mcgrof@kernel.org>]
[parent not found: <8fc5b26b-d2f6-0c8f-34a1-af085dbef155@suse.com>]
[parent not found: <CAHk-=wiPjcPL_50WRWOi-Fmi9TYO6yp_oj63a_N84FzG-rxGKQ@mail.gmail.com>]
[parent not found: <2023052518-unable-mortician-4365@gregkh>]
* Re: [PATCH 2/2] module: add support to avoid duplicates early on load [not found] ` <2023052518-unable-mortician-4365@gregkh> @ 2023-05-25 18:22 ` Luis Chamberlain 0 siblings, 0 replies; 6+ messages in thread From: Luis Chamberlain @ 2023-05-25 18:22 UTC (permalink / raw) To: Greg KH, Linux FS Devel Cc: Linus Torvalds, Petr Pavlu, rafael, song, lucas.de.marchi, lucas.demarchi, christophe.leroy, peterz, rppt, dave, willy, vbabka, mhocko, dave.hansen, colin.i.king, jim.cromie, catalin.marinas, jbaron, rick.p.edgecombe, yujie.liu, david, tglx, hch, patches, linux-modules, linux-mm, linux-kernel, pmladek, prarit, lennart On Thu, May 25, 2023 at 05:42:10PM +0100, Greg KH wrote: > Luis, I asked last time what modules are being asked by the kernel to be > loaded thousands of times at boot and can't seem to find an answer > anywhere, did I miss that? Yes you missed it, I had explained it: https://lore.kernel.org/all/ZEGopJ8VAYnE7LQ2@bombadil.infradead.org/ "My best assessment of the situation is that each CPU in udev ends up triggering a load of duplicate set of modules, not just one, but *a lot*. Not sure what heuristics udev uses to load a set of modules per CPU." Petr Pavlu then finishes the assessment: https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/ But let me quote it, so it is not missed: "My understanding is that udev workers are forked. An initial kmod context is created by the main udevd process but no sharing happens after the fork. It means that the mentioned memory pool logic doesn't really kick in. Multiple parallel load requests come from multiple udev workers, for instance, each handling an udev event for one CPU device and making the exactly same requests as all others are doing at the same time. The optimization idea would be to recognize these duplicate requests at the udevd/kmod level and converge them." > This should be very easy to handle in > userspace if systems need it, so that begs the questions, what types of > systems need this? I had explained, this has existed for a long time. > We have handled booting with tens of thousands of > devices attached for decades now with no reports of boot/udev/kmod > issues before, what has recently changed to cause issues? Doesn't mean this didn't happen before, just because memory is freed due to duplicates does not mean that the memory pressure induced by them is not stupid. It is stupid, but hasn't come up as a possible real issue nowadays where systems require more vmalloc space used during boot with new features. I had explained also the context where this came from. David Hildenbrand had reported failure to boot on many CPUs. If you induce more vmap memory pressure on boot with multiple CPUs eventually you can't boot. Enabling KASAN will make this worse today. Luis ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2023-05-25 19:35 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20230524213620.3509138-1-mcgrof@kernel.org>
[not found] ` <20230524213620.3509138-2-mcgrof@kernel.org>
[not found] ` <CAHk-=wjahcAqLYm0ijcAVcPcQAz-UUuJ3Ubx4GzP_SJAupf=qQ@mail.gmail.com>
2023-05-25 7:01 ` [PATCH 1/2] fs/kernel_read_file: add support for duplicate detection Christian Brauner
[not found] ` <CAHk-=wgKu=tJf1bm_dtme4Hde4zTB=_7EdgR8avsDRK4_jD+uA@mail.gmail.com>
2023-05-25 18:08 ` Luis Chamberlain
2023-05-25 18:35 ` Luis Chamberlain
2023-05-25 18:50 ` Linus Torvalds
2023-05-25 19:32 ` Luis Chamberlain
[not found] ` <20230524213620.3509138-3-mcgrof@kernel.org>
[not found] ` <8fc5b26b-d2f6-0c8f-34a1-af085dbef155@suse.com>
[not found] ` <CAHk-=wiPjcPL_50WRWOi-Fmi9TYO6yp_oj63a_N84FzG-rxGKQ@mail.gmail.com>
[not found] ` <2023052518-unable-mortician-4365@gregkh>
2023-05-25 18:22 ` [PATCH 2/2] module: add support to avoid duplicates early on load Luis Chamberlain
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).