All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Arnd Bergmann <arnd@arndb.de>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
	syzbot <syzbot+8ab2d0f39fb79fe6ca40@syzkaller.appspotmail.com>
Subject: Re: [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory.
Date: Sat, 24 Aug 2019 22:22:24 +0200	[thread overview]
Message-ID: <20190824202224.GA5286@gmail.com> (raw)
In-Reply-To: <CAHk-=whFQNkqPJ5zA1xAyvgtCPLN2C4xeJ181rU3k6bG+2zugg@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Aug 24, 2019 at 9:14 AM Ingo Molnar <mingo@kernel.org> wrote:
> >
> > What I noticed is that while reading regular RAM is reasonably fast even
> > in very large chunks, accesses become very slow once they hit iomem - and
> > delays even longer than the reported 145 seconds become possible if
> > bs=100M is increased to say bs=1000M.
> 
> Ok, that makes sense.
> 
> So for IOMEM, we actually have two independent issues:
> 
>  (a) for normal unmapped IOMEM (ie no device), which is probably the
> case your test triggers anyway, it quite possibly ends up having ISA
> timings (ie around 1us per read)

That makes sense: I measured 17 seconds per 100 MB of data, which is is 
0.16 usecs per byte. The instruction used by 
copy_user_enhanced_fast_string() is REP MOVSB - which supposedly goes as 
high as cacheline size accesses - but perhaps those get broken down for 
physical memory that has no device claiming it?

>  (b) we use "probe_kernel_read()" to a temporary buffer avoid page
> faults, which uses __copy_from_user_inatomic(). So far so good. But on
> modern x86, because we treat it as just a regular memcpy, we probably
> have ERMS and do "rep movsb".
> 
> So (a) means that each access is slow to begin with, and then (b)
> means that we don't use "copy_from_io()" but a slow byte-by-byte
> access.
> 
> > With Tetsuo's approach the delays are fixed, because the fatal signal is
> > noticed within the 4K chunks of read_mem() immediately:
> 
> Yeah. that's the size of the temp buffer.
> 
> > Note how the first 100MB chunk took 17 seconds, the second chunk was
> > interrupted within ~2 seconds after I sent a SIGKILL.
> 
> It looks closer to 1 second from that trace, but that actually is
> still within the basic noise above: a 4kB buffer being copied one byte
> at a time would take about 4s, but maybe it's not *quite* as bad as
> traditional ISA PIO timings.

Yeah - and note that I phrased it imprecisely: the real in-kernel signal 
interruption delay was probably below 1 msec: in my test the SIGKILL was 
manually triggered, with an about 1 second delay caused by the human 
brain, not by the kernel. ;-)

The in-kernel interruption is almost instantaneous - the 4K max chunking 
is good I think.

> > Except that I think the regular pattern here is not 'break' but 'goto
> > failed' - 'break' would just result in a partial read and 'err' gets
> > ignored.
> 
> That's actually the correct signal handling pattern: either a partial
> read, or -EINTR.
> 
> In the case of SIGKILL, the return value doesn't matter, of course,
> but *if* we were to decide that we can make it interruptible, then it
> would.

Yeah. So while error cases and signals are not equivalent, I also went by 
existing precedent within read_kmem(): the 2 other error cases return an 
error (-ENXIO and -EFAULT), while they could also conceivably return a 
partial read of the previously completed read and only return an error if 
the *first* chunk generates an error without making any progress?

Interruption is arguably *not* an 'error', so preserving partial reads 
sounds like the higher quality solution, nevertheless one could argue 
that particual read *could* be returned by read_kmem() if progress was 
made.

> > I also agree that we should probably allow regular interrupts to
> > interrupt /dev/mem accesses - while the 'dd' process is killable, it's
> > not Ctrl-C-able.
> 
> I'm a bit nervous about it, but there probably aren't all that many
> actual /dev/mem users.
> 
> The main ones I can see are things like "dmidecode" etc.
> 
> > If I change the fatal_signal_pending() to signal_pending() it all behaves
> > much better IMHO - assuming that no utility learned rely on
> > non-interruptibility ...
> 
> So I cloned the dmidecode git tree, and it does "myread()" on the
> /dev/mem file as far as I can tell. And that one correctly hanles
> partial reads.
> 
> It still makes me a bit nervous, but just plain "signal_pending()" is
> not just nicer, it's technically the right thing to do (it's not a
> regular file, so partial reads are permissible, and other files like
> it - eg /dev/zero - already does exactly that).
> 
> I also wonder if we should just not use that crazy
> "probe_kernel_read()" to begin with. The /dev/mem code uses it to
> avoid faults - and that's the intent of it - but it's not meant for
> IO-memory at all.
> 
> But "read()" on /dev/mem does that off "xlate_dev_mem_ptr()", which is
> a bastardized "kernel address or ioremap" thing. It works, but it's
> all kinds of stupid. We'd be a *lot* better off using kmap(), I think.

Hm, I think xlate_dev_mem_ptr() and thus ioremap() would also perform a 
cache aliasing conflict check - which kmap() wouldn't do?

I.e. if for example an iomem area is already mapped by a driver with some 
conflicting cache attribute, xlate_dev_mem_ptr() AFAICS will not 
ioremap_cache() it to cached? IIRC some CPUs would triple fault or 
completely misbehave on certain cache attribute conflicts.

This check in mremap() might also trigger:

        if (is_ram == REGION_MIXED) {
                WARN_ONCE(1, "memremap attempted on mixed range %pa size: %#lx\n",
                                &offset, (unsigned long) size);
                return NULL;
        }

So I'd say xlate_dev_mem_ptr() looks messy, but is possibly a bit more 
robust in this regard?

> So my gut feel is that this is something worth trying to do bigger and
> more fundamental changes to.
> 
> But changing it to use "signal_pending()" at least as a trial looks
> ok. The only user *should* be things like dmidecode that apparently
> already do the right thing.
> 
> And if we changed the bounce buffering to do things at least a 32-bit
> word at a time, that would help latency for IO by a factor of 4.
> 
> And if the signal_pending() is at the end of the copy, then we'd
> guarantee that at least _small_ reads still work the way they did
> before, so people who do things like "lspci()" with direct hardware
> accesses wouldn't be impacted - if they exist.

Yeah.

> So I'd be willing to try that (and then if somebody reports a
> regression we can make it use "fatal_signal_pending()" instead)

Ok, will post a changelogged patch (unless Tetsuo beats me to it?).

Thanks,

	Ingo

  reply	other threads:[~2019-08-24 20:22 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-20 22:06 [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory Tetsuo Handa
2019-08-20 22:24 ` Greg Kroah-Hartman
2019-08-21  0:07   ` Tetsuo Handa
2019-08-22  9:59   ` Tetsuo Handa
2019-08-22 13:35     ` Greg Kroah-Hartman
2019-08-22 14:00       ` Tetsuo Handa
2019-08-22 16:42         ` Greg Kroah-Hartman
2019-08-22 17:11           ` Dmitry Vyukov
2019-08-22 21:17             ` Tetsuo Handa
2019-08-22 23:59               ` Dmitry Vyukov
2019-08-23  8:17                 ` Tetsuo Handa
2019-08-23 16:47                   ` Dmitry Vyukov
2019-10-08  9:57                   ` Kernel config for fuzz testing Tetsuo Handa
2019-08-22 21:29 ` [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory Linus Torvalds
2019-08-22 22:08 ` Linus Torvalds
2019-08-23  9:16   ` Ingo Molnar
2019-08-23 16:39     ` Linus Torvalds
2019-08-24 16:14       ` Ingo Molnar
2019-08-24 17:40         ` Linus Torvalds
2019-08-24 20:22           ` Ingo Molnar [this message]
2019-08-24 20:56             ` Linus Torvalds
2019-08-30  9:56               ` David Laight
2019-08-25  5:49             ` Tetsuo Handa
2019-08-25  9:59               ` Ingo Molnar
2019-08-25 10:35                 ` Tetsuo Handa
2019-08-25 10:48                   ` Ingo Molnar
2019-08-25 16:54               ` Linus Torvalds
2019-08-26 10:40                 ` Tetsuo Handa
2019-08-26 11:19                   ` Ingo Molnar
2019-08-26 15:02                     ` Rewriting read_kmem()/write_kmem() ? Tetsuo Handa
2019-08-23 11:46   ` [PATCH] /dev/mem: Bail out upon SIGKILL when reading memory Tetsuo Handa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190824202224.GA5286@gmail.com \
    --to=mingo@kernel.org \
    --cc=arnd@arndb.de \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    --cc=syzbot+8ab2d0f39fb79fe6ca40@syzkaller.appspotmail.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.