Re: [PATCH] x86_64: fix delayed signals

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Török Edwin" <edwintorok@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>, Roland McGrath <roland@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Elias Oltmanns <eo@nebensachen.de>,
	Arjan van de Ven <arjan@infradead.org>,
	Oleg Nesterov <oleg@tv-sign.ru>
Subject: Re: [PATCH] x86_64: fix delayed signals
Date: Sat, 12 Jul 2008 23:26:59 +0300	[thread overview]
Message-ID: <48791393.1020107@gmail.com> (raw)
In-Reply-To: <alpine.LFD.1.10.0807121011470.2875@woody.linux-foundation.org>

On 2008-07-12 20:29, Linus Torvalds wrote:
> On Sat, 12 Jul 2008, Török Edwin wrote:
>   
>> On my 32-bit box (slow disks, SMP, XFS filesystem) 2.6.26-rc9 behaves
>> the same as 2.6.26-rc8, I can reliably reproduce a 2-3 second latency
>> [1] between pressing ^C the first time, and the shell returning (on the
>> text console too).
>> Using ftrace available from tip/master, I see up to 3 seconds of delay
>> between kill_pgrp and detach_pid (and during that time I can press ^C
>> again, leading to 2-3 kill_pgrp calls)
>>     
>
> The thing is, it's important to see what happens in between. 
>
> In particular, 2-3 second latencies can be entirely _normal_ (although 
> obviously very annoying) with most log-based filesystems when they decide 
> they have to flush the log.

A bit off-topic, but something I noticed during the tests:
In my original test I have rm-ed the files right after launching dd in
the background, yet it still continued to write to the disk.
I can understand that if the file is opened O_RDWR, you might seek back
and read what you wrote, so Linux needs to actually do the write,
but why does it insist on writing to the disk, on a file opened with
O_WRONLY, after the file itself got unlinked?

>  A lot of filesystems are not designed for 
> latency - every single filesystem test I have ever seen has always been 
> either a throughput test, or a "average random-seek latency" kind of test.
>
> The exact behavior will depend on the filesystem, for example. It will 
> also easily depend on things like whether you update 'atime' or not. Many 
> ostensibly read-only loads end up writing some data, especially inode 
> atimes, and that's when they can get caught up in having to wait for a log 
> to flush (to make up for that atime thing).
>   

I have my filesystems mounted as noatime already.
But yes, I am using different filesystems, the x86-64 box has reiserfs,
and the x86-32 box has xfs.

> You can try to limit the amount of dirty data in flight by tweaking 
> /proc/sys/vm/dirty*ratio

I have these in my /etc/rc.local:
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 >/proc/sys/vm/dirty_ratio


> , but from a latency standpoint the thing that 
> actually matters more is often not the amount of dirty data, but the size 
> of the requests queues - because you often care about read latency, but if 
> you have big requests and especially if you have a load that does lots of 
> big _contiguous_ writes (like your 'dd's would do), then what can easily 
> happen is that the read ends up being behind a really big write in the 
> request queues. 
>   

I will try tweaking these (and /sys/block/sda/queue/nr_requests that
Arjan suggested).

> And 2-3 second latencies by no means means that each individual IO is 2-3 
> seconds long. No - it just means that you ended up having to do multiple 
> reads synchronously, and since the reads depended on each other (think a 
> pathname lookup - reading each directory entry -> inode -> data), you can 
> easily have a single system call causing 5-10 reads (bad cases are _much_ 
> more, but 5-10 are perfectly normal for even well-behaved things), and now 
> if each of those reads end up being behind a fairly big write...
>   

I'll try blktrace tomorrow, it should tell me when I/Os are queued /
completed.

>   
>> On my 64-bit box (2 disks in raid-0, UP, reiserfs filesystem) 2.6.25 and
>> 2.6.26-rc9 behave the same, and most of the time (10-20 times in a row)
>> find responds to ^C instantly.
>>
>> However in _some_ cases find doesn't respond to ^C for a very long time 
>> (~30 seconds), and when this happens I can't do anything else but switch 
>> consoles, starting another process (latencytop -d) hangs, and so does 
>> any other external command.
>>     
>
> Ok, that is definitel not related to signals at all. You're simply stuck 
> waiting for IO - or perhaps some fundamental filesystem semaphore which is 
> held while some IO needs to be flushed.

AFAICT reiserfs still uses the BKL, could that explain why one I/O
delays another?

>  That's why _unrelated_ processes 
> hang: they're all waiting for a global resource.
>
> And it may be worse on your other box for any number of reasons: raid 
> means, for example, that you have two different levels of queueing, and 
> thus effectively your queues are longer. And while raid-0 is better for 
> throughput, it's not necessarily at all better for latency. The filesystem 
> also makes a difference, as does the amount of dirty data under write-back 
> (do you also have more memory in your x86-64 box, for example? That makes 
> the kernel do bigger writeback buffers by default)
>
>   

Yes, I have more memory on x86-64 (2G), and x86-32 has 1G.
>> I haven't yet tried ftrace on this box, and neither did I try Roland's
>> patch yet. I will try that now, and hopefuly come back with some numbers
>> shortly.
>>     
>
> Trust me, roland's patch will make no difference what-so-ever. It's purely 
> a per-thread thing, and your behaviour is clearly not per-thread.
>   

Indeed.
[Roland's patch is included in tip/master so I actually tried it even if
I didn't know I did]

> Signals are _always_ delayed until non-blocking system calls are done, and 
> that means until the end of IO. 
>
> This is also why your trace on just 'kill_pgrp' and 'detach_pid' is not 
> interesting. It's _normal_ to have a delay between them. It can happen 
> because the process blocks (or catches) signals, but it will also happen 
> if some system call waits for disk.
>   

Is there a way to trace what happens between those 2 functions?
Maybe if I don't use the trace filter (and thus trace all functions),
and modify the kernel sources to start tracing on kill_pgrp, and stop
tracing on detach_pid.
Would that provide useful info?


Best regards,
--Edwin

next prev parent reply	other threads:[~2008-07-12 20:27 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-07-10 21:50 [PATCH] x86_64: fix delayed signals Roland McGrath
2008-07-10 22:06 ` Linus Torvalds
2008-07-10 22:42   ` Roland McGrath
2008-07-10 22:51     ` Linus Torvalds
2008-07-10 23:02       ` Linus Torvalds
2008-07-11  0:52       ` Roland McGrath
2008-07-11  1:18         ` Linus Torvalds
2008-07-11  1:27           ` Roland McGrath
2008-07-11  1:48         ` Linus Torvalds
2008-07-11  2:02           ` Linus Torvalds
2008-07-11  2:22             ` Linus Torvalds
2008-07-11  2:26               ` Linus Torvalds
2008-07-12 12:24             ` Andi Kleen
2008-07-11  5:46 ` Ingo Molnar
2008-07-11 11:13   ` Török Edwin
2008-07-11 12:24   ` Elias Oltmanns
2008-07-11 17:58   ` Linus Torvalds
2008-07-11 18:07     ` Roland McGrath
2008-07-11 18:16       ` Linus Torvalds
2008-07-11 18:17         ` Linus Torvalds
2008-07-11 18:10     ` Linus Torvalds
2008-07-11 18:31       ` Linus Torvalds
2008-07-11 22:53         ` Arjan van de Ven
2008-07-12 10:33           ` Török Edwin
2008-07-11 20:37       ` Linus Torvalds
2008-07-11 23:22         ` Linus Torvalds
2008-07-12 10:32           ` Török Edwin
2008-07-12 13:42             ` Török Edwin
2008-07-12 14:55               ` Arjan van de Ven
2008-07-12 18:00                 ` Linus Torvalds
2008-07-12 18:15                   ` Arjan van de Ven
2008-07-12 18:28                     ` Linus Torvalds
2008-07-12 17:29             ` Linus Torvalds
2008-07-12 20:26               ` Török Edwin [this message]
2008-07-12 20:47                 ` Linus Torvalds
2008-07-12 20:57                 ` Denys Vlasenko
2008-07-13 10:46                   ` Oleg Nesterov
2008-07-13 12:34                     ` Denys Vlasenko
2008-07-13 18:36                     ` Linus Torvalds
2008-07-13 18:45                       ` Peter T. Breuer
2008-07-12 12:27     ` Andi Kleen
2008-07-12 17:41       ` Linus Torvalds
2008-07-13  9:38         ` Andi Kleen
2008-07-13 17:32           ` Linus Torvalds
2008-07-13 18:59             ` Andi Kleen
2008-07-13 19:08               ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48791393.1020107@gmail.com \
    --to=edwintorok@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=arjan@infradead.org \
    --cc=eo@nebensachen.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=oleg@tv-sign.ru \
    --cc=roland@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.