public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: "Török Edwin" <edwintorok@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>, Roland McGrath <roland@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Elias Oltmanns <eo@nebensachen.de>,
	Arjan van de Ven <arjan@infradead.org>,
	Oleg Nesterov <oleg@tv-sign.ru>
Subject: Re: [PATCH] x86_64: fix delayed signals
Date: Sat, 12 Jul 2008 23:26:59 +0300	[thread overview]
Message-ID: <48791393.1020107@gmail.com> (raw)
In-Reply-To: <alpine.LFD.1.10.0807121011470.2875@woody.linux-foundation.org>

On 2008-07-12 20:29, Linus Torvalds wrote:
> On Sat, 12 Jul 2008, Török Edwin wrote:
>   
>> On my 32-bit box (slow disks, SMP, XFS filesystem) 2.6.26-rc9 behaves
>> the same as 2.6.26-rc8, I can reliably reproduce a 2-3 second latency
>> [1] between pressing ^C the first time, and the shell returning (on the
>> text console too).
>> Using ftrace available from tip/master, I see up to 3 seconds of delay
>> between kill_pgrp and detach_pid (and during that time I can press ^C
>> again, leading to 2-3 kill_pgrp calls)
>>     
>
> The thing is, it's important to see what happens in between. 
>
> In particular, 2-3 second latencies can be entirely _normal_ (although 
> obviously very annoying) with most log-based filesystems when they decide 
> they have to flush the log.

A bit off-topic, but something I noticed during the tests:
In my original test I have rm-ed the files right after launching dd in
the background, yet it still continued to write to the disk.
I can understand that if the file is opened O_RDWR, you might seek back
and read what you wrote, so Linux needs to actually do the write,
but why does it insist on writing to the disk, on a file opened with
O_WRONLY, after the file itself got unlinked?

>  A lot of filesystems are not designed for 
> latency - every single filesystem test I have ever seen has always been 
> either a throughput test, or a "average random-seek latency" kind of test.
>
> The exact behavior will depend on the filesystem, for example. It will 
> also easily depend on things like whether you update 'atime' or not. Many 
> ostensibly read-only loads end up writing some data, especially inode 
> atimes, and that's when they can get caught up in having to wait for a log 
> to flush (to make up for that atime thing).
>   

I have my filesystems mounted as noatime already.
But yes, I am using different filesystems, the x86-64 box has reiserfs,
and the x86-32 box has xfs.

> You can try to limit the amount of dirty data in flight by tweaking 
> /proc/sys/vm/dirty*ratio

I have these in my /etc/rc.local:
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 10 >/proc/sys/vm/dirty_ratio


> , but from a latency standpoint the thing that 
> actually matters more is often not the amount of dirty data, but the size 
> of the requests queues - because you often care about read latency, but if 
> you have big requests and especially if you have a load that does lots of 
> big _contiguous_ writes (like your 'dd's would do), then what can easily 
> happen is that the read ends up being behind a really big write in the 
> request queues. 
>   

I will try tweaking these (and /sys/block/sda/queue/nr_requests that
Arjan suggested).

> And 2-3 second latencies by no means means that each individual IO is 2-3 
> seconds long. No - it just means that you ended up having to do multiple 
> reads synchronously, and since the reads depended on each other (think a 
> pathname lookup - reading each directory entry -> inode -> data), you can 
> easily have a single system call causing 5-10 reads (bad cases are _much_ 
> more, but 5-10 are perfectly normal for even well-behaved things), and now 
> if each of those reads end up being behind a fairly big write...
>   

I'll try blktrace tomorrow, it should tell me when I/Os are queued /
completed.

>   
>> On my 64-bit box (2 disks in raid-0, UP, reiserfs filesystem) 2.6.25 and
>> 2.6.26-rc9 behave the same, and most of the time (10-20 times in a row)
>> find responds to ^C instantly.
>>
>> However in _some_ cases find doesn't respond to ^C for a very long time 
>> (~30 seconds), and when this happens I can't do anything else but switch 
>> consoles, starting another process (latencytop -d) hangs, and so does 
>> any other external command.
>>     
>
> Ok, that is definitel not related to signals at all. You're simply stuck 
> waiting for IO - or perhaps some fundamental filesystem semaphore which is 
> held while some IO needs to be flushed.

AFAICT reiserfs still uses the BKL, could that explain why one I/O
delays another?

>  That's why _unrelated_ processes 
> hang: they're all waiting for a global resource.
>
> And it may be worse on your other box for any number of reasons: raid 
> means, for example, that you have two different levels of queueing, and 
> thus effectively your queues are longer. And while raid-0 is better for 
> throughput, it's not necessarily at all better for latency. The filesystem 
> also makes a difference, as does the amount of dirty data under write-back 
> (do you also have more memory in your x86-64 box, for example? That makes 
> the kernel do bigger writeback buffers by default)
>
>   

Yes, I have more memory on x86-64 (2G), and x86-32 has 1G.
>> I haven't yet tried ftrace on this box, and neither did I try Roland's
>> patch yet. I will try that now, and hopefuly come back with some numbers
>> shortly.
>>     
>
> Trust me, roland's patch will make no difference what-so-ever. It's purely 
> a per-thread thing, and your behaviour is clearly not per-thread.
>   

Indeed.
[Roland's patch is included in tip/master so I actually tried it even if
I didn't know I did]

> Signals are _always_ delayed until non-blocking system calls are done, and 
> that means until the end of IO. 
>
> This is also why your trace on just 'kill_pgrp' and 'detach_pid' is not 
> interesting. It's _normal_ to have a delay between them. It can happen 
> because the process blocks (or catches) signals, but it will also happen 
> if some system call waits for disk.
>   

Is there a way to trace what happens between those 2 functions?
Maybe if I don't use the trace filter (and thus trace all functions),
and modify the kernel sources to start tracing on kill_pgrp, and stop
tracing on detach_pid.
Would that provide useful info?


Best regards,
--Edwin


  reply	other threads:[~2008-07-12 20:27 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-07-10 21:50 [PATCH] x86_64: fix delayed signals Roland McGrath
2008-07-10 22:06 ` Linus Torvalds
2008-07-10 22:42   ` Roland McGrath
2008-07-10 22:51     ` Linus Torvalds
2008-07-10 23:02       ` Linus Torvalds
2008-07-11  0:52       ` Roland McGrath
2008-07-11  1:18         ` Linus Torvalds
2008-07-11  1:27           ` Roland McGrath
2008-07-11  1:48         ` Linus Torvalds
2008-07-11  2:02           ` Linus Torvalds
2008-07-11  2:22             ` Linus Torvalds
2008-07-11  2:26               ` Linus Torvalds
2008-07-12 12:24             ` Andi Kleen
2008-07-11  5:46 ` Ingo Molnar
2008-07-11 11:13   ` Török Edwin
2008-07-11 12:24   ` Elias Oltmanns
2008-07-11 17:58   ` Linus Torvalds
2008-07-11 18:07     ` Roland McGrath
2008-07-11 18:16       ` Linus Torvalds
2008-07-11 18:17         ` Linus Torvalds
2008-07-11 18:10     ` Linus Torvalds
2008-07-11 18:31       ` Linus Torvalds
2008-07-11 22:53         ` Arjan van de Ven
2008-07-12 10:33           ` Török Edwin
2008-07-11 20:37       ` Linus Torvalds
2008-07-11 23:22         ` Linus Torvalds
2008-07-12 10:32           ` Török Edwin
2008-07-12 13:42             ` Török Edwin
2008-07-12 14:55               ` Arjan van de Ven
2008-07-12 18:00                 ` Linus Torvalds
2008-07-12 18:15                   ` Arjan van de Ven
2008-07-12 18:28                     ` Linus Torvalds
2008-07-12 17:29             ` Linus Torvalds
2008-07-12 20:26               ` Török Edwin [this message]
2008-07-12 20:47                 ` Linus Torvalds
2008-07-12 20:57                 ` Denys Vlasenko
2008-07-13 10:46                   ` Oleg Nesterov
2008-07-13 12:34                     ` Denys Vlasenko
2008-07-13 18:36                     ` Linus Torvalds
2008-07-13 18:45                       ` Peter T. Breuer
2008-07-12 12:27     ` Andi Kleen
2008-07-12 17:41       ` Linus Torvalds
2008-07-13  9:38         ` Andi Kleen
2008-07-13 17:32           ` Linus Torvalds
2008-07-13 18:59             ` Andi Kleen
2008-07-13 19:08               ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48791393.1020107@gmail.com \
    --to=edwintorok@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=arjan@infradead.org \
    --cc=eo@nebensachen.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=oleg@tv-sign.ru \
    --cc=roland@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox