* Re: Linux 2.6.36-rc7
[not found] <AANLkTi=LsBNU+O2hqZUcM2nYM_ze6qPq3thwSZBMtY_v@mail.gmail.com>
@ 2010-10-07 19:28 ` Tejun Heo
2010-10-07 20:13 ` Milan Broz
0 siblings, 1 reply; 5+ messages in thread
From: Tejun Heo @ 2010-10-07 19:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Linux Kernel Mailing List, just.for.lkml, herbert, hch, neilb,
dm-devel
Hello, Linus.
On 10/06/2010 11:45 PM, Linus Torvalds wrote:
> So I decided to break my a-week-is-eight-days rut, and actually
> release -rc7 after a proper seven-day week instead. Wo-oo!
>
> And yes, that's probably as exciting as it gets, which is just fine by
> me. This should be the last -rc, I'm not seeing any reason to keep
> delaying a real release. There were still more changes to
> drivers/gpu/drm than I really would have hoped for, but they all look
> harmless and good. Famous last words.
I'm afraid there is a possibly workqueue-related deadlock under high
memory pressure. It happens on a dm-crypt + md raid1 configuration.
I'm not yet sure whether this is caused by the workqueue failing to kick
rescuers under memory pressure or whether the shared workqueue is making
an already existing problem more visible; I'm in the process of setting
up an environment to reproduce the problem.
http://thread.gmane.org/gmane.comp.file-systems.xfs.general/34922/focus=1044784
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Linux 2.6.36-rc7
2010-10-07 19:28 ` Linux 2.6.36-rc7 Tejun Heo
@ 2010-10-07 20:13 ` Milan Broz
2010-10-08 17:02 ` [dm-devel] " Tejun Heo
0 siblings, 1 reply; 5+ messages in thread
From: Milan Broz @ 2010-10-07 20:13 UTC (permalink / raw)
To: device-mapper development
Cc: Linux Kernel Mailing List, just.for.lkml, hch, herbert, Tejun Heo,
Linus Torvalds
On 10/07/2010 09:28 PM, Tejun Heo wrote:
> I'm afraid there is a possibly workqueue-related deadlock under high
> memory pressure. It happens on a dm-crypt + md raid1 configuration.
> I'm not yet sure whether this is caused by the workqueue failing to kick
> rescuers under memory pressure or whether the shared workqueue is making
> an already existing problem more visible; I'm in the process of setting
> up an environment to reproduce the problem.
>
> http://thread.gmane.org/gmane.comp.file-systems.xfs.general/34922/focus=1044784
Yes, XFS is very good at showing up problems in dm-crypt :)
But there was no change in dm-crypt which could itself cause such a problem;
the planned workqueue changes are not in 2.6.36 yet.
The code is basically the same as in the last few releases.
So it seems that workqueue processing really changed here under memory pressure.
Milan
p.s.
Anyway, if you are able to reproduce it and you think that there is a problem
in the per-device dm-crypt workqueue, there are patches from Andi for a shared
per-cpu workqueue; maybe they can help here. (But this is really not RC material.)
They are unfortunately not yet in the dm-devel tree, but I have them here ready for review:
http://mbroz.fedorapeople.org/dm-crypt/2.6.36-devel/
(All 4 patches must be applied; I hope Alasdair will put them in the dm quilt soon.)
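For illustration of what is being suggested here (a rough sketch only, not
Andi's actual patches linked above; the identifiers below are made up):
instead of each crypt device creating its own single-threaded queue, one
shared concurrency-managed queue is created once and used by every device.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

/* Rough sketch only; not the code from the patches linked above. */

/* Roughly the current scheme: one single-threaded queue per crypt device. */
static struct workqueue_struct *crypt_alloc_private_wq(void)
{
	return create_singlethread_workqueue("kcryptd");
}

/* Shared alternative: one queue for all devices, created at module init. */
static struct workqueue_struct *kcryptd_shared_wq;

static int __init crypt_shared_wq_init(void)
{
	/*
	 * WQ_CPU_INTENSIVE keeps the CPU-bound cipher work from blocking
	 * other works on the same per-CPU pool; WQ_RESCUER (the 2.6.36
	 * name for what later became WQ_MEM_RECLAIM) guarantees forward
	 * progress under memory pressure.
	 */
	kcryptd_shared_wq = alloc_workqueue("kcryptd",
					    WQ_CPU_INTENSIVE | WQ_RESCUER, 0);
	return kcryptd_shared_wq ? 0 : -ENOMEM;
}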
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [dm-devel] Linux 2.6.36-rc7
2010-10-07 20:13 ` Milan Broz
@ 2010-10-08 17:02 ` Tejun Heo
2010-10-10 11:56 ` Torsten Kaiser
2010-10-11 10:09 ` [PATCH wq#for-next] workqueue: fix HIGHPRI handling in keep_working() Tejun Heo
0 siblings, 2 replies; 5+ messages in thread
From: Tejun Heo @ 2010-10-08 17:02 UTC (permalink / raw)
To: Milan Broz
Cc: device-mapper development, Linus Torvalds,
Linux Kernel Mailing List, just.for.lkml, hch, herbert
Hello, again.
On 10/07/2010 10:13 PM, Milan Broz wrote:
> Yes, XFS is very good at showing up problems in dm-crypt :)
>
> But there was no change in dm-crypt which could itself cause such a problem;
> the planned workqueue changes are not in 2.6.36 yet.
> The code is basically the same as in the last few releases.
>
> So it seems that workqueue processing really changed here under memory pressure.
>
> Milan
>
> p.s.
> Anyway, if you are able to reproduce it and you think that there is a problem
> in the per-device dm-crypt workqueue, there are patches from Andi for a shared
> per-cpu workqueue; maybe they can help here. (But this is really not RC material.)
>
> They are unfortunately not yet in the dm-devel tree, but I have them here ready for review:
> http://mbroz.fedorapeople.org/dm-crypt/2.6.36-devel/
> (All 4 patches must be applied; I hope Alasdair will put them in the dm quilt soon.)
Okay, I spent the whole day reproducing the problem and trying to
determine what's going on. In the process, I've found a bug and a
potential issue (I'm not yet sure whether it's an actual issue which
should be fixed for this release), but the hang doesn't seem to have
anything to do with the workqueue update. All the queues are behaving
exactly as expected during the hang.
Also, it isn't a regression. I can reliably trigger the same deadlock
on v2.6.35.
Here's the setup I used to trigger the problem, which should be mostly
similar to Torsten's.
The machine is dual quad-core Opteron (8 phys cores) w/ 4GiB memory.
* 80GB raid1 of two SATA disks
* On top of that, luks encrypted device w/ twofish-cbc-essiv:sha256
* In the encrypted device, xfs filesystem which hosts 8GiB swapfile
* 12GiB tmpfs
The workload is a v2.6.35 allyesconfig -j 128 build in the tmpfs. Not
too long after swap starts being used (several tens of seconds), the
system hangs. IRQ handling and so on are fine, but no IO gets through
and a lot of tasks are stuck in bio allocation somewhere.
I suspected that with md and dm stacked together, something in the
upper layer ended up exhausting a shared bio pool; I tried a couple
of things but haven't succeeded in finding the culprit. It would
probably be best to run blktrace alongside and analyze where IO
gets stuck.
So, well, we seem to be broken the same way as before. No need to
delay release for this one.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [dm-devel] Linux 2.6.36-rc7
2010-10-08 17:02 ` [dm-devel] " Tejun Heo
@ 2010-10-10 11:56 ` Torsten Kaiser
2010-10-11 10:09 ` [PATCH wq#for-next] workqueue: fix HIGHPRI handling in keep_working() Tejun Heo
1 sibling, 0 replies; 5+ messages in thread
From: Torsten Kaiser @ 2010-10-10 11:56 UTC (permalink / raw)
To: Tejun Heo
Cc: Milan Broz, device-mapper development, Linus Torvalds,
Linux Kernel Mailing List, hch, herbert
On Fri, Oct 8, 2010 at 7:02 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, again.
>
> On 10/07/2010 10:13 PM, Milan Broz wrote:
>> Yes, XFS is very good at showing up problems in dm-crypt :)
>>
>> But there was no change in dm-crypt which could itself cause such a problem;
>> the planned workqueue changes are not in 2.6.36 yet.
>> The code is basically the same as in the last few releases.
>>
>> So it seems that workqueue processing really changed here under memory pressure.
>>
>> Milan
>>
>> p.s.
>> Anyway, if you are able to reproduce it and you think that there is a problem
>> in the per-device dm-crypt workqueue, there are patches from Andi for a shared
>> per-cpu workqueue; maybe they can help here. (But this is really not RC material.)
>>
>> They are unfortunately not yet in the dm-devel tree, but I have them here ready for review:
>> http://mbroz.fedorapeople.org/dm-crypt/2.6.36-devel/
>> (All 4 patches must be applied; I hope Alasdair will put them in the dm quilt soon.)
>
> Okay, I spent the whole day reproducing the problem and trying to
> determine what's going on. In the process, I've found a bug and a
> potential issue (I'm not yet sure whether it's an actual issue which
> should be fixed for this release), but the hang doesn't seem to have
> anything to do with the workqueue update. All the queues are behaving
> exactly as expected during the hang.
>
> Also, it isn't a regression. I can reliably trigger the same deadlock
> on v2.6.35.
>
> Here's the setup I used to trigger the problem, which should be mostly
> similar to Torsten's.
>
> The machine is dual quad-core Opteron (8 phys cores) w/ 4GiB memory.
>
> * 80GB raid1 of two SATA disks
> * On top of that, luks encrypted device w/ twofish-cbc-essiv:sha256
> * In the encrypted device, xfs filesystem which hosts 8GiB swapfile
> * 12GiB tmpfs
>
> The workload is a v2.6.35 allyesconfig -j 128 build in the tmpfs. Not
> too long after swap starts being used (several tens of seconds), the
> system hangs. IRQ handling and so on are fine, but no IO gets through
> and a lot of tasks are stuck in bio allocation somewhere.
>
> I suspected that with md and dm stacked together, something in the
> upper layer ended up exhausting a shared bio pool; I tried a couple
> of things but haven't succeeded in finding the culprit. It would
> probably be best to run blktrace alongside and analyze where IO
> gets stuck.
>
> So, well, we seem to be broken the same way as before. No need to
> delay release for this one.
I instrumented mm/mempool.c, trying to find out which shared pool gets exhausted.
On the last run, it looked like the fs_bio_set from fs/bio.c ran dry.
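A minimal sketch of that kind of instrumentation (illustrative only, not
the actual change tried here; the helper name is made up): log whenever a
pool hands out its last reserved element, so the exhausted mempool and the
task that drained it show up in the kernel log.

#include <linux/kernel.h>
#include <linux/mempool.h>
#include <linux/sched.h>

/*
 * Hypothetical debug helper, illustrative only.  Call it from the slow
 * path in mm/mempool.c every time an element is taken from the reserved
 * array; the log then shows which pool ran dry and which task drained it.
 */
static void mempool_debug_note_exhausted(mempool_t *pool)
{
	if (pool->curr_nr == 0)
		printk(KERN_WARNING
		       "mempool %p (min_nr %d) exhausted by %s/%d\n",
		       pool, pool->min_nr, current->comm, current->pid);
}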
As far as I can see, that pool is used by bio_alloc() and bio_clone().
Above bio_alloc() a dire warning says that any bio allocated that way
needs to be submitted for IO before allocating another one, otherwise the
system could livelock.
bio_clone() does not have this warning, but as it uses the same pool
in the same way, I would expect the same rule to apply.
Looking for uses of bio_alloc() and bio_clone() in drivers/md, it looks
like dm-crypt uses its own pools and not the fs_bio_set.
But drivers/md/raid1.c uses this pool, and in my eyes it does it wrong.
When writing to a RAID1 array, the function make_request() in raid1.c
does a bio_clone() for each drive (lines 967-1001 in 2.6.36-rc7), and
only after all bios are allocated are they merged into the
pending_bio_list.
So a RAID1 with 3 mirrors is a sure way to lock up a system as soon as
the mempool is needed?
(The fs_bio_set pool only reserves BIO_POOL_SIZE entries, and that is
defined as 2.)
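A condensed sketch of that write path (illustrative only; the real
make_request() queues the clones on pending_bio_list under device_lock and
lets raid1d submit them later, and the function and constant below are
invented for the example):

#include <linux/bio.h>
#include <linux/blkdev.h>

#define NR_MIRRORS_SKETCH 3	/* made-up bound, just for the example */

/*
 * Illustrative only, not the real raid1 make_request(): every mirror
 * gets its own clone from the shared fs_bio_set *before* any of them
 * is submitted.
 */
static void raid1_write_sketch(struct bio *master_bio)
{
	struct bio *clones[NR_MIRRORS_SKETCH];
	int i;

	for (i = 0; i < NR_MIRRORS_SKETCH; i++) {
		/*
		 * bio_clone() allocates from fs_bio_set, which keeps only
		 * BIO_POOL_SIZE (2) reserved entries.  Under memory
		 * pressure a later clone can end up sleeping here, waiting
		 * for a reserved entry that only comes back when an earlier
		 * clone completes -- but none has been submitted yet.
		 */
		clones[i] = bio_clone(master_bio, GFP_NOIO);
	}

	/* Only now is the whole batch handed to the lower layers. */
	for (i = 0; i < NR_MIRRORS_SKETCH; i++)
		generic_make_request(clones[i]);
}

With three mirrors a single writer already breaks the at-most-one-outstanding-bio
rule from the bio_alloc() comment, which fits the lock-up described above.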
From the use of atomic_inc(&r1_bio->remaining) and of
spin_lock_irqsave(&conf->device_lock, flags) when merging the bio
list, I would suspect that it's even possible that multiple CPUs
concurrently get into this allocation loop, or that the use of
multiple RAID1 devices, each with only 2 drives, could lock up the
same way.
What am I missing, or is the use of bio_clone() really the wrong thing?
Torsten
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH wq#for-next] workqueue: fix HIGHPRI handling in keep_working()
2010-10-08 17:02 ` [dm-devel] " Tejun Heo
2010-10-10 11:56 ` Torsten Kaiser
@ 2010-10-11 10:09 ` Tejun Heo
1 sibling, 0 replies; 5+ messages in thread
From: Tejun Heo @ 2010-10-11 10:09 UTC (permalink / raw)
To: Milan Broz, Linus Torvalds
Cc: device-mapper development, Linux Kernel Mailing List,
just.for.lkml, hch, herbert
The policy function keep_working() didn't check GCWQ_HIGHPRI_PENDING
and could return %false with highpri work pending. This could lead to
late execution of a highpri work which was delayed due to @max_active
throttling if other works are actively consuming CPU cycles.
For example, the following could happen.
1. Work W0, which burns CPU cycles, is running.
2. Two works W1 and W2 are queued to a highpri wq w/ @max_active of 1.
3. W1 starts executing and W2 is put on the delayed queue. W0 and W1 are
both runnable.
4. W1 finishes, which puts W2 on the pending queue, but keep_working()
incorrectly returns %false and the worker goes to sleep.
5. W0 finishes and W2 starts execution.
With this patch applied, W2 starts execution as soon as W1 finishes.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
This is the workqueue bug I found while trying to debug the dm/raid
hang. Although the bug may introduce an unexpected delay in scheduling a
highpri work, the delay can only be as long as the combined length of
the CPU cycle burns of the already running works. Given that HIGHPRI is
currently only used by xfs, and given how it is used there, I don't
think it's likely to cause an actual issue. I'll queue it for #for-next.
Thank you.
kernel/workqueue.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f77afd9..d355278 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -604,7 +604,9 @@ static bool keep_working(struct global_cwq *gcwq)
 {
 	atomic_t *nr_running = get_gcwq_nr_running(gcwq->cpu);
 
-	return !list_empty(&gcwq->worklist) && atomic_read(nr_running) <= 1;
+	return !list_empty(&gcwq->worklist) &&
+	       (atomic_read(nr_running) <= 1 ||
+		gcwq->flags & GCWQ_HIGHPRI_PENDING);
 }
 
 /* Do we need a new worker? Called from manager. */
--
1.7.1
^ permalink raw reply related [flat|nested] 5+ messages in thread