* Calltrace in dm-snapshot in 2.6.27 kernel
@ 2008-10-20 6:23 aluno3
2008-10-20 8:43 ` Milan Broz
0 siblings, 1 reply; 17+ messages in thread
From: aluno3 @ 2008-10-20 6:23 UTC (permalink / raw)
To: dm-devel
Hi,
I have a problem with device mapper in the 2.6.27 kernel. Below is the
call trace from the logs:
---------------
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<0000000000000000>] 0x0
PGD 5a84c067 PUD 5cfdb067 PMD 0
Oops: 0000 [1] SMP
CPU 0
Modules linked in: iscsi_trgt drbd bonding iscsi_tcp libiscsi
scsi_transport_iscsi megaraid_mbox megaraid_mm sky2 skge button ftdi_sio
usbserial
Pid: 31704, comm: kcopyd Not tainted 2.6.27 #7
RIP: 0010:[<0000000000000000>] [<0000000000000000>] 0x0
RSP: 0000:ffff880055af5d18 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88007dfe3128 RCX: 010000000000059d
RDX: 0000000000000018 RSI: 8000000000000000 RDI: ffff88007dfe3128
RBP: ffff88007dfe33c8 R08: ffffc20005f751d0 R09: 00ffffffffffffff
R10: 0100000000000000 R11: 0000000000000000 R12: ffff880014d8dc00
R13: 0000000000000000 R14: ffff880059c89840 R15: ffff880014d8dd18
FS: 0000000000000000(0000) GS:ffffffff808dea80(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000007d0d0000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
Process kcopyd (pid: 31704, threadinfo ffff880055af4000, task
ffff880026e3ce70)
Stack: ffffffff805c2ea4 00000000ffffff7e 00000000000000ee ffff88004c268440
0000000000000000 ffff880026d58eb8 0000000000000400 0000000000000000
ffffffff805c4140 0000000000001d5a 00000000000005b8 ffff880001025af0
Call Trace:
[<ffffffff805c2ea4>] ? pending_complete+0x1e4/0x220
[<ffffffff805c4140>] ? persistent_commit+0x100/0x130
[<ffffffff805bd8a3>] ? segment_complete+0x183/0x1c0
[<ffffffff805bd720>] ? segment_complete+0x0/0x1c0
[<ffffffff805bd385>] ? run_complete_job+0x65/0xb0
[<ffffffff805bd320>] ? run_complete_job+0x0/0xb0
[<ffffffff805bd5d6>] ? process_jobs+0x26/0xe0
[<ffffffff805bd690>] ? do_work+0x0/0x60
[<ffffffff805bd6b8>] ? do_work+0x28/0x60
[<ffffffff8024686a>] ? run_workqueue+0x5a/0x110
[<ffffffff802469bc>] ? worker_thread+0x9c/0xf0
[<ffffffff8024a620>] ? autoremove_wake_function+0x0/0x30
[<ffffffff8024a620>] ? autoremove_wake_function+0x0/0x30
[<ffffffff80246920>] ? worker_thread+0x0/0xf0
[<ffffffff80249f0c>] ? kthread+0x6c/0xa0
[<ffffffff8020d1c9>] ? child_rip+0xa/0x11
[<ffffffff8021b5f0>] ? lapic_next_event+0x0/0x10
[<ffffffff80249ea0>] ? kthread+0x0/0xa0
[<ffffffff8020d1bf>] ? child_rip+0x0/0x11
Code: Bad RIP value.
RIP [<0000000000000000>] 0x0
RSP <ffff880055af5d18>
CR2: 0000000000000000
---------------
I got this call trace from our QA team. They say that they made a few
snapshots and ran several programs such as bacula and rsync, and that the
call trace appears about 1 hour after starting those programs.
We have not identified the cause of this call trace so far; that is, we
don't know which of these programs triggers it.
I investigated this call trace a little on my own. What I know is that
the "free" pointer (in the mempool_t structure) is NULL when
mempool_free() is called.
Here is the chain of calls:
(...) -> put_pending_exception():841 -> free_pending_exception() ->
mempool_free()
mempool_free() then calls:
pool->free(element, pool->pool_data)
and here pool->free is NULL, which causes the oops.
That is the description of the problem.
Is this a known problem? Is there a fix for it?
Any suggestions?
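To make the failure mode described above concrete, here is a minimal
userspace sketch in C. It is not the kernel's mempool code: struct
pool_model, count_free() and pool_model_free() are hypothetical stand-ins
that only mirror the pool->free(element, pool->pool_data) call, and the
NULL check is an addition so that the sketch reports the bad callback
instead of jumping to address 0 as in the oops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two mempool_t fields named above. */
struct pool_model {
    void (*free)(void *element, void *pool_data); /* release callback */
    void *pool_data;                              /* opaque callback argument */
};

/* Example callback: counts how many elements were released. */
static void count_free(void *element, void *pool_data)
{
    (void)element;
    ++*(int *)pool_data;
}

/*
 * Models the last step of the call chain. Returns 0 after invoking the
 * callback, or -1 when the callback pointer is NULL, which is the state
 * the oops shows.
 */
static int pool_model_free(struct pool_model *pool, void *element)
{
    assert(pool != NULL);
    if (pool->free == NULL)
        return -1;
    pool->free(element, pool->pool_data);
    return 0;
}
```

With a pool initialised as { count_free, &counter }, pool_model_free()
returns 0 and bumps the counter; once pool.free is set to NULL (standing
in for a pool that has already been torn down) it returns -1.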
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-20 6:23 Calltrace in dm-snapshot in 2.6.27 kernel aluno3
@ 2008-10-20 8:43 ` Milan Broz
2008-10-21 6:39 ` aluno3
0 siblings, 1 reply; 17+ messages in thread
From: Milan Broz @ 2008-10-20 8:43 UTC (permalink / raw)
To: device-mapper development
aluno3@poczta.onet.pl wrote:
> I got this call trace from our QA team. They say that they made a few
> snapshots and ran several programs such as bacula and rsync, and that the
> call trace appears about 1 hour after starting those programs.
Hi,
if it is reproducible, could you please check whether this patch helps?
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-snapshot-fix-primary_pe-race.patch
It is probably the same problem as the one reported here:
http://bugzilla.kernel.org/show_bug.cgi?id=11636
(Added Mikulas to CC)
Milan
--
mbroz@redhat.com
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-20 8:43 ` Milan Broz
@ 2008-10-21 6:39 ` aluno3
2008-10-21 13:55 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: aluno3 @ 2008-10-21 6:39 UTC (permalink / raw)
To: device-mapper development
Hi Milan,
Thanks for the patch. I've applied it to 2.6.27, but it looks like we
still have the same problem. We've tested it on both 32-bit and 64-bit
kernels, and the problem occurs on both of them, but in a different way.
Here are the call traces from both kernels (32-bit and 64-bit):
32 bit one:
BUG: unable to handle kernel paging request at 08048000
IP: [<c05263f9>] _spin_lock_irqsave+0x9/0x20
*pdpt = 000000000c438001 *pde = 000000007f997067
Oops: 0003 [#1] SMP
Modules linked in: sg st iscsi_trgt drbd bonding iscsi_tcp libiscsi
scsi_transport_iscsi 3w_9xxx sata_nv forcedeth button ftdi_sio usbserial
Pid: 30618, comm: kcopyd Not tainted (2.6.27-32#1)
EIP: 0060:[<c05263f9>] EFLAGS: 00010097 CPU: 0
EIP is at _spin_lock_irqsave+0x9/0x20
EAX: 08048000 EBX: 08048000 ECX: 00000297 EDX: 00000100
ESI: eb602148 EDI: eb53cb40 EBP: 00000000 ESP: f11f7ea0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process kcopyd (pid: 30618, ti=f11f6000 task=f5065580 task.ti=f11f6000)
Stack: c015a57f eb602148 eb641d08 eb53cb40 c044ca4a eb641d08 00000000 f29a5080
c044cb07 00000002 00000000 eb8dd540 00000000 c044ddb3 00057803 00000000
000001f5 00000000 f2728108 f29a5080 00000000 c044cba0 f2728108 ed016370
Call Trace:
[<c015a57f>] mempool_free+0x1f/0x70
[<c044ca4a>] put_pending_exception+0x5a/0x60
[<c044cb07>] pending_complete+0xb7/0x110
[<c044ddb3>] persistent_commit+0xe3/0x110
[<c044cba0>] copy_callback+0x30/0x40
[<c0447d04>] segment_complete+0x154/0x1d0
[<c0447935>] run_complete_job+0x45/0x80
[<c0447bb0>] segment_complete+0x0/0x1d0
[<c04478f0>] run_complete_job+0x0/0x80
[<c0447af4>] process_jobs+0x14/0x70
[<c0447b50>] do_work+0x0/0x40
[<c0447b66>] do_work+0x16/0x40
[<c013509d>] run_workqueue+0x4d/0xf0
[<c01351bd>] worker_thread+0x7d/0xc0
[<c0138350>] autoremove_wake_function+0x0/0x30
[<c0524efc>] __sched_text_start+0x1ec/0x4b0
[<c0138350>] autoremove_wake_function+0x0/0x30
[<c0121a9b>] complete+0x2b/0x40
[<c0135140>] worker_thread+0x0/0xc0
[<c0137e24>] kthread+0x44/0x70
[<c0137de0>] kthread+0x0/0x70
[<c0104c57>] kernel_thread_helper+0x7/0x10
=======================
Code: 89 c8 c3 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 90 90 83 28 01
79 05 e8 25 ff ff ff c3 8d 74 26 00 9c 59 fa ba 00 01 00 00 90 <66> 0f
c1 10 38 f2 74 06 f3 90 8a 10 eb f6 89 c8 c3 8d b6 00 00
EIP: [<c05263f9>] _spin_lock_irqsave+0x9/0x20 SS:ESP 0068:f11f7ea0
---[ end trace b3493777a8378781 ]---
64 bit one:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<0000000000000000>] 0x0
PGD 6e88e067 PUD 53f6f067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: iscsi_trgt drbd bonding iscsi_tcp libiscsi
scsi_transport_iscsi megaraid_mbox megaraid_mm sky2 skge button ftdi_sio
usbserial
Pid: 13724, comm: kcopyd Not tainted 2.6.27-64#3
RIP: 0010:[<0000000000000000>] [<0000000000000000>] 0x0
RSP: 0000:ffff880000b83d18 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88002626fba8 RCX: 0000000000000001
RDX: ffff8800761d4208 RSI: 8000000000000000 RDI: ffff88002626fba8
RBP: ffff8800399e4000 R08: ffffc20005e1e130 R09: 00ffffffffffffff
R10: 0100000000000000 R11: 0000000000000000 R12: ffff880030087c88
R13: 0000000000000000 R14: ffff88002c14f440 R15: ffff8800399e4118
FS: 0000000000000000(0000) GS:ffff88007f473dc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000002a7d9000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kcopyd (pid: 13724, threadinfo ffff880000b82000, task
ffff88007f58cf30)
Stack: ffffffff805c2eae 0000000000000400 0000000000000001 ffff8800561a55c0
0000000000000000 ffff880038c1f978 0000000000000400 0000000000000000
ffffffff805c4130 0000000000001425 000000000000062a 0000000000000082
Call Trace:
[<ffffffff805c2eae>] ? pending_complete+0x1ee/0x230
[<ffffffff805c4130>] ? persistent_commit+0xe0/0x130
[<ffffffff805bd8a3>] ? segment_complete+0x183/0x1c0
[<ffffffff805bd720>] ? segment_complete+0x0/0x1c0
[<ffffffff805bd385>] ? run_complete_job+0x65/0xb0
[<ffffffff805bd320>] ? run_complete_job+0x0/0xb0
[<ffffffff805bd5d6>] ? process_jobs+0x26/0xe0
[<ffffffff805bd690>] ? do_work+0x0/0x60
[<ffffffff805bd6b8>] ? do_work+0x28/0x60
[<ffffffff8024686a>] ? run_workqueue+0x5a/0x110
[<ffffffff802469bc>] ? worker_thread+0x9c/0xf0
[<ffffffff8024a620>] ? autoremove_wake_function+0x0/0x30
[<ffffffff8024a620>] ? autoremove_wake_function+0x0/0x30
[<ffffffff80246920>] ? worker_thread+0x0/0xf0
[<ffffffff80249f0c>] ? kthread+0x6c/0xa0
[<ffffffff8020d1c9>] ? child_rip+0xa/0x11
[<ffffffff8021b5f0>] ? lapic_next_event+0x0/0x10
[<ffffffff80249ea0>] ? kthread+0x0/0xa0
[<ffffffff8020d1bf>] ? child_rip+0x0/0x11
Code: Bad RIP value.
RIP [<0000000000000000>] 0x0
RSP <ffff880000b83d18>
CR2: 0000000000000000
---[ end trace 03b26540ec781e73 ]---
Any other suggestions?
Best
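For readers comparing the two traces: the number after "Oops:" is the x86
page-fault error code in hexadecimal (0003 in the 32-bit trace, 0000 in
the 64-bit one). The sketch below decodes its three low bits. The helper
names decode_pf_error() and print_pf_info() are made up for this
illustration, and higher bits of the error code are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Low three bits of the x86 page-fault error code. */
#define PF_PROT  0x1UL /* set: protection violation on a present page */
#define PF_WRITE 0x2UL /* set: the faulting access was a write */
#define PF_USER  0x4UL /* set: the access came from user mode */

struct pf_info {
    bool present; /* page was present (fault was a protection violation) */
    bool write;   /* access was a write rather than a read */
    bool user;    /* access was made in user mode */
};

/* Splits the error code into its three low flag bits. */
static struct pf_info decode_pf_error(unsigned long error_code)
{
    struct pf_info info = {
        .present = (error_code & PF_PROT) != 0,
        .write = (error_code & PF_WRITE) != 0,
        .user = (error_code & PF_USER) != 0,
    };
    return info;
}

/* Prints a one-line summary of the decoded error code. */
static void print_pf_info(const char *label, unsigned long error_code)
{
    struct pf_info info = decode_pf_error(error_code);

    assert(label != NULL);
    printf("%s: %s-mode %s, page %s\n", label,
           info.user ? "user" : "kernel",
           info.write ? "write" : "read",
           info.present ? "present" : "not present");
}
```

Calling print_pf_info("32-bit", 0x0003) and print_pf_info("64-bit", 0x0000)
reports a kernel-mode write to a present page and a kernel-mode read of a
page that is not present, respectively.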
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-21 6:39 ` aluno3
@ 2008-10-21 13:55 ` Mikulas Patocka
[not found] ` <48FDFF53.5080007@poczta.onet.pl>
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2008-10-21 13:55 UTC (permalink / raw)
To: aluno3@poczta.onet.pl; +Cc: device-mapper development
Hi
Please send me the files mm/mempool.o and drivers/md/*.o from the two
kernels that crashed with these oopses, so that I can see more precisely
where it happened.
Mikulas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
[not found] ` <48FDFF53.5080007@poczta.onet.pl>
@ 2008-10-21 17:22 ` Mikulas Patocka
2008-10-21 18:42 ` aluno3
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2008-10-21 17:22 UTC (permalink / raw)
To: aluno3; +Cc: device-mapper development
> Hi Mikulas
>
> I am sending you (as attachments) the files from both kernels. Thank you
> for your help.
>
> Best
Thanks. Does your workload involve just one snapshot of a given volume, or
do you have more snapshots of the same volume? Do you write only to origin
or to both origin and the snapshot or only to the snapshot?
I will send you a patch with more debug tests soon; just answer these
questions first so that I can prepare it.
Mikulas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-21 17:22 ` Mikulas Patocka
@ 2008-10-21 18:42 ` aluno3
2008-10-21 21:43 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: aluno3 @ 2008-10-21 18:42 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: device-mapper development
My workload in the 32-bit case involved only one volume and 3 snapshots of
this volume. I wrote only to the origin volume (using fstress), and I
created and deleted these snapshots from time to time. The snapshots were
mounted with the rw option.
In the 64-bit case I had 2 volumes and 10 snapshots for each volume. I
created and deleted these snapshots from time to time as well. I wrote only
to the origin volumes (using dd).
I don't remember exactly, but in a test with 1 snapshot per volume
everything worked correctly for a few hours, and then I stopped the test.
In the tests described above, the system worked correctly for only a few
minutes.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-21 18:42 ` aluno3
@ 2008-10-21 21:43 ` Mikulas Patocka
2008-10-22 13:37 ` aluno3
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2008-10-21 21:43 UTC (permalink / raw)
To: aluno3; +Cc: device-mapper development
Hi
Run the same workload with this patch applied (and keep the previous patch
applied as well).
Note that the patch is not correct: it leaks a little memory each time you
deactivate a snapshot, and it will cause an error if you rmmod
dm-snapshot. I just want to see whether you still get crashes with it.
Mikulas
---
drivers/md/dm-snap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: linux-2.6.27-clean/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.27-clean.orig/drivers/md/dm-snap.c 2008-10-21 23:38:49.000000000 +0200
+++ linux-2.6.27-clean/drivers/md/dm-snap.c 2008-10-21 23:39:09.000000000 +0200
@@ -736,12 +736,12 @@ static void snapshot_dtr(struct dm_targe
 
 	__free_exceptions(s);
 
-	mempool_destroy(s->pending_pool);
+	/*mempool_destroy(s->pending_pool);*/
 
 	dm_put_device(ti, s->origin);
 	dm_put_device(ti, s->cow);
 
-	kfree(s);
+	/*kfree(s);*/
 }
 
 /*
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-21 21:43 ` Mikulas Patocka
@ 2008-10-22 13:37 ` aluno3
2008-10-22 15:45 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: aluno3 @ 2008-10-22 13:37 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: device-mapper development
Hi
I applied your patch and ran the test with the same workload. After a few
hours of testing, everything is OK. Is that possible? The test is still
running. If I get any errors from the kernel, I will write to you again.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-22 13:37 ` aluno3
@ 2008-10-22 15:45 ` Mikulas Patocka
2008-10-22 16:39 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2008-10-22 15:45 UTC (permalink / raw)
To: aluno3@poczta.onet.pl; +Cc: device-mapper development
On Wed, 22 Oct 2008, aluno3@poczta.onet.pl wrote:
> Hi
>
> I applied your patch and ran the test with the same workload. After a few
> hours of testing, everything is OK. Is that possible? The test is still
> running. If I get any errors from the kernel, I will write to you again.
Hi
Good to hear that it works. Now try this: keep the first patch (it is this
one:
http://people.redhat.com/mpatocka/patches/kernel/2.6.27/dm-snapshot-fix-primary-pe-race.patch
I think Milan already sent it to you and you have it applied), undo the
second patch (the one that comments out the deallocation with /* */), and
apply the patch below. Then run the same test.
Mikulas
---
drivers/md/dm-snap.c | 10 +++++++++-
drivers/md/dm-snap.h | 2 ++
2 files changed, 11 insertions(+), 1 deletion(-)
Index: linux-2.6.27-clean/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.27-clean.orig/drivers/md/dm-snap.c 2008-10-22 15:41:24.000000000 +0200
+++ linux-2.6.27-clean/drivers/md/dm-snap.c 2008-10-22 15:51:33.000000000 +0200
@@ -368,6 +368,7 @@ static struct dm_snap_pending_exception
 	struct dm_snap_pending_exception *pe = mempool_alloc(s->pending_pool,
 							     GFP_NOIO);
 
+	atomic_inc(&s->n_pending_exceptions);
 	pe->snap = s;
 
 	return pe;
@@ -375,7 +376,10 @@ static struct dm_snap_pending_exception
 
 static void free_pending_exception(struct dm_snap_pending_exception *pe)
 {
-	mempool_free(pe, pe->snap->pending_pool);
+	struct struct dm_snapshot *s = pe->snap;
+	mempool_free(pe, s->pending_pool);
+	smp_mb__before_atomic_dec();
+	atomic_dec(&s->n_pending_exceptions);
 }
 
 static void insert_completed_exception(struct dm_snapshot *s,
@@ -601,6 +605,7 @@ static int snapshot_ctr(struct dm_target
 	s->valid = 1;
 	s->active = 0;
 	s->last_percent = 0;
+	atomic_set(&s->n_pending_exceptions, 0);
 	init_rwsem(&s->lock);
 	spin_lock_init(&s->pe_lock);
 	s->ti = ti;
@@ -727,6 +732,9 @@ static void snapshot_dtr(struct dm_targe
 	/* After this returns there can be no new kcopyd jobs. */
 	unregister_snapshot(s);
 
+	while (atomic_read(&s->n_pending_exceptions))
+		yield();
+
 #ifdef CONFIG_DM_DEBUG
 	for (i = 0; i < DM_TRACKED_CHUNK_HASH_SIZE; i++)
 		BUG_ON(!hlist_empty(&s->tracked_chunk_hash[i]));
Index: linux-2.6.27-clean/drivers/md/dm-snap.h
===================================================================
--- linux-2.6.27-clean.orig/drivers/md/dm-snap.h 2008-10-22 15:45:08.000000000 +0200
+++ linux-2.6.27-clean/drivers/md/dm-snap.h 2008-10-22 15:46:49.000000000 +0200
@@ -163,6 +163,8 @@ struct dm_snapshot {
 
 	mempool_t *pending_pool;
 
+	atomic_t n_pending_exceptions;
+
 	struct exception_table pending;
 	struct exception_table complete;
 
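The idea behind the patch above (count outstanding pending exceptions and
have the destructor wait until the count reaches zero) can be tried out in
userspace. The following is only a rough model under stated assumptions,
not kernel code: struct snap_model, struct pe_model, alloc_pe(), free_pe(),
drain() and run_model() are hypothetical names, C11 atomic_int replaces
atomic_t, sched_yield() replaces yield(), and the default sequentially
consistent atomics stand in for the explicit memory barrier.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dm_snapshot. */
struct snap_model {
    atomic_int n_pending; /* objects allocated but not yet released */
};

/* Hypothetical stand-in for struct dm_snap_pending_exception. */
struct pe_model {
    struct snap_model *snap;
};

static struct pe_model *alloc_pe(struct snap_model *s)
{
    struct pe_model *pe = malloc(sizeof(*pe));

    if (pe == NULL)
        return NULL;
    atomic_fetch_add(&s->n_pending, 1);
    pe->snap = s;
    return pe;
}

static void free_pe(struct pe_model *pe)
{
    /* Read the owner first: pe must not be touched after free(). */
    struct snap_model *s = pe->snap;

    free(pe);
    /* The count drops only after the object has been released. */
    atomic_fetch_sub(&s->n_pending, 1);
}

/* Worker thread: releases one object, as a completion callback would. */
static void *worker(void *arg)
{
    free_pe(arg);
    return NULL;
}

/* Destructor-side wait: returns only once nothing is outstanding. */
static void drain(struct snap_model *s)
{
    while (atomic_load(&s->n_pending) != 0)
        sched_yield();
}

/* Hands up to 8 objects to worker threads, then drains and joins them. */
static int run_model(struct snap_model *s, int n_workers)
{
    pthread_t tid[8];
    int started = 0;

    assert(n_workers >= 0 && n_workers <= 8);
    atomic_init(&s->n_pending, 0);
    for (int i = 0; i < n_workers; i++) {
        struct pe_model *pe = alloc_pe(s);

        if (pe == NULL)
            break;
        if (pthread_create(&tid[started], NULL, worker, pe) != 0) {
            free_pe(pe);
            break;
        }
        started++;
    }
    drain(s); /* after this, no worker will touch *s again */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started;
}
```

Build with -pthread where the toolchain needs it. Once drain() returns,
the counter is zero and the structure could be freed without a worker
still holding a pointer into it, which is the property the destructor in
the patch waits for.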
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-22 15:45 ` Mikulas Patocka
@ 2008-10-22 16:39 ` Mikulas Patocka
2008-10-23 11:30 ` aluno3
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2008-10-22 16:39 UTC (permalink / raw)
To: device-mapper development
Oh, sorry for the "struct struct" in free_pending_exception in the patch;
replace it with just one "struct". I forgot to refresh the patch before
sending it.
Mikulas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-22 16:39 ` Mikulas Patocka
@ 2008-10-23 11:30 ` aluno3
2008-10-23 13:40 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: aluno3 @ 2008-10-23 11:30 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: device-mapper development
I used dm-snapshot-fix-primary-pe-race.patch together with the last patch,
the one related to pending_exception. With the same test and workload,
everything has worked correctly so far. Is this the final patch?
Best, and thanks
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-23 11:30 ` aluno3
@ 2008-10-23 13:40 ` Mikulas Patocka
2008-11-19 8:31 ` aluno3
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2008-10-23 13:40 UTC (permalink / raw)
To: aluno3@poczta.onet.pl; +Cc: device-mapper development
On Thu, 23 Oct 2008, aluno3@poczta.onet.pl wrote:
> I used dm-snapshot-fix-primary-pe-race.patch and last patch related with
> pending_exception.After the same test and workload everything work
> correctly so far.Is it final patch?
Yes, these two patches are expected to be the final fix. Thanks for the
testing. If you get some more crashes even with these two, write about
them.
Mikulas
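As a rough illustration of the idea behind the pending-exception patch discussed in this thread (count every pending exception when it is allocated, drop the count when it is freed, and have the destructor wait until the count reaches zero), here is a minimal userspace sketch in C11. It is not the kernel code: struct snapshot, struct pending, alloc_pending, free_pending and snapshot_wait_idle are hypothetical names, malloc/free stand in for the mempool, and sched_yield() stands in for the kernel's yield().

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dm_snapshot: only the counter matters here. */
struct snapshot {
	atomic_int n_pending;
};

/* Hypothetical stand-in for struct dm_snap_pending_exception. */
struct pending {
	struct snapshot *snap;
};

/* Allocate a pending object and account for it on the owning snapshot. */
static struct pending *alloc_pending(struct snapshot *s)
{
	struct pending *pe = malloc(sizeof(*pe));

	if (!pe)
		return NULL;
	atomic_fetch_add(&s->n_pending, 1);
	pe->snap = s;
	return pe;
}

/* Free a pending object; the owner pointer is read before the object is freed. */
static void free_pending(struct pending *pe)
{
	struct snapshot *s = pe->snap;

	free(pe);
	/*
	 * Sequentially consistent decrement: the free above is ordered before
	 * the new count becomes visible, which is the role the explicit
	 * barrier plays in the patch.
	 */
	atomic_fetch_sub(&s->n_pending, 1);
}

/* Teardown helper: spin politely until no pending objects remain. */
static void snapshot_wait_idle(struct snapshot *s)
{
	while (atomic_load(&s->n_pending))
		sched_yield();
}
```

A destructor built this way would call snapshot_wait_idle() before releasing the pool that the pending objects came from, so that no late completion can touch freed memory. A busy-wait with yield is the simplest form of this; a wait queue or completion would avoid spinning, at the cost of more code.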
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-10-23 13:40 ` Mikulas Patocka
@ 2008-11-19 8:31 ` aluno3
2008-11-24 10:52 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: aluno3 @ 2008-11-19 8:31 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: device-mapper development
Hi
I tested kernel 2.6.27.6 with patches from 2.6.28-rc ("wait for chunks in destructor", "fix register_snapshot deadlock") and found another problem with the kernel and dm, although it reproduces very rarely. I got this call trace:
Pid: 26230, comm: kcopyd Not tainted (2.6.27.6 #36)
EIP: 0060:[<c044d485>] EFLAGS: 00010282 CPU: 1
EIP is at remove_exception+0x5/0x20
EAX: ca3b5908 EBX: ca3b5908 ECX: 00200200 EDX: 00100100
ESI: f7b489f8 EDI: e92ad980 EBP: 00000000 ESP: f29c7ec0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process kcopyd (pid: 26230, ti=f29c6000 task=e8512430 task.ti=f29c6000)
Stack: c044e03f 0000000d 00000000 c85948c0 00000000 c044f2e7 0009bc30 00000000
0000e705 00000000 e8e41288 e92ad980 00000000 c044e0f0 e8e41288 c7800ec8
00000000 c0449224 00000000 c7800fb4 00000400 00000000 00000000 f2bdfbb0
Call Trace:
[<c044e03f>] pending_complete+0x9f/0x110
[<c044f2e7>] persistent_commit+0xc7/0x110
[<c044e0f0>] copy_callback+0x30/0x40
[<c0449224>] segment_complete+0x154/0x1d0
[<c0448e55>] run_complete_job+0x45/0x80
[<c04490d0>] segment_complete+0x0/0x1d0
[<c0448e10>] run_complete_job+0x0/0x80
[<c0449014>] process_jobs+0x14/0x70
[<c0449070>] do_work+0x0/0x40
[<c0449086>] do_work+0x16/0x40
[<c013502d>] run_workqueue+0x4d/0xf0
[<c013514d>] worker_thread+0x7d/0xc0
[<c01382e0>] autoremove_wake_function+0x0/0x30
[<c0526583>] __sched_text_start+0x1e3/0x4a0
[<c01382e0>] autoremove_wake_function+0x0/0x30
[<c0121a2b>] complete+0x2b/0x40
[<c01350d0>] worker_thread+0x0/0xc0
[<c0137db4>] kthread+0x44/0x70
[<c0137d70>] kthread+0x0/0x70
[<c0104c57>] kernel_thread_helper+0x7/0x10
=======================
Code: 4b 0c e8 cf ff ff ff 8b 56 08 8d 04 c2 8b 10 89 13 89 18 89 5a 04
89 43 04 5b 5e c3 8d 76 00 8d bc 27 00 00 00 00 8b 48 04 8b 10 <89> 11
89 4a 04 c7 00 00 01 10 00 c7 40 04 00 02 20 00 c3 90 8d
EIP: [<c044d485>] remove_exception+0x5/0x20 SS:ESP 0068:f29c7ec0
---[ end trace 834a1d3742a1be05 ]---
addr2line returned include/linux/list.h:93 for EIP c044d485:
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
	next->prev = prev;
	prev->next = next; // line 93
}
A few weeks ago I got a similar call trace with a plain 2.6.27 kernel and
the patches from this mail thread:
BUG: unable to handle kernel paging request at 00200200
IP: [<c044bf65>] remove_exception+0x5/0x20
*pdpt = 0000000029acc001 *pde = 0000000000000000
Oops: 0002 [#1] SMP
Modules linked in: iscsi_trgt mptctl mptbase st sg drbd bonding
iscsi_tcp libiscsi scsi_transport_iscsi aacraid sata_nv forcedeth button
ftdi_sio usbserial
Pid: 31375, comm: kcopyd Not tainted (2.6.27 #21)
EIP: 0060:[<c044bf65>] EFLAGS: 00010282 CPU: 1
EIP is at remove_exception+0x5/0x20
EAX: f276da88 EBX: f276da88 ECX: 00200200 EDX: 00100100
ESI: c79a4a58 EDI: c9268cc0 EBP: 00000000 ESP: ecbcbec0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process kcopyd (pid: 31375, ti=ecbca000 task=e6d9d220 task.ti=ecbca000)
Stack: c044cb1f 0000000e 00000000 c916b480 00000000 c044dde3 00018f47 00000000
00002870 00000000 c70cba48 c9268cc0 00000000 c044cbd0 c70cba48 c720aec8
00000000 c0447d04 00000000 c720afb4 00000400 00000000 00000000 efc65580
Call Trace:
[<c044cb1f>] pending_complete+0x9f/0x110
[<c044dde3>] persistent_commit+0xe3/0x110
[<c044cbd0>] copy_callback+0x30/0x40
[<c0447d04>] segment_complete+0x154/0x1d0
[<c0447935>] run_complete_job+0x45/0x80
[<c0447bb0>] segment_complete+0x0/0x1d0
[<c04478f0>] run_complete_job+0x0/0x80
[<c0447af4>] process_jobs+0x14/0x70
[<c0447b50>] do_work+0x0/0x40
[<c0447b66>] do_work+0x16/0x40
[<c013509d>] run_workqueue+0x4d/0xf0
[<c01351bd>] worker_thread+0x7d/0xc0
[<c0138350>] autoremove_wake_function+0x0/0x30
[<c0524f2c>] __sched_text_start+0x1ec/0x4b0
[<c0138350>] autoremove_wake_function+0x0/0x30
[<c0121a9b>] complete+0x2b/0x40
[<c0135140>] worker_thread+0x0/0xc0
[<c0137e24>] kthread+0x44/0x70
[<c0137de0>] kthread+0x0/0x70
[<c0104c57>] kernel_thread_helper+0x7/0x10
=======================
Code: 4b 0c e8 cf ff ff ff 8b 56 08 8d 04 c2 8b 10 89 13 89 18 89 5a 04
89 43 04 5b 5e c3 8d 76 00 8d bc 27 00 00 00 00 8b 48 04 8b 10 <89> 11
89 4a 04 c7 00 00 01 10 00 c7 40 04 00 02 20 00 c3 90 8d
EIP: [<c044bf65>] remove_exception+0x5/0x20 SS:ESP 0068:ecbcbec0
---[ end trace 25afcedfe7eb0a2b ]---
Is this a known problem or something new? Thanks
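One possible reading of the two register dumps above, offered as an interpretation rather than a confirmed diagnosis: ECX and EDX hold 00200200 and 00100100, which are the values of LIST_POISON2 and LIST_POISON1 on 32-bit kernels of this era, and the second oops faults on a write to 00200200. list_del() stores those values into an entry's prev and next pointers after unlinking it, so the trace is consistent with remove_exception() being called on an entry that had already been removed from its list. The userspace sketch below mimics that behaviour; init_list_head and list_entry_is_poisoned are hypothetical helpers for illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Poison values as defined for 32-bit kernels in include/linux/poison.h. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

/* Hypothetical helper: make an empty circular list. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink the entry, then poison its pointers, as the kernel's list_del() does. */
static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

/*
 * Hypothetical helper: true if the entry has already been through list_del().
 * Calling list_del() on such an entry would write through LIST_POISON2,
 * which is the kind of fault address shown in the oops.
 */
static int list_entry_is_poisoned(const struct list_head *entry)
{
	return entry->next == LIST_POISON1 && entry->prev == LIST_POISON2;
}
```

After one list_del(), the entry's prev pointer is 0x00200200 and its next pointer is 0x00100100, matching the ECX and EDX values in both dumps. A second list_del() on the same entry would then attempt prev->next = next, that is, a store to address 0x00200200.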
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-11-19 8:31 ` aluno3
@ 2008-11-24 10:52 ` Mikulas Patocka
2008-11-26 7:38 ` aluno3
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2008-11-24 10:52 UTC (permalink / raw)
To: aluno3@poczta.onet.pl; +Cc: device-mapper development
Hi
This was supposed to be fixed with "dm snapshot: fix primary_pe race"
patch in 2.6.27.4. Are you sure that you really see it on 2.6.27.6? If so,
it looks like the bug wasn't fixed yet.
How often does it happen? Do you have some reproducible scenario for this
bug?
Mikulas
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-11-24 10:52 ` Mikulas Patocka
@ 2008-11-26 7:38 ` aluno3
2008-11-28 7:28 ` aluno3
0 siblings, 1 reply; 17+ messages in thread
From: aluno3 @ 2008-11-26 7:38 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: device-mapper development
Hi
Yes, I am sure that I used 2.6.27.6; I checked it again.
When I tested 2.6.27 without "dm snapshot: fix primary_pe race", I could
trigger the call trace within a few minutes. With "dm snapshot: fix
primary_pe race" applied, triggering it is very hard and sometimes takes
several days of testing. I don't have a reproducible scenario :(. I tested
the kernel with Bacula, rsync, LVM and snapshots running together under very
heavy load. It happened only twice in four weeks of testing. How can I help
in this case? Thanks and best regards
Mikulas Patocka wrote:
> Hi
>
> This was supposed to be fixed with "dm snapshot: fix primary_pe race"
> patch in 2.6.27.4. Are you sure that you really see it on 2.6.27.6? If so,
> it looks like the bug wasn't fixed yet.
>
> How often does it happen? Do you have some reproducible scenario for this
> bug?
>
> Mikulas
>
>
>> Hi
>>
>> I tested kernel 2.6.27.6 with patch from 2.6.28rc (wait for chunks in destructor,fix register_snapshot deadlock,) and I identified next problem with kernel and dm but repeatability this problem is very small.I got call trace:
>>
>>
>> Pid: 26230, comm: kcopyd Not tainted (2.6.27.6 #36)
>> EIP: 0060:[<c044d485>] EFLAGS: 00010282 CPU: 1
>> EIP is at remove_exception+0x5/0x20
>> EAX: ca3b5908 EBX: ca3b5908 ECX: 00200200 EDX: 00100100
>> ESI: f7b489f8 EDI: e92ad980 EBP: 00000000 ESP: f29c7ec0
>> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>> Process kcopyd (pid: 26230, ti=f29c6000 task=e8512430 task.ti=f29c6000)
>> Stack: c044e03f 0000000d 00000000 c85948c0 00000000 c044f2e7 0009bc30
>> 00000000
>> 0000e705 00000000 e8e41288 e92ad980 00000000 c044e0f0 e8e41288
>> c7800ec8
>> 00000000 c0449224 00000000 c7800fb4 00000400 00000000 00000000
>> f2bdfbb0
>> Call Trace:
>> [<c044e03f>] pending_complete+0x9f/0x110
>> [<c044f2e7>] persistent_commit+0xc7/0x110
>> [<c044e0f0>] copy_callback+0x30/0x40
>> [<c0449224>] segment_complete+0x154/0x1d0
>> [<c0448e55>] run_complete_job+0x45/0x80
>> [<c04490d0>] segment_complete+0x0/0x1d0
>> [<c0448e10>] run_complete_job+0x0/0x80
>> [<c0449014>] process_jobs+0x14/0x70
>> [<c0449070>] do_work+0x0/0x40
>> [<c0449086>] do_work+0x16/0x40
>> [<c013502d>] run_workqueue+0x4d/0xf0
>> [<c013514d>] worker_thread+0x7d/0xc0
>> [<c01382e0>] autoremove_wake_function+0x0/0x30
>> [<c0526583>] __sched_text_start+0x1e3/0x4a0
>> [<c01382e0>] autoremove_wake_function+0x0/0x30
>> [<c0121a2b>] complete+0x2b/0x40
>> [<c01350d0>] worker_thread+0x0/0xc0
>> [<c0137db4>] kthread+0x44/0x70
>> [<c0137d70>] kthread+0x0/0x70
>> [<c0104c57>] kernel_thread_helper+0x7/0x10
>> =======================
>> Code: 4b 0c e8 cf ff ff ff 8b 56 08 8d 04 c2 8b 10 89 13 89 18 89 5a 04
>> 89 43 04 5b 5e c3 8d 76 00 8d bc 27 00 00 00 00 8b 48 04 8b 10 <89> 11
>> 89 4a 04 c7 00 00 01 10 00 c7 40 04 00 02 20 00 c3 90 8d
>> EIP: [<c044d485>] remove_exception+0x5/0x20 SS:ESP 0068:f29c7ec0
>> ---[ end trace 834a1d3742a1be05 ]---
>>
>>
>>
>> addr2line returned include/linux/list.h:93 for EIP c044d485:
>>
>>
>> static inline void __list_del(struct list_head * prev, struct list_head
>> * next)
>> {
>> next->prev = prev;
>> prev->next = next; //line 93
>> }
>>
>>
>>
>>
>> A few weeks ago I got similar call trace with plain kernel 2.6.27 and
>> patches from mail thread:
>>
>>
>> BUG: unable to handle kernel paging request at 00200200
>> IP: [<c044bf65>] remove_exception+0x5/0x20
>> *pdpt = 0000000029acc001 *pde = 0000000000000000
>> Oops: 0002 [#1] SMP
>> Modules linked in: iscsi_trgt mptctl mptbase st sg drbd bonding
>> iscsi_tcp libiscsi scsi_transport_iscsi aacraid sata_nv forcedeth button
>> ftdi_sio usbserial
>>
>> Pid: 31375, comm: kcopyd Not tainted (2.6.27 #21)
>> EIP: 0060:[<c044bf65>] EFLAGS: 00010282 CPU: 1
>> EIP is at remove_exception+0x5/0x20
>> EAX: f276da88 EBX: f276da88 ECX: 00200200 EDX: 00100100
>> ESI: c79a4a58 EDI: c9268cc0 EBP: 00000000 ESP: ecbcbec0
>> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>> Process kcopyd (pid: 31375, ti=ecbca000 task=e6d9d220 task.ti=ecbca000)
>> Stack: c044cb1f 0000000e 00000000 c916b480 00000000 c044dde3 00018f47
>> 00000000
>> 00002870 00000000 c70cba48 c9268cc0 00000000 c044cbd0 c70cba48
>> c720aec8
>> 00000000 c0447d04 00000000 c720afb4 00000400 00000000 00000000
>> efc65580
>> Call Trace:
>> [<c044cb1f>] pending_complete+0x9f/0x110
>> [<c044dde3>] persistent_commit+0xe3/0x110
>> [<c044cbd0>] copy_callback+0x30/0x40
>> [<c0447d04>] segment_complete+0x154/0x1d0
>> [<c0447935>] run_complete_job+0x45/0x80
>> [<c0447bb0>] segment_complete+0x0/0x1d0
>> [<c04478f0>] run_complete_job+0x0/0x80
>> [<c0447af4>] process_jobs+0x14/0x70
>> [<c0447b50>] do_work+0x0/0x40
>> [<c0447b66>] do_work+0x16/0x40
>> [<c013509d>] run_workqueue+0x4d/0xf0
>> [<c01351bd>] worker_thread+0x7d/0xc0
>> [<c0138350>] autoremove_wake_function+0x0/0x30
>> [<c0524f2c>] __sched_text_start+0x1ec/0x4b0
>> [<c0138350>] autoremove_wake_function+0x0/0x30
>> [<c0121a9b>] complete+0x2b/0x40
>> [<c0135140>] worker_thread+0x0/0xc0
>> [<c0137e24>] kthread+0x44/0x70
>> [<c0137de0>] kthread+0x0/0x70
>> [<c0104c57>] kernel_thread_helper+0x7/0x10
>> =======================
>> Code: 4b 0c e8 cf ff ff ff 8b 56 08 8d 04 c2 8b 10 89 13 89 18 89 5a 04
>> 89 43 04 5b 5e c3 8d 76 00 8d bc 27 00 00 00 00 8b 48 04 8b 10 <89> 11
>> 89 4a 04 c7 00 00 01 10 00 c7 40 04 00 02 20 00 c3 90 8d
>> EIP: [<c044bf65>] remove_exception+0x5/0x20 SS:ESP 0068:ecbcbec0
>> ---[ end trace 25afcedfe7eb0a2b ]---
>>
>> Is this known problem or something new? Thanks
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-11-26 7:38 ` aluno3
@ 2008-11-28 7:28 ` aluno3
2008-12-02 2:10 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: aluno3 @ 2008-11-28 7:28 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: device-mapper development
Hi

More information about the problem:

In my previous mail I wrote that I could trigger the call trace within a few
minutes, but that was of course without the "dm snapshot: fix primary_pe race"
and "wait for chunks in destructor" patches, and that call trace (the "after a
few minutes" one) relates to the "wait for chunks in destructor" problem.

But yesterday I got another call trace:

Pid: 25597, comm: kcopyd Not tainted (2.6.27.7 #47)
EIP: 0060:[<c044d495>] EFLAGS: 00010282 CPU: 1
EIP is at remove_exception+0x5/0x20
EAX: c7b3d348 EBX: c7b3d348 ECX: 00200200 EDX: 00100100
ESI: c799d770 EDI: ea6a99c0 EBP: 00000000 ESP: f26b5ec0
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process kcopyd (pid: 25597, ti=f26b4000 task=f09e11f0 task.ti=f26b4000)
Stack: c044e04f 0000000c 00000000 f1980c40 00000000 c044f313 001396a7 00000000
000013ba 00000000 c7b3dd88 ea6a99c0 00000000 c044e100 c7b3dd88 f30f7aa8
00000000 c0449234 00000000 f30f7b94 00000400 00000000 00000000 ea53fcb8
Call Trace:
[<c044e04f>] pending_complete+0x9f/0x110
[<c044f313>] persistent_commit+0xe3/0x110
[<c044e100>] copy_callback+0x30/0x40
[<c0449234>] segment_complete+0x154/0x1d0
[<c0448e65>] run_complete_job+0x45/0x80
[<c04490e0>] segment_complete+0x0/0x1d0
[<c0448e20>] run_complete_job+0x0/0x80
[<c0449024>] process_jobs+0x14/0x70
[<c0449080>] do_work+0x0/0x40
[<c0449096>] do_work+0x16/0x40
[<c013502d>] run_workqueue+0x4d/0xf0
[<c013514d>] worker_thread+0x7d/0xc0
[<c01382e0>] autoremove_wake_function+0x0/0x30
[<c0526543>] __sched_text_start+0x1e3/0x4a0
[<c01382e0>] autoremove_wake_function+0x0/0x30
[<c0121a2b>] complete+0x2b/0x40
[<c01350d0>] worker_thread+0x0/0xc0
[<c0137db4>] kthread+0x44/0x70
[<c0137d70>] kthread+0x0/0x70
[<c0104c57>] kernel_thread_helper+0x7/0x10
=======================
Code: 4b 0c e8 cf ff ff ff 8b 56 08 8d 04 c2 8b 10 89 13 89 18 89 5a 04 89 43
04 5b 5e c3 8d 76 00 8d bc 27 00 00 00 00 8b 48 04 8b 10 <89> 11 89 4a 04 c7
00 00 01 10 00 c7 40 04 00 02 20 00 c3 90 8d
EIP: [<c044d495>] remove_exception+0x5/0x20 SS:ESP 0068:f26b5ec0
---[ end trace 8a6182ef9a00114f ]---
root@53434231:~# uname -a
Linux 53434231 2.6.27.7 #47 SMP Tue Nov 25 08:57:37 CET
2008 i686 GNU/Linux
root@53434231:~# addr2line -e ./vmlinux c044d495
include/linux/list.h:93
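
The register values in the trace above (ECX: 00200200, EDX: 00100100) match the list poison constants LIST_POISON2 and LIST_POISON1 from include/linux/poison.h, which list_del() writes into an entry after unlinking it. That would be consistent with remove_exception() being called on an exception that had already been removed from its hash list once, although the trace alone does not prove it. The following is only a user-space sketch of that mechanism, not kernel code; already_deleted() is a hypothetical helper introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model of the kernel's doubly linked list (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

/* Poison values as defined in include/linux/poison.h in 2.6.27. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

/* Make an empty circular list: the head points at itself. */
static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Mirrors list_del(): unlink the entry, then poison its stale pointers. */
static void list_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;	/* the store that faulted in the oops */
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

/*
 * Hypothetical helper: returns 1 if the entry carries both poison values,
 * i.e. a second list_del() on it would store through a poison address.
 */
static int already_deleted(const struct list_head *entry)
{
	return entry->next == LIST_POISON1 && entry->prev == LIST_POISON2;
}
```

In this model, a second list_del() on the same entry would load prev = 0x00200200 and next = 0x00100100 and then store through the poison address, which is the pattern the faulting instruction and registers show.
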
git log for my drivers/md/ shows, in order:
dm snapshot: wait for chunks in destructor
dm snapshot: fix register_snapshot deadlock
dm snapshot: drop unused last_percent
dm raid1: flush workqueue before destruction
md: fix bug in raid10 recovery.
md: linear: Fix a division by zero bug for very small arrays.
dm snapshot: fix primary_pe race
dm kcopyd: avoid queue shuffle
md: Fix rdev_size_store with size == 0
...
aluno3@poczta.onet.pl wrote:
> Hi
>
>
> Yes, I am sure that I used 2.6.27.6. I checked it again.
>
> When I tested 2.6.27 without "dm snapshot: fix primary_pe race" I
> could trigger the call trace after a few minutes, but with "dm snapshot: fix
> primary_pe race" applied it is very hard to trigger, sometimes taking a
> few days of testing. I don't have a reproducible scenario :(. I tested the
> kernel with Bacula, rsync, LVM and snapshots together under very heavy load.
> It happened only 2 times in four weeks of testing. How can I help in this
> case? Thanks and best
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Calltrace in dm-snapshot in 2.6.27 kernel
2008-11-28 7:28 ` aluno3
@ 2008-12-02 2:10 ` Mikulas Patocka
0 siblings, 0 replies; 17+ messages in thread
From: Mikulas Patocka @ 2008-12-02 2:10 UTC (permalink / raw)
To: aluno3@poczta.onet.pl; +Cc: device-mapper development
Hi

I don't really know what is causing it. Just try this patch. It removes a
lot of dirty logic from dm-snapshots (you might need to hand-edit the
patch a little, depending on the other patches that you have applied).

Mikulas

dm-snapshot-rework-origin-write.patch:
Rework writing to snapshot origin.
The previous code selected one exception as "primary_pe", linked all other
exceptions on it and used reference counting to wait until all exceptions are
reallocated. This didn't work with exceptions with different chunk sizes:
https://bugzilla.redhat.com/show_bug.cgi?id=182659
I removed all the complexity of exception linking and reference counting.
With this patch, the bio is linked to one exception, and when that exception
is reallocated the bio is retried, so that it can wait for other exceptions
if any are still pending.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
drivers/md/dm-snap.c | 174 +++++++++++++++++----------------------------------
1 file changed, 61 insertions(+), 113 deletions(-)
Index: linux-2.6.28-rc5-devel/drivers/md/dm-snap.c
===================================================================
--- linux-2.6.28-rc5-devel.orig/drivers/md/dm-snap.c 2008-11-25 16:10:37.000000000 +0100
+++ linux-2.6.28-rc5-devel/drivers/md/dm-snap.c 2008-11-25 16:10:42.000000000 +0100
@@ -56,28 +56,6 @@ struct dm_snap_pending_exception {
struct bio_list origin_bios;
struct bio_list snapshot_bios;
- /*
- * Short-term queue of pending exceptions prior to submission.
- */
- struct list_head list;
-
- /*
- * The primary pending_exception is the one that holds
- * the ref_count and the list of origin_bios for a
- * group of pending_exceptions. It is always last to get freed.
- * These fields get set up when writing to the origin.
- */
- struct dm_snap_pending_exception *primary_pe;
-
- /*
- * Number of pending_exceptions processing this chunk.
- * When this drops to zero we must complete the origin bios.
- * If incrementing or decrementing this, hold pe->snap->lock for
- * the sibling concerned and not pe->primary_pe->snap->lock unless
- * they are the same.
- */
- atomic_t ref_count;
-
/* Pointer back to snapshot context */
struct dm_snapshot *snap;
@@ -758,6 +736,28 @@ static void flush_bios(struct bio *bio)
}
}
+static int do_origin(struct dm_dev *origin, struct bio *bio);
+
+/*
+ * Flush a list of buffers.
+ */
+static void retry_origin_bios(struct dm_snapshot *s, struct bio *bio)
+{
+ struct bio *n;
+ int r;
+
+ while (bio) {
+ n = bio->bi_next;
+ bio->bi_next = NULL;
+ r = do_origin(s->origin, bio);
+ if (r == DM_MAPIO_REMAPPED)
+ generic_make_request(bio);
+ else
+ BUG_ON(r != DM_MAPIO_SUBMITTED);
+ bio = n;
+ }
+}
+
/*
* Error a list of buffers.
*/
@@ -791,39 +791,6 @@ static void __invalidate_snapshot(struct
dm_table_event(s->ti->table);
}
-static void get_pending_exception(struct dm_snap_pending_exception *pe)
-{
- atomic_inc(&pe->ref_count);
-}
-
-static struct bio *put_pending_exception(struct dm_snap_pending_exception *pe)
-{
- struct dm_snap_pending_exception *primary_pe;
- struct bio *origin_bios = NULL;
-
- primary_pe = pe->primary_pe;
-
- /*
- * If this pe is involved in a write to the origin and
- * it is the last sibling to complete then release
- * the bios for the original write to the origin.
- */
- if (primary_pe &&
- atomic_dec_and_test(&primary_pe->ref_count)) {
- origin_bios = bio_list_get(&primary_pe->origin_bios);
- free_pending_exception(primary_pe);
- }
-
- /*
- * Free the pe if it's not linked to an origin write or if
- * it's not itself a primary pe.
- */
- if (!primary_pe || primary_pe != pe)
- free_pending_exception(pe);
-
- return origin_bios;
-}
-
static void pending_complete(struct dm_snap_pending_exception *pe, int success)
{
struct dm_snap_exception *e;
@@ -872,7 +839,8 @@ static void pending_complete(struct dm_s
out:
remove_exception(&pe->e);
snapshot_bios = bio_list_get(&pe->snapshot_bios);
- origin_bios = put_pending_exception(pe);
+ origin_bios = bio_list_get(&pe->origin_bios);
+ free_pending_exception(pe);
up_write(&s->lock);
@@ -882,7 +850,7 @@ static void pending_complete(struct dm_s
else
flush_bios(snapshot_bios);
- flush_bios(origin_bios);
+ retry_origin_bios(s, origin_bios);
}
static void commit_callback(void *context, int success)
@@ -944,11 +912,11 @@ static void start_copy(struct dm_snap_pe
* this.
*/
static struct dm_snap_pending_exception *
-__find_pending_exception(struct dm_snapshot *s, struct bio *bio)
+__find_pending_exception(struct dm_snapshot *s, sector_t sector)
{
struct dm_snap_exception *e;
struct dm_snap_pending_exception *pe;
- chunk_t chunk = sector_to_chunk(s, bio->bi_sector);
+ chunk_t chunk = sector_to_chunk(s, sector);
/*
* Is there a pending exception for this already ?
@@ -983,8 +951,6 @@ __find_pending_exception(struct dm_snaps
pe->e.old_chunk = chunk;
bio_list_init(&pe->origin_bios);
bio_list_init(&pe->snapshot_bios);
- pe->primary_pe = NULL;
- atomic_set(&pe->ref_count, 0);
pe->started = 0;
if (s->store.prepare_exception(&s->store, &pe->e)) {
@@ -992,7 +958,6 @@ __find_pending_exception(struct dm_snaps
return NULL;
}
- get_pending_exception(pe);
insert_exception(&s->pending, &pe->e);
out:
@@ -1046,7 +1011,7 @@ static int snapshot_map(struct dm_target
* writeable.
*/
if (bio_rw(bio) == WRITE) {
- pe = __find_pending_exception(s, bio);
+ pe = __find_pending_exception(s, bio->bi_sector);
if (!pe) {
__invalidate_snapshot(s, -ENOMEM);
r = -EIO;
@@ -1140,14 +1105,20 @@ static int snapshot_status(struct dm_tar
/*-----------------------------------------------------------------
* Origin methods
*---------------------------------------------------------------*/
-static int __origin_write(struct list_head *snapshots, struct bio *bio)
+
+/*
+ * Returns:
+ * DM_MAPIO_REMAPPED: bio may be submitted to origin device
+ * DM_MAPIO_SUBMITTED: bio was queued on queue on one of exceptions
+ */
+
+static int __origin_write(struct list_head *snapshots, sector_t sector, struct bio *bio)
{
- int r = DM_MAPIO_REMAPPED, first = 0;
+ int r = DM_MAPIO_REMAPPED;
struct dm_snapshot *snap;
struct dm_snap_exception *e;
- struct dm_snap_pending_exception *pe, *next_pe, *primary_pe = NULL;
+ struct dm_snap_pending_exception *pe, *pe_to_start = NULL;
chunk_t chunk;
- LIST_HEAD(pe_queue);
/* Do all the snapshots on this origin */
list_for_each_entry (snap, snapshots, list) {
@@ -1159,86 +1130,63 @@ static int __origin_write(struct list_he
goto next_snapshot;
/* Nothing to do if writing beyond end of snapshot */
- if (bio->bi_sector >= dm_table_get_size(snap->ti->table))
+ if (sector >= dm_table_get_size(snap->ti->table))
goto next_snapshot;
/*
* Remember, different snapshots can have
* different chunk sizes.
*/
- chunk = sector_to_chunk(snap, bio->bi_sector);
+ chunk = sector_to_chunk(snap, sector);
/*
* Check exception table to see if block
* is already remapped in this snapshot
* and trigger an exception if not.
- *
- * ref_count is initialised to 1 so pending_complete()
- * won't destroy the primary_pe while we're inside this loop.
*/
e = lookup_exception(&snap->complete, chunk);
if (e)
goto next_snapshot;
- pe = __find_pending_exception(snap, bio);
+ pe = __find_pending_exception(snap, sector);
if (!pe) {
__invalidate_snapshot(snap, -ENOMEM);
goto next_snapshot;
}
- if (!primary_pe) {
- /*
- * Either every pe here has same
- * primary_pe or none has one yet.
- */
- if (pe->primary_pe)
- primary_pe = pe->primary_pe;
- else {
- primary_pe = pe;
- first = 1;
- }
-
- bio_list_add(&primary_pe->origin_bios, bio);
-
- r = DM_MAPIO_SUBMITTED;
- }
+ r = DM_MAPIO_SUBMITTED;
- if (!pe->primary_pe) {
- pe->primary_pe = primary_pe;
- get_pending_exception(primary_pe);
+ if (bio) {
+ bio_list_add(&pe->origin_bios, bio);
+ bio = NULL;
+
+ if (!pe->started) {
+ pe->started = 1;
+ pe_to_start = pe;
+ }
}
if (!pe->started) {
pe->started = 1;
- list_add_tail(&pe->list, &pe_queue);
+ start_copy(pe);
}
next_snapshot:
up_write(&snap->lock);
}
- if (!primary_pe)
- return r;
-
/*
- * If this is the first time we're processing this chunk and
- * ref_count is now 1 it means all the pending exceptions
- * got completed while we were in the loop above, so it falls to
- * us here to remove the primary_pe and submit any origin_bios.
+ * pe_to_start is a small performance improvement:
+ * To avoid calling __origin_write N times for N snapshots, we start
+ * the snapshot where we queued the bio as the last one.
+ *
+ * If we start it as the last one, it finishes most likely as the last
+ * one and exceptions in other snapshots will be already finished when
+ * the bio will be retried.
*/
- if (first && atomic_dec_and_test(&primary_pe->ref_count)) {
- flush_bios(bio_list_get(&primary_pe->origin_bios));
- free_pending_exception(primary_pe);
- /* If we got here, pe_queue is necessarily empty. */
- return r;
- }
-
- /*
- * Now that we have a complete pe list we can start the copying.
- */
- list_for_each_entry_safe(pe, next_pe, &pe_queue, list)
- start_copy(pe);
+ if (pe_to_start)
+ start_copy(pe_to_start);
return r;
}
@@ -1254,7 +1202,7 @@ static int do_origin(struct dm_dev *orig
down_read(&_origins_lock);
o = __lookup_origin(origin->bdev);
if (o)
- r = __origin_write(&o->snapshots, bio);
+ r = __origin_write(&o->snapshots, bio->bi_sector, bio);
up_read(&_origins_lock);
return r;
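
The retry scheme described in the patch header can be modelled in user space as follows. This is only a sketch under simplifying assumptions (a single chunk, no locking, no real I/O), not the patch's code: struct snap_model, origin_write_model() and complete_and_retry() are hypothetical names, and the MAPIO_* constants stand in for DM_MAPIO_SUBMITTED and DM_MAPIO_REMAPPED.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for DM_MAPIO_SUBMITTED and DM_MAPIO_REMAPPED. */
enum { MAPIO_SUBMITTED = 0, MAPIO_REMAPPED = 1 };

/* Hypothetical per-snapshot state for one chunk of the origin. */
struct snap_model {
	int completed;	/* chunk has already been copied to this snapshot */
	int started;	/* a copy for this chunk is in flight */
	int holds_bio;	/* the origin write is queued on this exception */
};

/*
 * Models __origin_write(): start a copy in every snapshot that still
 * needs one, but queue the write on only the first such snapshot.
 */
static int origin_write_model(struct snap_model *snaps, size_t n)
{
	int r = MAPIO_REMAPPED;
	int queued = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (snaps[i].completed)
			continue;
		r = MAPIO_SUBMITTED;
		if (!queued) {
			snaps[i].holds_bio = 1;
			queued = 1;
		}
		snaps[i].started = 1;
	}
	return r;
}

/*
 * Models pending_complete() followed by retry_origin_bios() for snapshot i.
 * Returns the result of the retried write, or -1 if no write was queued
 * on this snapshot's exception.
 */
static int complete_and_retry(struct snap_model *snaps, size_t n, size_t i)
{
	int had_bio = snaps[i].holds_bio;

	snaps[i].completed = 1;
	snaps[i].started = 0;
	snaps[i].holds_bio = 0;
	return had_bio ? origin_write_model(snaps, n) : -1;
}
```

With two snapshots, completing the exception that holds the write while the other is still copying makes the retry queue the write again on the remaining exception; completing the other exception first lets a single retry go straight to the origin, which is the case the pe_to_start ordering in the patch tries to make more likely.
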
On Fri, 28 Nov 2008, aluno3@poczta.onet.pl wrote:
> Hi
>
> More information about the problem:
>
> In my previous mail I wrote that I could trigger the call trace within a few
> minutes, but that was of course without the "dm snapshot: fix primary_pe race"
> and "wait for chunks in destructor" patches, and that call trace (the "after a
> few minutes" one) relates to the "wait for chunks in destructor" problem.
>
> But yesterday I got another call trace:
>
> Pid: 25597, comm: kcopyd Not tainted (2.6.27.7 #47)
> EIP: 0060:[<c044d495>] EFLAGS: 00010282 CPU: 1
> EIP is at remove_exception+0x5/0x20
> EAX: c7b3d348 EBX: c7b3d348 ECX: 00200200 EDX: 00100100
> ESI: c799d770 EDI: ea6a99c0 EBP: 00000000 ESP: f26b5ec0
> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> Process kcopyd (pid: 25597, ti=f26b4000 task=f09e11f0 task.ti=f26b4000)
> Stack: c044e04f 0000000c 00000000 f1980c40 00000000 c044f313 001396a7 00000000
> 000013ba 00000000 c7b3dd88 ea6a99c0 00000000 c044e100 c7b3dd88 f30f7aa8
> 00000000 c0449234 00000000 f30f7b94 00000400 00000000 00000000 ea53fcb8
> Call Trace:
> [<c044e04f>] pending_complete+0x9f/0x110
> [<c044f313>] persistent_commit+0xe3/0x110
> [<c044e100>] copy_callback+0x30/0x40
> [<c0449234>] segment_complete+0x154/0x1d0
> [<c0448e65>] run_complete_job+0x45/0x80
> [<c04490e0>] segment_complete+0x0/0x1d0
> [<c0448e20>] run_complete_job+0x0/0x80
> [<c0449024>] process_jobs+0x14/0x70
> [<c0449080>] do_work+0x0/0x40
> [<c0449096>] do_work+0x16/0x40
> [<c013502d>] run_workqueue+0x4d/0xf0
> [<c013514d>] worker_thread+0x7d/0xc0
> [<c01382e0>] autoremove_wake_function+0x0/0x30
> [<c0526543>] __sched_text_start+0x1e3/0x4a0
> [<c01382e0>] autoremove_wake_function+0x0/0x30
> [<c0121a2b>] complete+0x2b/0x40
> [<c01350d0>] worker_thread+0x0/0xc0
> [<c0137db4>] kthread+0x44/0x70
> [<c0137d70>] kthread+0x0/0x70
> [<c0104c57>] kernel_thread_helper+0x7/0x10
> =======================
> Code: 4b 0c e8 cf ff ff ff 8b 56 08 8d 04 c2 8b 10 89 13 89 18 89 5a 04 89 43
> 04 5b 5e c3 8d 76 00 8d bc 27 00 00 00 00 8b 48 04 8b 10 <89> 11 89 4a 04 c7
> 00 00 01 10 00 c7 40 04 00 02 20 00 c3 90 8d
> EIP: [<c044d495>] remove_exception+0x5/0x20 SS:ESP 0068:f26b5ec0
> ---[ end trace 8a6182ef9a00114f ]---
>
> root@53434231:~# uname -a
> Linux 53434231 2.6.27.7 #47 SMP Tue Nov 25 08:57:37 CET
> 2008 i686 GNU/Linux
>
>
> root@53434231:~# addr2line -e ./vmlinux c044d495
> include/linux/list.h:93
>
>
> git log for my drivers/md/ shows, in order:
>
> dm snapshot: wait for chunks in destructor
> dm snapshot: fix register_snapshot deadlock
> dm snapshot: drop unused last_percent
> dm raid1: flush workqueue before destruction
> md: fix bug in raid10 recovery.
> md: linear: Fix a division by zero bug for very small arrays.
> dm snapshot: fix primary_pe race
> dm kcopyd: avoid queue shuffle
> md: Fix rdev_size_store with size == 0
> ...
>
>
>
>
>
> aluno3@poczta.onet.pl wrote:
> > Hi
> >
> >
> > Yes, I am sure that I used 2.6.27.6. I checked it again.
> >
> > When I tested 2.6.27 without "dm snapshot: fix primary_pe race" I
> > brought to call trace after a few minutes but with use "dm snapshot: fix
> > primary_pe race" bring to call trace is very hard.Sometimes even after a
> > few days test. I don`t have reproducible scenario :(. I tested kernel
> > with use Bacula, Rsync, LVM, Snapshot together and very heavy load.It
> > happened only 2 times through four weeks test.How can I help in this
> > case? Thanks and best
> >
> >
> > Mikulas Patocka wrote:
> >
> >> Hi
> >>
> >> This was supposed to be fixed with "dm snapshot: fix primary_pe race"
> >> patch in 2.6.27.4. Are you sure that you really see it on 2.6.27.6? If so,
> >> it looks like the bug wasn't fixed yet.
> >>
> >> How often does it happen? Do you have some reproducible scenario for this
> >> bug?
> >>
> >> Mikulas
> >>
> >>
> >>
> >>> Hi
> >>>
> >>> I tested kernel 2.6.27.6 with patch from 2.6.28rc (wait for chunks in destructor,fix register_snapshot deadlock,) and I identified next problem with kernel and dm but repeatability this problem is very small.I got call trace:
> >>>
> >>>
> >>> Pid: 26230, comm: kcopyd Not tainted (2.6.27.6 #36)
> >>> EIP: 0060:[<c044d485>] EFLAGS: 00010282 CPU: 1
> >>> EIP is at remove_exception+0x5/0x20
> >>> EAX: ca3b5908 EBX: ca3b5908 ECX: 00200200 EDX: 00100100
> >>> ESI: f7b489f8 EDI: e92ad980 EBP: 00000000 ESP: f29c7ec0
> >>> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> >>> Process kcopyd (pid: 26230, ti=f29c6000 task=e8512430 task.ti=f29c6000)
> >>> Stack: c044e03f 0000000d 00000000 c85948c0 00000000 c044f2e7 0009bc30
> >>> 00000000
> >>> 0000e705 00000000 e8e41288 e92ad980 00000000 c044e0f0 e8e41288
> >>> c7800ec8
> >>> 00000000 c0449224 00000000 c7800fb4 00000400 00000000 00000000
> >>> f2bdfbb0
> >>> Call Trace:
> >>> [<c044e03f>] pending_complete+0x9f/0x110
> >>> [<c044f2e7>] persistent_commit+0xc7/0x110
> >>> [<c044e0f0>] copy_callback+0x30/0x40
> >>> [<c0449224>] segment_complete+0x154/0x1d0
> >>> [<c0448e55>] run_complete_job+0x45/0x80
> >>> [<c04490d0>] segment_complete+0x0/0x1d0
> >>> [<c0448e10>] run_complete_job+0x0/0x80
> >>> [<c0449014>] process_jobs+0x14/0x70
> >>> [<c0449070>] do_work+0x0/0x40
> >>> [<c0449086>] do_work+0x16/0x40
> >>> [<c013502d>] run_workqueue+0x4d/0xf0
> >>> [<c013514d>] worker_thread+0x7d/0xc0
> >>> [<c01382e0>] autoremove_wake_function+0x0/0x30
> >>> [<c0526583>] __sched_text_start+0x1e3/0x4a0
> >>> [<c01382e0>] autoremove_wake_function+0x0/0x30
> >>> [<c0121a2b>] complete+0x2b/0x40
> >>> [<c01350d0>] worker_thread+0x0/0xc0
> >>> [<c0137db4>] kthread+0x44/0x70
> >>> [<c0137d70>] kthread+0x0/0x70
> >>> [<c0104c57>] kernel_thread_helper+0x7/0x10
> >>> =======================
> >>> Code: 4b 0c e8 cf ff ff ff 8b 56 08 8d 04 c2 8b 10 89 13 89 18 89 5a 04
> >>> 89 43 04 5b 5e c3 8d 76 00 8d bc 27 00 00 00 00 8b 48 04 8b 10 <89> 11
> >>> 89 4a 04 c7 00 00 01 10 00 c7 40 04 00 02 20 00 c3 90 8d
> >>> EIP: [<c044d485>] remove_exception+0x5/0x20 SS:ESP 0068:f29c7ec0
> >>> ---[ end trace 834a1d3742a1be05 ]---
> >>>
> >>>
> >>>
> >>> addr2line returned include/linux/list.h:93 for EIP c044d485:
> >>>
> >>>
> >>> static inline void __list_del(struct list_head * prev, struct list_head
> >>> * next)
> >>> {
> >>> next->prev = prev;
> >>> prev->next = next; //line 93
> >>> }
> >>>
> >>>
> >>>
> >>>
> >>> A few weeks ago I got similar call trace with plain kernel 2.6.27 and
> >>> patches from mail thread:
> >>>
> >>>
> >>> BUG: unable to handle kernel paging request at 00200200
> >>> IP: [<c044bf65>] remove_exception+0x5/0x20
> >>> *pdpt = 0000000029acc001 *pde = 0000000000000000
> >>> Oops: 0002 [#1] SMP
> >>> Modules linked in: iscsi_trgt mptctl mptbase st sg drbd bonding
> >>> iscsi_tcp libiscsi scsi_transport_iscsi aacraid sata_nv forcedeth button
> >>> ftdi_sio usbserial
> >>>
> >>> Pid: 31375, comm: kcopyd Not tainted (2.6.27 #21)
> >>> EIP: 0060:[<c044bf65>] EFLAGS: 00010282 CPU: 1
> >>> EIP is at remove_exception+0x5/0x20
> >>> EAX: f276da88 EBX: f276da88 ECX: 00200200 EDX: 00100100
> >>> ESI: c79a4a58 EDI: c9268cc0 EBP: 00000000 ESP: ecbcbec0
> >>> DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
> >>> Process kcopyd (pid: 31375, ti=ecbca000 task=e6d9d220 task.ti=ecbca000)
> >>> Stack: c044cb1f 0000000e 00000000 c916b480 00000000 c044dde3 00018f47
> >>> 00000000
> >>> 00002870 00000000 c70cba48 c9268cc0 00000000 c044cbd0 c70cba48
> >>> c720aec8
> >>> 00000000 c0447d04 00000000 c720afb4 00000400 00000000 00000000
> >>> efc65580
> >>> Call Trace:
> >>> [<c044cb1f>] pending_complete+0x9f/0x110
> >>> [<c044dde3>] persistent_commit+0xe3/0x110
> >>> [<c044cbd0>] copy_callback+0x30/0x40
> >>> [<c0447d04>] segment_complete+0x154/0x1d0
> >>> [<c0447935>] run_complete_job+0x45/0x80
> >>> [<c0447bb0>] segment_complete+0x0/0x1d0
> >>> [<c04478f0>] run_complete_job+0x0/0x80
> >>> [<c0447af4>] process_jobs+0x14/0x70
> >>> [<c0447b50>] do_work+0x0/0x40
> >>> [<c0447b66>] do_work+0x16/0x40
> >>> [<c013509d>] run_workqueue+0x4d/0xf0
> >>> [<c01351bd>] worker_thread+0x7d/0xc0
> >>> [<c0138350>] autoremove_wake_function+0x0/0x30
> >>> [<c0524f2c>] __sched_text_start+0x1ec/0x4b0
> >>> [<c0138350>] autoremove_wake_function+0x0/0x30
> >>> [<c0121a9b>] complete+0x2b/0x40
> >>> [<c0135140>] worker_thread+0x0/0xc0
> >>> [<c0137e24>] kthread+0x44/0x70
> >>> [<c0137de0>] kthread+0x0/0x70
> >>> [<c0104c57>] kernel_thread_helper+0x7/0x10
> >>> =======================
> >>> Code: 4b 0c e8 cf ff ff ff 8b 56 08 8d 04 c2 8b 10 89 13 89 18 89 5a 04
> >>> 89 43 04 5b 5e c3 8d 76 00 8d bc 27 00 00 00 00 8b 48 04 8b 10 <89> 11
> >>> 89 4a 04 c7 00 00 01 10 00 c7 40 04 00 02 20 00 c3 90 8d
> >>> EIP: [<c044bf65>] remove_exception+0x5/0x20 SS:ESP 0068:ecbcbec0
> >>> ---[ end trace 25afcedfe7eb0a2b ]---
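
A note on reading the trace above: the fault address 00200200 and the
ECX/EDX values (00200200 and 00100100) match the list poison constants
that list_del() writes into an entry after unlinking it, and the Code:
bytes end with stores of those same two constants. One plausible
reading, not confirmed anywhere in this thread, is that
remove_exception() ran on an exception that had already been removed
from its hash list. Below is a minimal userspace sketch of that
mechanism. list_entry_is_poisoned() and list_del_checked() are
hypothetical helpers made up for illustration; they are not kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Poison values as they appear in the 32-bit oops (EDX and ECX). */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x00100100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x00200200)

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Hypothetical helper: has this entry already been unlinked? */
static int list_entry_is_poisoned(const struct list_head *entry)
{
	return entry->next == LIST_POISON1 || entry->prev == LIST_POISON2;
}

/*
 * Hypothetical checked variant of list_del(). Returns 0 on success and
 * -1 where an unchecked delete would dereference a poison pointer,
 * which is the point where the kernel oopses.
 */
static int list_del_checked(struct list_head *entry)
{
	assert(entry != NULL);
	if (list_entry_is_poisoned(entry))
		return -1;
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	entry->next = LIST_POISON1;	/* poison, as list_del() does */
	entry->prev = LIST_POISON2;
	return 0;
}
```

Deleting the same entry twice with list_del_checked() returns 0 the
first time and -1 the second time. In the kernel there is no such check,
so the second delete writes through the poisoned prev pointer and
faults at 00200200.
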
> >>>
> >>> Is this a known problem or something new? Thanks.
> >>>
> >>>
> >>> Mikulas Patocka wrote:
> >>>
> >>>
> >>>> On Thu, 23 Oct 2008, aluno3@poczta.onet.pl wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> I used dm-snapshot-fix-primary-pe-race.patch and the last patch related
> >>>>> to pending_exception. After the same test and workload, everything has
> >>>>> worked correctly so far. Is it the final patch?
> >>>>>
> >>>>>
> >>>>>
> >>>> Yes, these two patches are expected to be the final fix. Thanks for the
> >>>> testing. If you get some more crashes even with these two, write about
> >>>> them.
> >>>>
> >>>> Mikulas
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> best and thanks
> >>>>>
> >>>>>
> >>>>> Mikulas Patocka wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Oh, sorry for this "struct struct" in the patch in free_pending_exception,
> >>>>>> replace it just with one "struct". I forgot to refresh the patch before
> >>>>>> sending it.
> >>>>>>
> >>>>>> Mikulas
> >>>>>>
> >>>>>> On Wed, 22 Oct 2008, Mikulas Patocka wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Wed, 22 Oct 2008, aluno3@poczta.onet.pl wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> Hi
> >>>>>>>>
> >>>>>>>> I applied your patch and ran the test with the same workload. After a
> >>>>>>>> few hours of testing, everything is OK. Is that possible? The test is
> >>>>>>>> still running. If the kernel reports anything wrong, I will write to
> >>>>>>>> you again.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>> Hi
> >>>>>>>
> >>>>>>> It's good that it works. So try this. Keep the first patch (it is this
> >>>>>>> one ---
> >>>>>>> http://people.redhat.com/mpatocka/patches/kernel/2.6.27/dm-snapshot-fix-primary-pe-race.patch
> >>>>>>> --- I think Milan already sent it to you and you have it applied). Undo
> >>>>>>> the second patch (the one that hides the deallocation with /* */), apply
> >>>>>>> this one instead, and run the same test.
> >>>>>>>
> >>>>>>> Mikulas
> >>>>>>>
> >>>>>>> ---
> >>>>>>> drivers/md/dm-snap.c | 10 +++++++++-
> >>>>>>> drivers/md/dm-snap.h | 2 ++
> >>>>>>> 2 files changed, 11 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> Index: linux-2.6.27-clean/drivers/md/dm-snap.c
> >>>>>>> ===================================================================
> >>>>>>> --- linux-2.6.27-clean.orig/drivers/md/dm-snap.c 2008-10-22 15:41:24.000000000 +0200
> >>>>>>> +++ linux-2.6.27-clean/drivers/md/dm-snap.c 2008-10-22 15:51:33.000000000 +0200
> >>>>>>> @@ -368,6 +368,7 @@ static struct dm_snap_pending_exception
> >>>>>>> struct dm_snap_pending_exception *pe = mempool_alloc(s->pending_pool,
> >>>>>>> GFP_NOIO);
> >>>>>>>
> >>>>>>> + atomic_inc(&s->n_pending_exceptions);
> >>>>>>> pe->snap = s;
> >>>>>>>
> >>>>>>> return pe;
> >>>>>>> @@ -375,7 +376,10 @@ static struct dm_snap_pending_exception
> >>>>>>>
> >>>>>>> static void free_pending_exception(struct dm_snap_pending_exception *pe)
> >>>>>>> {
> >>>>>>> - mempool_free(pe, pe->snap->pending_pool);
> >>>>>>> + struct struct dm_snapshot *s = pe->snap;
> >>>>>>> + mempool_free(pe, s->pending_pool);
> >>>>>>> + smp_mb__before_atomic_dec();
> >>>>>>> + atomic_dec(&s->n_pending_exceptions);
> >>>>>>> }
> >>>>>>>
> >>>>>>> static void insert_completed_exception(struct dm_snapshot *s,
> >>>>>>> @@ -601,6 +605,7 @@ static int snapshot_ctr(struct dm_target
> >>>>>>> s->valid = 1;
> >>>>>>> s->active = 0;
> >>>>>>> s->last_percent = 0;
> >>>>>>> + atomic_set(&s->n_pending_exceptions, 0);
> >>>>>>> init_rwsem(&s->lock);
> >>>>>>> spin_lock_init(&s->pe_lock);
> >>>>>>> s->ti = ti;
> >>>>>>> @@ -727,6 +732,9 @@ static void snapshot_dtr(struct dm_targe
> >>>>>>> /* After this returns there can be no new kcopyd jobs. */
> >>>>>>> unregister_snapshot(s);
> >>>>>>>
> >>>>>>> + while (atomic_read(&s->n_pending_exceptions))
> >>>>>>> + yield();
> >>>>>>> +
> >>>>>>> #ifdef CONFIG_DM_DEBUG
> >>>>>>> for (i = 0; i < DM_TRACKED_CHUNK_HASH_SIZE; i++)
> >>>>>>> BUG_ON(!hlist_empty(&s->tracked_chunk_hash[i]));
> >>>>>>> Index: linux-2.6.27-clean/drivers/md/dm-snap.h
> >>>>>>> ===================================================================
> >>>>>>> --- linux-2.6.27-clean.orig/drivers/md/dm-snap.h 2008-10-22 15:45:08.000000000 +0200
> >>>>>>> +++ linux-2.6.27-clean/drivers/md/dm-snap.h 2008-10-22 15:46:49.000000000 +0200
> >>>>>>> @@ -163,6 +163,8 @@ struct dm_snapshot {
> >>>>>>>
> >>>>>>> mempool_t *pending_pool;
> >>>>>>>
> >>>>>>> + atomic_t n_pending_exceptions;
> >>>>>>> +
> >>>>>>> struct exception_table pending;
> >>>>>>> struct exception_table complete;
> >>>>>>>
> >>>>>>>
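
For readers following the patch above: it counts live pending
exceptions per snapshot and makes the destructor wait until the count
drops to zero before tearing the snapshot down. Here is a rough
userspace sketch of the same pattern, using C11 atomics and
sched_yield() in place of the kernel's atomic_t and yield(). The names
struct snapshot, struct pending, pending_alloc(), pending_free() and
snapshot_wait_idle() are stand-ins invented for this sketch, and
malloc()/free() replace the mempool, so this shows the idea rather than
the kernel code.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

struct snapshot {			/* stand-in for struct dm_snapshot */
	atomic_int n_pending;		/* live pending exceptions */
};

struct pending {			/* stand-in for the pending exception */
	struct snapshot *snap;
};

static void snapshot_init(struct snapshot *s)
{
	atomic_init(&s->n_pending, 0);
}

static struct pending *pending_alloc(struct snapshot *s)
{
	struct pending *pe = malloc(sizeof(*pe));

	if (!pe)			/* unlike the mempool, malloc can fail */
		return NULL;
	atomic_fetch_add(&s->n_pending, 1);
	pe->snap = s;
	return pe;
}

static void pending_free(struct pending *pe)
{
	struct snapshot *s = pe->snap;	/* read the owner before freeing pe */

	free(pe);
	/* Sequentially consistent, so the free is ordered before the drop. */
	atomic_fetch_sub(&s->n_pending, 1);
}

/* Destructor side: spin politely until nothing references the snapshot. */
static void snapshot_wait_idle(struct snapshot *s)
{
	while (atomic_load(&s->n_pending) != 0)
		sched_yield();
}
```

The ordering in pending_free() is the part that matters: the counter is
decremented only after the object has been returned, so a destructor
that sees zero can assume no pending exception still points at the
snapshot or its pool.
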
> >>>>>>> --
> >>>>>>> dm-devel mailing list
> >>>>>>> dm-devel@redhat.com
> >>>>>>> https://www.redhat.com/mailman/listinfo/dm-devel
end of thread, other threads:[~2008-12-02 2:10 UTC | newest]
Thread overview: 17+ messages
2008-10-20 6:23 Calltrace in dm-snapshot in 2.6.27 kernel aluno3
2008-10-20 8:43 ` Milan Broz
2008-10-21 6:39 ` aluno3
2008-10-21 13:55 ` Mikulas Patocka
[not found] ` <48FDFF53.5080007@poczta.onet.pl>
2008-10-21 17:22 ` Mikulas Patocka
2008-10-21 18:42 ` aluno3
2008-10-21 21:43 ` Mikulas Patocka
2008-10-22 13:37 ` aluno3
2008-10-22 15:45 ` Mikulas Patocka
2008-10-22 16:39 ` Mikulas Patocka
2008-10-23 11:30 ` aluno3
2008-10-23 13:40 ` Mikulas Patocka
2008-11-19 8:31 ` aluno3
2008-11-24 10:52 ` Mikulas Patocka
2008-11-26 7:38 ` aluno3
2008-11-28 7:28 ` aluno3
2008-12-02 2:10 ` Mikulas Patocka