* Fwd: raid1 data corruption during resync [not found] <D4FE2D75-4208-48C9-A4D0-432F092E5AE9@redhat.com> @ 2014-09-08 15:52 ` Brassow Jonathan 2014-09-09 1:08 ` NeilBrown 0 siblings, 1 reply; 17+ messages in thread From: Brassow Jonathan @ 2014-09-08 15:52 UTC (permalink / raw) To: linux-raid@vger.kernel.org Raid; +Cc: NeilBrown Begin forwarded message: > From: Brassow Jonathan <jbrassow@redhat.com> > Date: September 4, 2014 9:01:58 AM CDT > To: NeilBrown <neilb@suse.de> > Cc: Eivind Sarto <eivindsarto@gmail.com>, "linux-raid@vger.kernel.org Raid" <linux-raid@vger.kernel.org> > Subject: Re: raid1 data corruption during resync > > > On Sep 4, 2014, at 12:28 AM, NeilBrown wrote: > >> >> Neither of these explain the hang you are seeing. >> I note that the "md0-resync" thread isn't listed. I don't suppose you know >> what it was doing (stack trace)?? >> Also, had "md: md0: sync done" appeared in syslog yet? > > The sync has not yet completed (no message). I'm not sure why the resync thread didn't automatically report, but I've grabbed the entire trace from the machine ('echo t > /proc/sysrq-trigger') and it appears there. The traces are attached. (Would you rather have something so large posted in-line?) I didn't see this message come through yet. I am resending it with only the trace you requested from the mdX_resync thread. If you need the entire list of traces, I can try resending that. brassow Sep 4 08:52:00 bp-01 kernel: mdX_resync D 0000000000000008 0 12374 2 0x00000080 Sep 4 08:52:00 bp-01 kernel: ffff880207a5bb38 0000000000000046 0000000000000296 ffff88021727efc0 Sep 4 08:52:00 bp-01 kernel: ffff880207a58010 0000000000012bc0 0000000000012bc0 ffff880214de0f40 Sep 4 08:52:00 bp-01 kernel: ffff880207a5bb60 ffff88040171e178 ffff88040171e100 ffff880207a5bb58 Sep 4 08:52:00 bp-01 kernel: Call Trace: Sep 4 08:52:00 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70 Sep 4 08:52:00 bp-01 kernel: [<ffffffffa038a122>] raise_barrier+0xe2/0x160 [raid1] Sep 4 08:52:00 bp-01 kernel: [<ffffffff8108eb00>] ? bit_waitqueue+0xe0/0xe0 Sep 4 08:52:00 bp-01 kernel: [<ffffffffa038b701>] sync_request+0x161/0x9e0 [raid1] Sep 4 08:52:00 bp-01 kernel: [<ffffffff8108ee13>] ? __wake_up+0x53/0x70 Sep 4 08:52:00 bp-01 kernel: [<ffffffff81460009>] md_do_sync+0x849/0xd40 Sep 4 08:52:00 bp-01 kernel: [<ffffffff8108847f>] ? put_prev_entity+0x2f/0x400 Sep 4 08:52:00 bp-01 kernel: [<ffffffff81460816>] md_thread+0x116/0x150 Sep 4 08:52:00 bp-01 kernel: [<ffffffff8158006e>] ? __schedule+0x34e/0x6e0 Sep 4 08:52:00 bp-01 kernel: [<ffffffff81460700>] ? md_rdev_init+0x110/0x110 Sep 4 08:52:00 bp-01 kernel: [<ffffffff8107083e>] kthread+0xce/0xf0 Sep 4 08:52:00 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70 Sep 4 08:52:00 bp-01 kernel: [<ffffffff8158412c>] ret_from_fork+0x7c/0xb0 Sep 4 08:52:00 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-09-08 15:52 ` Fwd: raid1 data corruption during resync Brassow Jonathan @ 2014-09-09 1:08 ` NeilBrown 0 siblings, 0 replies; 17+ messages in thread From: NeilBrown @ 2014-09-09 1:08 UTC (permalink / raw) To: Brassow Jonathan; +Cc: linux-raid@vger.kernel.org Raid [-- Attachment #1: Type: text/plain, Size: 3950 bytes --] On Mon, 8 Sep 2014 10:52:49 -0500 Brassow Jonathan <jbrassow@redhat.com> wrote: > > > Begin forwarded message: > > > From: Brassow Jonathan <jbrassow@redhat.com> > > Date: September 4, 2014 9:01:58 AM CDT > > To: NeilBrown <neilb@suse.de> > > Cc: Eivind Sarto <eivindsarto@gmail.com>, "linux-raid@vger.kernel.org Raid" <linux-raid@vger.kernel.org> > > Subject: Re: raid1 data corruption during resync > > > > > > On Sep 4, 2014, at 12:28 AM, NeilBrown wrote: > > > >> > >> Neither of these explain the hang you are seeing. > >> I note that the "md0-resync" thread isn't listed. I don't suppose you know > >> what it was doing (stack trace)?? > >> Also, had "md: md0: sync done" appeared in syslog yet? > > > > The sync has not yet completed (no message). I'm not sure why the resync thread didn't automatically report, but I've grabbed the entire trace from the machine ('echo t > /proc/sysrq-trigger') and it appears there. The traces are attached. (Would you rather have something so large posted in-line?) > > I didn't see this message come through yet. I am resending it with only the trace you requested from the mdX_resync thread. If you need the entire list of traces, I can try resending that. > > brassow > > Sep 4 08:52:00 bp-01 kernel: mdX_resync D 0000000000000008 0 12374 2 0x00000080 > Sep 4 08:52:00 bp-01 kernel: ffff880207a5bb38 0000000000000046 0000000000000296 ffff88021727efc0 > Sep 4 08:52:00 bp-01 kernel: ffff880207a58010 0000000000012bc0 0000000000012bc0 ffff880214de0f40 > Sep 4 08:52:00 bp-01 kernel: ffff880207a5bb60 ffff88040171e178 ffff88040171e100 ffff880207a5bb58 > Sep 4 08:52:00 bp-01 kernel: Call Trace: > Sep 4 08:52:00 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70 > Sep 4 08:52:00 bp-01 kernel: [<ffffffffa038a122>] raise_barrier+0xe2/0x160 [raid1] > Sep 4 08:52:00 bp-01 kernel: [<ffffffff8108eb00>] ? bit_waitqueue+0xe0/0xe0 > Sep 4 08:52:00 bp-01 kernel: [<ffffffffa038b701>] sync_request+0x161/0x9e0 [raid1] > Sep 4 08:52:00 bp-01 kernel: [<ffffffff8108ee13>] ? __wake_up+0x53/0x70 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff81460009>] md_do_sync+0x849/0xd40 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff8108847f>] ? put_prev_entity+0x2f/0x400 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff81460816>] md_thread+0x116/0x150 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff8158006e>] ? __schedule+0x34e/0x6e0 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff81460700>] ? md_rdev_init+0x110/0x110 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff8107083e>] kthread+0xce/0xf0 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff8158412c>] ret_from_fork+0x7c/0xb0 > Sep 4 08:52:00 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70 > I did get the original thanks, I guess it just didn't make it to the list as well. Probably there is a limit on attachment sizes which isn't entirely unreasonable. No brilliant ideas yet... That fact that there are no kernel messages: no "sync done" or "redirecting sector" or "unrecoverable I/O" is a little surprising... It appears that conf->barrier is elevated, implying that there is a resync request that is still in-flight. 
However, I cannot think where it would be. It might help if I could see the disassembly of raise_barrier() so I could confirm whether "raise_barrier+0xe2/0x160" is in the first or the second "wait_event_lock_irq" call. I assume it is in the first, and is waiting for the request that kcopyd wants to submit to complete. If it were the second, then it would be waiting for ->start_next_window to increase. That happens when allow_barrier() is called. If that were the blockage, it means that some normal write is in-flight, rather than a sync request. But I don't know where it could be either. So - still mystified. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
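For reference, both wait_event_lock_irq() calls mentioned above live in raise_barrier(). A rough sketch of the function as it looks after commit 79ef3a8 (reconstructed for illustration; the exact upstream code may differ in detail):

    static void raise_barrier(struct r1conf *conf)
    {
            spin_lock_irq(&conf->resync_lock);

            /* First wait site: wait until no normal I/O is queued up
             * waiting on the barrier. */
            wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
                                conf->resync_lock);

            /* Block any new I/O from starting. */
            conf->barrier++;

            /* Second wait site: the array must not be frozen, the number
             * of in-flight resync requests must stay below RESYNC_DEPTH,
             * and the window of normal writes must have moved clear of
             * the region this resync request is about to touch. */
            wait_event_lock_irq(conf->wait_barrier,
                                !conf->array_frozen &&
                                conf->barrier < RESYNC_DEPTH &&
                                (conf->start_next_window >=
                                 conf->next_resync + RESYNC_SECTORS),
                                conf->resync_lock);

            spin_unlock_irq(&conf->resync_lock);
    }

Which of the two calls the offset 0xe2/0x160 falls in depends on the compiled layout, hence the request for the disassembly.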
* raid1 data corruption during resync @ 2014-08-29 19:29 Eivind Sarto 2014-09-02 14:10 ` Brassow Jonathan ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Eivind Sarto @ 2014-08-29 19:29 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid I am seeing occasional data corruption during raid1 resync. Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. Once I validate that the fix indeed works, I will post a proper patch. Do you have any feedback? — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf * handling. */ wait_event_lock_irq(conf->wait_barrier, - !conf->array_frozen && + !conf->array_frozen && !conf->nr_pending && conf->barrier < RESYNC_DEPTH && (conf->start_next_window >= conf->next_resync + RESYNC_SECTORS), -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
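For comparison, before commit 79ef3a8 the second wait in raise_barrier() looked roughly like this (a reconstruction of the old behaviour, not an exact quote of the old source); this is the nr_pending wait the report refers to:

            /* Block any new I/O from starting. */
            conf->barrier++;

            /* Old behaviour: wait for *all* pending normal I/O to drain
             * before letting a resync request proceed. */
            wait_event_lock_irq(conf->wait_barrier,
                                !conf->nr_pending &&
                                conf->barrier < RESYNC_DEPTH,
                                conf->resync_lock);

Under the old scheme a resync request and normal I/O could not be in flight at the same time; the new scheme only serialises writes that fall near the resync point.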
* Re: raid1 data corruption during resync 2014-08-29 19:29 Eivind Sarto @ 2014-09-02 14:10 ` Brassow Jonathan 2014-09-02 16:43 ` Eivind Sarto 2014-09-02 17:04 ` Eivind Sarto 2014-09-02 16:59 ` Brassow Jonathan 2014-09-02 19:24 ` Brassow Jonathan 2 siblings, 2 replies; 17+ messages in thread From: Brassow Jonathan @ 2014-09-02 14:10 UTC (permalink / raw) To: Eivind Sarto; +Cc: NeilBrown, linux-raid There is absolutely an issue with the mentioned commit. We are seeing symptoms of a different sort. Our testing group is doing device fault testing. (They are using LVM to create the RAID devices, but as you know, that uses the same underlying kernel code.) We encounter situations where the resync thread is stuck waiting for a barrier to come down, but it never does. All I/O to the RAID1 device blocks. There are a few different variations to how the problem manifests itself, but I have bisected the kernel to this commit (79ef3a8). Unfortunately, it does take quite some time (hours) to reproduce the issue with our test scripts. I will try testing this proposed patch while I try to figure out what the patch is doing and what might have gone wrong. brassow On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: > I am seeing occasional data corruption during raid1 resync. > Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. > Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. > It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. > Once I validate that the fix indeed works, I will post a proper patch. > Do you have any feedback? > > — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 > +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 > @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf > * handling. > */ > wait_event_lock_irq(conf->wait_barrier, > - !conf->array_frozen && > + !conf->array_frozen && !conf->nr_pending && > conf->barrier < RESYNC_DEPTH && > (conf->start_next_window >= > conf->next_resync + RESYNC_SECTORS), > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-09-02 14:10 ` Brassow Jonathan @ 2014-09-02 16:43 ` Eivind Sarto 2014-09-02 17:04 ` Eivind Sarto 1 sibling, 0 replies; 17+ messages in thread From: Eivind Sarto @ 2014-09-02 16:43 UTC (permalink / raw) To: Brassow Jonathan; +Cc: Eivind Sarto, NeilBrown, linux-raid On Sep 2, 2014, at 7:10 AM, Brassow Jonathan <jbrassow@redhat.com> wrote: > There is absolutely an issue with the mentioned commit. > > We are seeing symptoms of a different sort. Our testing group is doing device fault testing. (They are using LVM to create the RAID devices, but as you know, that uses the same underlying kernel code.) We encounter situations where the resync thread is stuck waiting for a barrier to come down, but it never does. All I/O to the RAID1 device blocks. > > There are a few different variations to how the problem manifests itself, but I have bisected the kernel to this commit (79ef3a8). > > Unfortunately, it does take quite some time (hours) to reproduce the issue with our test scripts. I will try testing this proposed patch while I try to figure out what the patch is doing and what might have gone wrong. > > brassow > > > On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: > >> I am seeing occasional data corruption during raid1 resync. >> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. >> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. >> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. >> Once I validate that the fix indeed works, I will post a proper patch. >> Do you have any feedback? >> >> — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 >> +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 >> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf >> * handling. >> */ >> wait_event_lock_irq(conf->wait_barrier, >> - !conf->array_frozen && >> + !conf->array_frozen && !conf->nr_pending && >> conf->barrier < RESYNC_DEPTH && >> (conf->start_next_window >= >> conf->next_resync + RESYNC_SECTORS), >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > The suggested patch is not the right thing to do. I got confused about the big comment about /* Barriers…. */ The comment suggests that conf->nr_pending is used for the implementation of barriers. This is not the case. The new barrier code no longer uses nr_pending. nr_pending is now only used when freezing the array. Sorry for the confusion. I back-ported the new barrier code to an older version of linux. I could have introduced a bug during the back-porting. I will try and upgrade my system to a newer kernel and see if I can reproduce the corruption. -eivind-- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
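For reference, the freeze path mentioned above is freeze_array(), which is now the main consumer of nr_pending. Roughly (a sketch from memory; details may differ from the exact source):

    static void freeze_array(struct r1conf *conf, int extra)
    {
            /* Stop sync I/O and normal I/O and wait for everything to
             * quiesce: pending requests must either complete or be
             * queued for retry. */
            spin_lock_irq(&conf->resync_lock);
            conf->array_frozen = 1;
            wait_event_lock_irq_cmd(conf->wait_barrier,
                                    conf->nr_pending == conf->nr_queued + extra,
                                    conf->resync_lock,
                                    flush_pending_writes(conf));
            spin_unlock_irq(&conf->resync_lock);
    }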
* Re: raid1 data corruption during resync 2014-09-02 14:10 ` Brassow Jonathan 2014-09-02 16:43 ` Eivind Sarto @ 2014-09-02 17:04 ` Eivind Sarto 1 sibling, 0 replies; 17+ messages in thread From: Eivind Sarto @ 2014-09-02 17:04 UTC (permalink / raw) To: Brassow Jonathan; +Cc: Eivind Sarto, NeilBrown, linux-raid On Sep 2, 2014, at 7:10 AM, Brassow Jonathan <jbrassow@redhat.com> wrote: > There is absolutely an issue with the mentioned commit. > > We are seeing symptoms of a different sort. Our testing group is doing device fault testing. (They are using LVM to create the RAID devices, but as you know, that uses the same underlying kernel code.) We encounter situations where the resync thread is stuck waiting for a barrier to come down, but it never does. All I/O to the RAID1 device blocks. > > There are a few different variations to how the problem manifests itself, but I have bisected the kernel to this commit (79ef3a8). > > Unfortunately, it does take quite some time (hours) to reproduce the issue with our test scripts. I will try testing this proposed patch while I try to figure out what the patch is doing and what might have gone wrong. > > brassow > > > On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: > >> I am seeing occasional data corruption during raid1 resync. >> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. >> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. >> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. >> Once I validate that the fix indeed works, I will post a proper patch. >> Do you have any feedback? >> >> — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 >> +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 >> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf >> * handling. >> */ >> wait_event_lock_irq(conf->wait_barrier, >> - !conf->array_frozen && >> + !conf->array_frozen && !conf->nr_pending && >> conf->barrier < RESYNC_DEPTH && >> (conf->start_next_window >= >> conf->next_resync + RESYNC_SECTORS), >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > Yes, you are correct. I think above patch would just return the resync to the old implementation. No resync-window. Just either one of resync-IO or user-IO at any given time for the entire array. As I said, the comment in the resync code no longer matches the actual implementation. -eivind-- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-08-29 19:29 Eivind Sarto 2014-09-02 14:10 ` Brassow Jonathan @ 2014-09-02 16:59 ` Brassow Jonathan 2014-09-02 19:24 ` Brassow Jonathan 2 siblings, 0 replies; 17+ messages in thread From: Brassow Jonathan @ 2014-09-02 16:59 UTC (permalink / raw) To: Eivind Sarto; +Cc: NeilBrown, linux-raid@vger.kernel.org Raid, majianpeng On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: > I am seeing occasional data corruption during raid1 resync. > Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. > Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. > It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. > Once I validate that the fix indeed works, I will post a proper patch. > Do you have any feedback? > > — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 > +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 > @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf > * handling. > */ > wait_event_lock_irq(conf->wait_barrier, > - !conf->array_frozen && > + !conf->array_frozen && !conf->nr_pending && > conf->barrier < RESYNC_DEPTH && > (conf->start_next_window >= > conf->next_resync + RESYNC_SECTORS), I've been through the description of the mentioned commit and briefly looked at the code. I don't understand the changes or the algorithm quite yet, but wouldn't the change you are proposing here eliminate the benefit that the commit was trying to provide? Wouldn't it simply return the behavior to the old way? I think the point of the commit is to allow nominal I/O to run concurrently to resync I/O as long as it is not done to 'window' in which the resync I/O is running. By adding the check for nr_pending, wouldn't you be blocking recovery from taking place if there are any nominal I/O's outstanding - even if they are not to the same area of the device? CC'ing patch author to get his thoughts as well. brassow-- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
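To make the "window" idea described above concrete, here is a toy user-space model of the decision a write would face (names and the window size are invented for illustration; this is not the kernel code, and it glosses over start_next_window and the per-window request counting):

    #include <stdbool.h>
    #include <stdio.h>

    #define RESYNC_WINDOW_SECTORS (32 * 1024)   /* illustrative size only */

    /*
     * Toy model of the idea behind commit 79ef3a8: resync owns a sliding
     * window of sectors starting at the current resync position.  A normal
     * write only has to wait on the barrier if it lands inside that window;
     * writes elsewhere may proceed concurrently with resync I/O.
     */
    static bool write_must_wait(long long resync_pos, long long write_sector)
    {
            return write_sector >= resync_pos &&
                   write_sector < resync_pos + RESYNC_WINDOW_SECTORS;
    }

    int main(void)
    {
            long long resync_pos = 1000000;

            printf("write at %lld waits: %d\n", resync_pos + 100,
                   write_must_wait(resync_pos, resync_pos + 100));
            printf("write at %lld waits: %d\n",
                   resync_pos + 10 * RESYNC_WINDOW_SECTORS,
                   write_must_wait(resync_pos,
                                   resync_pos + 10 * RESYNC_WINDOW_SECTORS));
            return 0;
    }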
* Re: raid1 data corruption during resync 2014-08-29 19:29 Eivind Sarto 2014-09-02 14:10 ` Brassow Jonathan 2014-09-02 16:59 ` Brassow Jonathan @ 2014-09-02 19:24 ` Brassow Jonathan 2014-09-02 22:07 ` Eivind Sarto 2 siblings, 1 reply; 17+ messages in thread From: Brassow Jonathan @ 2014-09-02 19:24 UTC (permalink / raw) To: Eivind Sarto; +Cc: NeilBrown, linux-raid On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: > I am seeing occasional data corruption during raid1 resync. > Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. > Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. > It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. > Once I validate that the fix indeed works, I will post a proper patch. > Do you have any feedback? > > — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 > +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 > @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf > * handling. > */ > wait_event_lock_irq(conf->wait_barrier, > - !conf->array_frozen && > + !conf->array_frozen && !conf->nr_pending && > conf->barrier < RESYNC_DEPTH && > (conf->start_next_window >= > conf->next_resync + RESYNC_SECTORS), This patch does not work - at least, it doesn't fix the issues I'm seeing. My system hangs (in various places, like the resync thread) after commit 79ef3a8. When testing this patch, I also added some code to dm-raid.c to allow me to print-out some of the variables when I encounter a problem. After applying this patch and printing the variables, I see: Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288 Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -46 5257 Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562 Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0 Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0 Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued = 0 Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1 Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0 Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up. brassow -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
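The negative current_window_requests/next_window_requests values above are the kind of thing you would expect if a request is counted against one window when it is submitted but decremented against the other when it completes, for example because the window boundary moved in between. A toy illustration of that failure mode (not the raid1 code; the names and boundary are invented):

    #include <stdio.h>

    /*
     * Toy accounting: every submit picks a window and increments its
     * counter; every completion must decrement the *same* window.  If the
     * boundary moves while the request is in flight and the completion
     * re-derives the window from the new boundary, the decrement hits the
     * wrong counter and the counts drift negative.
     */
    static long long window_start = 1000;   /* invented boundary */
    static int current_window_requests;
    static int next_window_requests;

    static void submit_write(long long sector)
    {
            if (sector >= window_start)
                    next_window_requests++;
            else
                    current_window_requests++;
    }

    static void complete_write(long long sector)
    {
            if (sector >= window_start)
                    next_window_requests--;
            else
                    current_window_requests--;
    }

    int main(void)
    {
            submit_write(1500);     /* counted in the "next" window */
            window_start = 2000;    /* boundary advances while in flight */
            complete_write(1500);   /* ...so the decrement lands on "current" */
            printf("current_window_requests = %d, next_window_requests = %d\n",
                   current_window_requests, next_window_requests);
            return 0;
    }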
* Re: raid1 data corruption during resync 2014-09-02 19:24 ` Brassow Jonathan @ 2014-09-02 22:07 ` Eivind Sarto 2014-09-02 22:14 ` Brassow Jonathan 2014-09-02 23:55 ` NeilBrown 0 siblings, 2 replies; 17+ messages in thread From: Eivind Sarto @ 2014-09-02 22:07 UTC (permalink / raw) To: Brassow Jonathan; +Cc: Eivind Sarto, NeilBrown, linux-raid On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@redhat.com> wrote: > > On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: > >> I am seeing occasional data corruption during raid1 resync. >> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. >> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. >> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. >> Once I validate that the fix indeed works, I will post a proper patch. >> Do you have any feedback? >> >> — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 >> +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 >> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf >> * handling. >> */ >> wait_event_lock_irq(conf->wait_barrier, >> - !conf->array_frozen && >> + !conf->array_frozen && !conf->nr_pending && >> conf->barrier < RESYNC_DEPTH && >> (conf->start_next_window >= >> conf->next_resync + RESYNC_SECTORS), > > This patch does not work - at least, it doesn't fix the issues I'm seeing. My system hangs (in various places, like the resync thread) after commit 79ef3a8. When testing this patch, I also added some code to dm-raid.c to allow me to print-out some of the variables when I encounter a problem. After applying this patch and printing the variables, I see: > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288 > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -46 > 5257 > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562 > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0 > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0 > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued = 0 > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1 > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0 > > Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up. > > brassow > After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 I notice that wait_barrier() will now only exclude writes. User reads are not excluded even if the fall within the resync window. The old implementation used to exclude both reads and writes while resync-IO is active. Could this be a cause of data corruption? -eivind-- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-09-02 22:07 ` Eivind Sarto @ 2014-09-02 22:14 ` Brassow Jonathan 2014-09-02 23:55 ` NeilBrown 1 sibling, 0 replies; 17+ messages in thread From: Brassow Jonathan @ 2014-09-02 22:14 UTC (permalink / raw) To: Eivind Sarto; +Cc: NeilBrown, linux-raid@vger.kernel.org Raid On Sep 2, 2014, at 5:07 PM, Eivind Sarto wrote: > > On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@redhat.com> wrote: > >> >> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: >> >>> I am seeing occasional data corruption during raid1 resync. >>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. >>> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. >>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. >>> Once I validate that the fix indeed works, I will post a proper patch. >>> Do you have any feedback? >>> >>> — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 >>> +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 >>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf >>> * handling. >>> */ >>> wait_event_lock_irq(conf->wait_barrier, >>> - !conf->array_frozen && >>> + !conf->array_frozen && !conf->nr_pending && >>> conf->barrier < RESYNC_DEPTH && >>> (conf->start_next_window >= >>> conf->next_resync + RESYNC_SECTORS), >> >> This patch does not work - at least, it doesn't fix the issues I'm seeing. My system hangs (in various places, like the resync thread) after commit 79ef3a8. When testing this patch, I also added some code to dm-raid.c to allow me to print-out some of the variables when I encounter a problem. After applying this patch and printing the variables, I see: >> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288 >> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -46 >> 5257 >> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562 >> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0 >> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0 >> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued = 0 >> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1 >> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0 >> >> Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up. >> >> brassow >> > > After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 I notice that wait_barrier() will now only exclude writes. User reads are not excluded even if the fall within the resync window. > The old implementation used to exclude both reads and writes while resync-IO is active. > Could this be a cause of data corruption? Not sure yet. It is also worth mentioning that a follow-on commit (8b592b6) was meant to fix problems with request counting. I did not have that patch applied when I gathered the above info. brassow -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-09-02 22:07 ` Eivind Sarto 2014-09-02 22:14 ` Brassow Jonathan @ 2014-09-02 23:55 ` NeilBrown 2014-09-03 0:48 ` Eivind Sarto 2014-09-03 1:18 ` Brassow Jonathan 1 sibling, 2 replies; 17+ messages in thread From: NeilBrown @ 2014-09-02 23:55 UTC (permalink / raw) To: Eivind Sarto; +Cc: Brassow Jonathan, linux-raid [-- Attachment #1: Type: text/plain, Size: 3245 bytes --] On Tue, 2 Sep 2014 15:07:26 -0700 Eivind Sarto <eivindsarto@gmail.com> wrote: > > On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@redhat.com> wrote: > > > > > On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: > > > >> I am seeing occasional data corruption during raid1 resync. > >> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. > >> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. > >> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. > >> Once I validate that the fix indeed works, I will post a proper patch. > >> Do you have any feedback? > >> > >> — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 > >> +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 > >> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf > >> * handling. > >> */ > >> wait_event_lock_irq(conf->wait_barrier, > >> - !conf->array_frozen && > >> + !conf->array_frozen && !conf->nr_pending && > >> conf->barrier < RESYNC_DEPTH && > >> (conf->start_next_window >= > >> conf->next_resync + RESYNC_SECTORS), > > > > This patch does not work - at least, it doesn't fix the issues I'm seeing. My system hangs (in various places, like the resync thread) after commit 79ef3a8. When testing this patch, I also added some code to dm-raid.c to allow me to print-out some of the variables when I encounter a problem. After applying this patch and printing the variables, I see: > > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288 > > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -46 > > 5257 > > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562 > > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0 > > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0 > > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued = 0 > > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1 > > Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0 > > > > Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up. > > > > brassow > > > > After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 I notice that wait_barrier() will now only exclude writes. User reads are not excluded even if the fall within the resync window. > The old implementation used to exclude both reads and writes while resync-IO is active. > Could this be a cause of data corruption? > Could be. From read_balance: if (conf->mddev->recovery_cp < MaxSector && (this_sector + sectors >= conf->next_resync)) choose_first = 1; else choose_first = 0; This used to be safe because a read immediately before next_resync would wait until all resync requests completed. But now that read requests don't block it isn't safe. Probably best to make this: choose_first = (conf->mddev->recovery_cp < this_sector + sectors); Can you test that? 
Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
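Spelling out the suggestion above as a diff against read_balance() (a sketch of the proposed change, not a tested patch):

    --- a/drivers/md/raid1.c
    +++ b/drivers/md/raid1.c
    @@ read_balance() @@
    -	if (conf->mddev->recovery_cp < MaxSector &&
    -	    (this_sector + sectors >= conf->next_resync))
    -		choose_first = 1;
    -	else
    -		choose_first = 0;
    +	choose_first = (conf->mddev->recovery_cp < this_sector + sectors);

Since recovery_cp records how far the resync has progressed, this makes any read that extends past the already-resynced region go to the first (in-sync) device, rather than only reads that reach next_resync.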
* Re: raid1 data corruption during resync 2014-09-02 23:55 ` NeilBrown @ 2014-09-03 0:48 ` Eivind Sarto 2014-09-03 1:18 ` Brassow Jonathan 1 sibling, 0 replies; 17+ messages in thread From: Eivind Sarto @ 2014-09-03 0:48 UTC (permalink / raw) To: NeilBrown; +Cc: Eivind Sarto, Brassow Jonathan, linux-raid On Sep 2, 2014, at 4:55 PM, NeilBrown <neilb@suse.de> wrote: > On Tue, 2 Sep 2014 15:07:26 -0700 Eivind Sarto <eivindsarto@gmail.com> wrote: > >> >> On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@redhat.com> wrote: >> >>> >>> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: >>> >>>> I am seeing occasional data corruption during raid1 resync. >>>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. >>>> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. >>>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. >>>> Once I validate that the fix indeed works, I will post a proper patch. >>>> Do you have any feedback? >>>> >>>> — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 >>>> +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 >>>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf >>>> * handling. >>>> */ >>>> wait_event_lock_irq(conf->wait_barrier, >>>> - !conf->array_frozen && >>>> + !conf->array_frozen && !conf->nr_pending && >>>> conf->barrier < RESYNC_DEPTH && >>>> (conf->start_next_window >= >>>> conf->next_resync + RESYNC_SECTORS), >>> >>> This patch does not work - at least, it doesn't fix the issues I'm seeing. My system hangs (in various places, like the resync thread) after commit 79ef3a8. When testing this patch, I also added some code to dm-raid.c to allow me to print-out some of the variables when I encounter a problem. After applying this patch and printing the variables, I see: >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -46 >>> 5257 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued = 0 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0 >>> >>> Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up. >>> >>> brassow >>> >> >> After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 I notice that wait_barrier() will now only exclude writes. User reads are not excluded even if the fall within the resync window. >> The old implementation used to exclude both reads and writes while resync-IO is active. >> Could this be a cause of data corruption? >> > > Could be. > From read_balance: > > if (conf->mddev->recovery_cp < MaxSector && > (this_sector + sectors >= conf->next_resync)) > choose_first = 1; > else > choose_first = 0; > > This used to be safe because a read immediately before next_resync would wait > until all resync requests completed. But now that read requests don't block > it isn't safe. > Probably best to make this: > > choose_first = (conf->mddev->recovery_cp < this_sector + sectors); > > Can you test that? > > Thanks, > NeilBrown I’ll give it a try. 
-eivind -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-09-02 23:55 ` NeilBrown 2014-09-03 0:48 ` Eivind Sarto @ 2014-09-03 1:18 ` Brassow Jonathan 2014-09-03 1:31 ` NeilBrown 1 sibling, 1 reply; 17+ messages in thread From: Brassow Jonathan @ 2014-09-03 1:18 UTC (permalink / raw) To: NeilBrown; +Cc: Eivind Sarto, linux-raid On Sep 2, 2014, at 6:55 PM, NeilBrown wrote: > On Tue, 2 Sep 2014 15:07:26 -0700 Eivind Sarto <eivindsarto@gmail.com> wrote: > >> >> On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@redhat.com> wrote: >> >>> >>> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: >>> >>>> I am seeing occasional data corruption during raid1 resync. >>>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. >>>> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. >>>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. >>>> Once I validate that the fix indeed works, I will post a proper patch. >>>> Do you have any feedback? >>>> >>>> — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 >>>> +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 >>>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf >>>> * handling. >>>> */ >>>> wait_event_lock_irq(conf->wait_barrier, >>>> - !conf->array_frozen && >>>> + !conf->array_frozen && !conf->nr_pending && >>>> conf->barrier < RESYNC_DEPTH && >>>> (conf->start_next_window >= >>>> conf->next_resync + RESYNC_SECTORS), >>> >>> This patch does not work - at least, it doesn't fix the issues I'm seeing. My system hangs (in various places, like the resync thread) after commit 79ef3a8. When testing this patch, I also added some code to dm-raid.c to allow me to print-out some of the variables when I encounter a problem. After applying this patch and printing the variables, I see: >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -46 >>> 5257 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued = 0 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1 >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0 >>> >>> Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up. >>> >>> brassow >>> >> >> After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 I notice that wait_barrier() will now only exclude writes. User reads are not excluded even if the fall within the resync window. >> The old implementation used to exclude both reads and writes while resync-IO is active. >> Could this be a cause of data corruption? >> > > Could be. > From read_balance: > > if (conf->mddev->recovery_cp < MaxSector && > (this_sector + sectors >= conf->next_resync)) > choose_first = 1; > else > choose_first = 0; > > This used to be safe because a read immediately before next_resync would wait > until all resync requests completed. But now that read requests don't block > it isn't safe. > Probably best to make this: > > choose_first = (conf->mddev->recovery_cp < this_sector + sectors); > > Can you test that? 
Unfortunately, I doubt very much that that will solve the issue that I am seeing. (Although, I'd welcome this fix also.) I have predominantly been testing RHEL kernels, but I've seen this with the latest upstream kernel as well. When doing fault testing, like failing a drive while the resync is happening, the processes can become hung. Here's one example, but they vary: INFO: task mdX_resync:11334 blocked for more than 120 seconds. Not tainted 2.6.32-497.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. mdX_resync D 0000000000000000 0 11334 2 0x00000080 ffff8801fada5be0 0000000000000046 ffff88021a3bcc00 0000000300000001 ffff880218b92200 ffff880218668f50 0000000000000282 0000000000000003 0000000000000001 0000000000000282 ffff88021bfb45f8 ffff8801fada5fd8 Call Trace: [<ffffffff8140e439>] ? md_wakeup_thread+0x39/0x70 [<ffffffffa02b81dd>] sync_request+0x26d/0xb90 [raid1] [<ffffffff81064b90>] ? default_wake_function+0x0/0x20 [<ffffffff810577e9>] ? __wake_up_common+0x59/0x90 [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8105bd23>] ? __wake_up+0x53/0x70 [<ffffffff81414a47>] md_do_sync+0x6c7/0xd20 [<ffffffff814154f5>] md_thread+0x115/0x150 [<ffffffff814153e0>] ? md_thread+0x0/0x150 [<ffffffff8109e66e>] kthread+0x9e/0xc0 [<ffffffff8100c20a>] child_rip+0xa/0x20 [<ffffffff8109e5d0>] ? kthread+0x0/0xc0 [<ffffffff8100c200>] ? child_rip+0x0/0x20 and... INFO: task kjournald:10732 blocked for more than 120 seconds. Not tainted 2.6.32-497.el6.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kjournald D 0000000000000008 0 10732 2 0x00000080 ffff8801d71238b0 0000000000000046 0000000000000000 0000000300000001 ffff880218b92200 ffff880218668f50 000003783fd7424b 0000000000000003 0000000000000001 000000010035a163 ffff880219135af8 ffff8801d7123fd8 Call Trace: [<ffffffffa02b463f>] wait_barrier+0xcf/0x280 [raid1] [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40 [<ffffffffa02b71bf>] make_request+0x8f/0xe40 [raid1] [<ffffffff81126af5>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff81126c93>] ? mempool_alloc+0x63/0x140 [<ffffffff811c9d6b>] ? bio_alloc_bioset+0x5b/0xf0 [<ffffffffa02cf018>] raid_map+0x18/0x20 [dm_raid] Bisecting has revealed that the commit that is the subject of this thread is the root of the problem. It does take a long time to reproduce the issue with the scripts that I have, but if there is info I can get you, let me know. brassow -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-09-03 1:18 ` Brassow Jonathan @ 2014-09-03 1:31 ` NeilBrown 2014-09-03 1:45 ` Brassow Jonathan 0 siblings, 1 reply; 17+ messages in thread From: NeilBrown @ 2014-09-03 1:31 UTC (permalink / raw) To: Brassow Jonathan; +Cc: Eivind Sarto, linux-raid [-- Attachment #1: Type: text/plain, Size: 7491 bytes --] On Tue, 2 Sep 2014 20:18:06 -0500 Brassow Jonathan <jbrassow@redhat.com> wrote: > > On Sep 2, 2014, at 6:55 PM, NeilBrown wrote: > > > On Tue, 2 Sep 2014 15:07:26 -0700 Eivind Sarto <eivindsarto@gmail.com> wrote: > > > >> > >> On Sep 2, 2014, at 12:24 PM, Brassow Jonathan <jbrassow@redhat.com> wrote: > >> > >>> > >>> On Aug 29, 2014, at 2:29 PM, Eivind Sarto wrote: > >>> > >>>> I am seeing occasional data corruption during raid1 resync. > >>>> Reviewing the raid1 code, I suspect that commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 introduced a bug. > >>>> Prior to this commit raise_barrier() used to wait for conf->nr_pending to become zero. It no longer does this. > >>>> It is not easy to reproduce the corruption, so I wanted to ask about the following potential fix while I am still testing it. > >>>> Once I validate that the fix indeed works, I will post a proper patch. > >>>> Do you have any feedback? > >>>> > >>>> — drivers/md/raid1.c 2014-08-22 15:19:15.000000000 -0700 > >>>> +++ /tmp/raid1.c 2014-08-29 12:07:51.000000000 -0700 > >>>> @@ -851,7 +851,7 @@ static void raise_barrier(struct r1conf > >>>> * handling. > >>>> */ > >>>> wait_event_lock_irq(conf->wait_barrier, > >>>> - !conf->array_frozen && > >>>> + !conf->array_frozen && !conf->nr_pending && > >>>> conf->barrier < RESYNC_DEPTH && > >>>> (conf->start_next_window >= > >>>> conf->next_resync + RESYNC_SECTORS), > >>> > >>> This patch does not work - at least, it doesn't fix the issues I'm seeing. My system hangs (in various places, like the resync thread) after commit 79ef3a8. When testing this patch, I also added some code to dm-raid.c to allow me to print-out some of the variables when I encounter a problem. After applying this patch and printing the variables, I see: > >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: start_next_window = 12288 > >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: current_window_requests = -46 > >>> 5257 > >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: next_window_requests = -11562 > >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_pending = 0 > >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_waiting = 0 > >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: nr_queued = 0 > >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: barrier = 1 > >>> Sep 2 14:04:15 bp-01 kernel: device-mapper: raid: array_frozen = 0 > >>> > >>> Some of those values look pretty bizarre to me and suggest the accounting is pretty messed up. > >>> > >>> brassow > >>> > >> > >> After reviewing commit 79ef3a8aa1cb1523cc231c9a90a278333c21f761 I notice that wait_barrier() will now only exclude writes. User reads are not excluded even if the fall within the resync window. > >> The old implementation used to exclude both reads and writes while resync-IO is active. > >> Could this be a cause of data corruption? > >> > > > > Could be. > > From read_balance: > > > > if (conf->mddev->recovery_cp < MaxSector && > > (this_sector + sectors >= conf->next_resync)) > > choose_first = 1; > > else > > choose_first = 0; > > > > This used to be safe because a read immediately before next_resync would wait > > until all resync requests completed. 
But now that read requests don't block > > it isn't safe. > > Probably best to make this: > > > > choose_first = (conf->mddev->recovery_cp < this_sector + sectors); > > > > Can you test that? > > Unfortunately, I doubt very much that that will solve the issue that I am seeing. (Although, I'd welcome this fix also.) > > I have predominantly been testing RHEL kernels, but I've seen this with the latest upstream kernel as well. When doing fault testing, like failing a drive while the resync is happening, the processes can become hung. Here's one example, but they vary: > > INFO: task mdX_resync:11334 blocked for more than 120 seconds. > Not tainted 2.6.32-497.el6.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > mdX_resync D 0000000000000000 0 11334 2 0x00000080 > ffff8801fada5be0 0000000000000046 ffff88021a3bcc00 0000000300000001 > ffff880218b92200 ffff880218668f50 0000000000000282 0000000000000003 > 0000000000000001 0000000000000282 ffff88021bfb45f8 ffff8801fada5fd8 > Call Trace: > [<ffffffff8140e439>] ? md_wakeup_thread+0x39/0x70 > [<ffffffffa02b81dd>] sync_request+0x26d/0xb90 [raid1] > [<ffffffff81064b90>] ? default_wake_function+0x0/0x20 > [<ffffffff810577e9>] ? __wake_up_common+0x59/0x90 > [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40 > [<ffffffff8105bd23>] ? __wake_up+0x53/0x70 > [<ffffffff81414a47>] md_do_sync+0x6c7/0xd20 > [<ffffffff814154f5>] md_thread+0x115/0x150 > [<ffffffff814153e0>] ? md_thread+0x0/0x150 > [<ffffffff8109e66e>] kthread+0x9e/0xc0 > [<ffffffff8100c20a>] child_rip+0xa/0x20 > [<ffffffff8109e5d0>] ? kthread+0x0/0xc0 > [<ffffffff8100c200>] ? child_rip+0x0/0x20 > > > and... > > > INFO: task kjournald:10732 blocked for more than 120 seconds. > Not tainted 2.6.32-497.el6.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > kjournald D 0000000000000008 0 10732 2 0x00000080 > ffff8801d71238b0 0000000000000046 0000000000000000 0000000300000001 > ffff880218b92200 ffff880218668f50 000003783fd7424b 0000000000000003 > 0000000000000001 000000010035a163 ffff880219135af8 ffff8801d7123fd8 > Call Trace: > [<ffffffffa02b463f>] wait_barrier+0xcf/0x280 [raid1] > [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40 > [<ffffffffa02b71bf>] make_request+0x8f/0xe40 [raid1] > [<ffffffff81126af5>] ? mempool_alloc_slab+0x15/0x20 > [<ffffffff81126c93>] ? mempool_alloc+0x63/0x140 > [<ffffffff811c9d6b>] ? bio_alloc_bioset+0x5b/0xf0 > [<ffffffffa02cf018>] raid_map+0x18/0x20 [dm_raid] > > Bisecting has revealed that the commit that is the subject of this thread is the root of the problem. It does take a long time to reproduce the issue with the scripts that I have, but if there is info I can get you, let me know. > > brassow Thanks for the details Jon. I don't suppose you have a stack trace from an upstream kernel? Not that I doubt you have one, and I suspect it is very similar. But I've learnt not to trust "suspect" and "presume" and "no doubt". I'll have a look anyway. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-09-03 1:31 ` NeilBrown @ 2014-09-03 1:45 ` Brassow Jonathan 2014-09-03 21:39 ` Brassow Jonathan 0 siblings, 1 reply; 17+ messages in thread From: Brassow Jonathan @ 2014-09-03 1:45 UTC (permalink / raw) To: NeilBrown; +Cc: Eivind Sarto, linux-raid On Sep 2, 2014, at 8:31 PM, NeilBrown wrote: > Thanks for the details Jon. I don't suppose you have a stack trace from an > upstream kernel? Not that I doubt you have one, and I suspect it is very > similar. But I've learnt not to trust "suspect" and "presume" and "no doubt". > > I'll have a look anyway. Indeed - me too. I'll get started on recreating it on the latest upstream kernel now. It does take several hours to reproduce on average, so I likely won't send the update until tomorrow. brassow ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: raid1 data corruption during resync 2014-09-03 1:45 ` Brassow Jonathan @ 2014-09-03 21:39 ` Brassow Jonathan 2014-09-04 5:28 ` NeilBrown 0 siblings, 1 reply; 17+ messages in thread From: Brassow Jonathan @ 2014-09-03 21:39 UTC (permalink / raw) To: NeilBrown; +Cc: Eivind Sarto, linux-raid@vger.kernel.org Raid On Sep 2, 2014, at 8:45 PM, Brassow Jonathan wrote: > > On Sep 2, 2014, at 8:31 PM, NeilBrown wrote: > >> Thanks for the details Jon. I don't suppose you have a stack trace from an >> upstream kernel? Not that I doubt you have one, and I suspect it is very >> similar. But I've learnt not to trust "suspect" and "presume" and "no doubt". >> >> I'll have a look anyway. > > Indeed - me too. I'll get started on recreating it on the latest upstream kernel now. It does take several hours to reproduce on average, so I likely won't send the update until tomorrow. > > brassow The test being run does the following: 1) Writes I/O to an FS which is sitting on a RAID dev (created via LVM, but using MD personalities) 2) The RAID dev is undergoing its initial resync 3) The device in slot 0 suffers a failure. (I can't be sure at the moment if I've seen this trigger from a device failure from a different slot.) Sorry the test is so involved. I haven't been able to create a simple reproducer and the tests I have take quite a while (i.e. a number of iterations) to hit the problem. These backtraces came after 45min, which is actually really quick compared to what I've seen. The kernel I used was 3.17.0-rc3. It is unmodified - I have not applied the patch for reads to potentially resync'ing areas. brassow Sep 3 09:18:36 bp-01 kernel: INFO: task kworker/u130:1:6600 blocked for more than 120 seconds. Sep 3 09:18:36 bp-01 kernel: Tainted: G E 3.17.0-rc3 #1 Sep 3 09:18:36 bp-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 3 09:18:36 bp-01 kernel: kworker/u130:1 D 0000000000000008 0 6600 2 0x00000080 Sep 3 09:18:36 bp-01 kernel: Workqueue: writeback bdi_writeback_workfn (flush-253:11) Sep 3 09:18:36 bp-01 kernel: ffff880415b3b868 0000000000000046 ffff880415b3b858 ffff88021727efc0 Sep 3 09:18:36 bp-01 kernel: ffff880415b38010 0000000000012bc0 0000000000012bc0 ffff8804024e4150 Sep 3 09:18:36 bp-01 kernel: ffff880415b3b838 ffff88041ea52bc0 ffff8804024e4150 ffff880415b3b920 Sep 3 09:18:36 bp-01 kernel: Call Trace: Sep 3 09:18:36 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180 Sep 3 09:18:36 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70 Sep 3 09:18:36 bp-01 kernel: [<ffffffff8158061c>] io_schedule+0x8c/0xd0 Sep 3 09:18:36 bp-01 kernel: [<ffffffff8158085c>] bit_wait_io+0x2c/0x50 Sep 3 09:18:36 bp-01 kernel: [<ffffffff81580946>] __wait_on_bit_lock+0x76/0xb0 Sep 3 09:18:36 bp-01 kernel: [<ffffffff81135057>] __lock_page+0x67/0x70 Sep 3 09:18:36 bp-01 kernel: [<ffffffff8108eb90>] ? wake_atomic_t_function+0x40/0x40 Sep 3 09:18:36 bp-01 kernel: [<ffffffff81141468>] write_cache_pages+0x318/0x510 Sep 3 09:18:36 bp-01 kernel: [<ffffffffa01d8b5a>] ? ext4_writepages+0x5a/0x6a0 [ext4] Sep 3 09:18:36 bp-01 kernel: [<ffffffff8113ffc0>] ? set_page_dirty+0x60/0x60 Sep 3 09:18:36 bp-01 kernel: [<ffffffff811416b1>] generic_writepages+0x51/0x80 Sep 3 09:18:36 bp-01 kernel: [<ffffffff81141715>] do_writepages+0x35/0x40 Sep 3 09:18:36 bp-01 kernel: [<ffffffff811bfcf9>] __writeback_single_inode+0x49/0x230 Sep 3 09:18:36 bp-01 kernel: [<ffffffff8108eeef>] ? 
wake_up_bit+0x2f/0x40
Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c3139>] writeback_sb_inodes+0x249/0x360
Sep 3 09:18:36 bp-01 kernel: [<ffffffff8119a415>] ? put_super+0x25/0x40
Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c32ee>] __writeback_inodes_wb+0x9e/0xd0
Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c351b>] wb_writeback+0x1fb/0x2c0
Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c3662>] wb_do_writeback+0x82/0x1f0
Sep 3 09:18:36 bp-01 kernel: [<ffffffff81078fda>] ? ttwu_do_activate.clone.0+0x4a/0x60
Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c3840>] bdi_writeback_workfn+0x70/0x210
Sep 3 09:18:36 bp-01 kernel: [<ffffffff8106b762>] process_one_work+0x182/0x450
Sep 3 09:18:36 bp-01 kernel: [<ffffffff8106bb4f>] worker_thread+0x11f/0x3c0
Sep 3 09:18:36 bp-01 kernel: [<ffffffff8106ba30>] ? process_one_work+0x450/0x450
Sep 3 09:18:36 bp-01 kernel: [<ffffffff8107083e>] kthread+0xce/0xf0
Sep 3 09:18:36 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
Sep 3 09:18:36 bp-01 kernel: [<ffffffff8158412c>] ret_from_fork+0x7c/0xb0
Sep 3 09:18:36 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
Sep 3 09:18:36 bp-01 kernel: INFO: task kworker/0:4:8765 blocked for more than 120 seconds.
Sep 3 09:18:36 bp-01 kernel: Tainted: G E 3.17.0-rc3 #1
Sep 3 09:18:36 bp-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 3 09:18:36 bp-01 kernel: kworker/0:4 D 0000000000000000 0 8765 2 0x00000080
Sep 3 09:18:36 bp-01 kernel: Workqueue: kcopyd do_work [dm_mod]
Sep 3 09:18:36 bp-01 kernel: ffff8802102ab898 0000000000000046 0000000000000000 ffffffff81a19480
Sep 3 09:18:36 bp-01 kernel: ffff8802102a8010 0000000000012bc0 0000000000012bc0 ffff880215950010
Sep 3 09:18:36 bp-01 kernel: ffff8802102ab898 ffff88040171e100 ffff88040171e178 ffff8802102ab8c8
Sep 3 09:18:36 bp-01 kernel: Call Trace:
Sep 3 09:18:36 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70
Sep 3 09:18:36 bp-01 kernel: [<ffffffffa0389e99>] wait_barrier+0xb9/0x260 [raid1]
Sep 3 09:18:36 bp-01 kernel: [<ffffffff8108eb00>] ? bit_waitqueue+0xe0/0xe0
Sep 3 09:18:36 bp-01 kernel: [<ffffffff8108ee13>] ? __wake_up+0x53/0x70
Sep 3 09:18:36 bp-01 kernel: [<ffffffffa038caaa>] make_request+0x9a/0xc00 [raid1]
Sep 3 09:18:37 bp-01 lvm[2109]: device-mapper: waitevent ioctl on (LVM-PSbaO1Zwzw2wYurZrkrTkI5grmcS3IQ3ot9VJPwplnllFf45AU5vBkbtjfLU0adG) failed: Interrupted system call
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137f60>] ? mempool_alloc+0x60/0x170
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81186ea8>] ? cache_alloc_refill+0x98/0x2c0
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03a4018>] raid_map+0x18/0x20 [dm_raid]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000336a>] __map_bio+0x4a/0x120 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0003723>] __clone_and_map_data_bio+0x113/0x130 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa00037ac>] __split_and_process_non_flush+0x6c/0xb0 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0003991>] __split_and_process_bio+0x1a1/0x200 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0003b12>] _dm_request+0x122/0x190 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0003ba8>] dm_request+0x28/0x40 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffff812693b0>] generic_make_request+0xc0/0x100
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba00c>] pending_complete+0x13c/0x2a0 [dm_snapshot]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba29e>] commit_callback+0xe/0x10 [dm_snapshot]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03bcea8>] persistent_commit_exception+0xb8/0x160 [dm_snapshot]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba1a3>] complete_exception+0x33/0x50 [dm_snapshot]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba26a>] copy_callback+0xaa/0xd0 [dm_snapshot]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba1c0>] ? complete_exception+0x50/0x50 [dm_snapshot]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000b788>] run_complete_job+0x98/0xe0 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000b21b>] process_jobs+0x5b/0x100 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000b6f0>] ? dispatch_job+0x80/0x80 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000b2ff>] do_work+0x3f/0x80 [dm_mod]
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8106b762>] process_one_work+0x182/0x450
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8106bb4f>] worker_thread+0x11f/0x3c0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8106ba30>] ? process_one_work+0x450/0x450
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8107083e>] kthread+0xce/0xf0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158412c>] ret_from_fork+0x7c/0xb0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
Sep 3 09:18:37 bp-01 kernel: INFO: task kjournald:12268 blocked for more than 120 seconds.
Sep 3 09:18:37 bp-01 kernel: Tainted: G E 3.17.0-rc3 #1
Sep 3 09:18:37 bp-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 3 09:18:37 bp-01 kernel: kjournald D 0000000000000000 0 12268 2 0x00000080
Sep 3 09:18:37 bp-01 kernel: ffff88040d373c08 0000000000000046 ffff88040d373bd8 ffffffff81a19480
Sep 3 09:18:37 bp-01 kernel: ffff88040d370010 0000000000012bc0 0000000000012bc0 ffff8804153a0e80
Sep 3 09:18:37 bp-01 kernel: ffff88040d373c08 ffff88021fa12bc0 ffff8804153a0e80 ffffffff81580830
Sep 3 09:18:37 bp-01 kernel: Call Trace:
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158061c>] io_schedule+0x8c/0xd0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158085c>] bit_wait_io+0x2c/0x50
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580b45>] __wait_on_bit+0x65/0x90
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580be8>] out_of_line_wait_on_bit+0x78/0x90
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8108eb90>] ? wake_atomic_t_function+0x40/0x40
Sep 3 09:18:37 bp-01 kernel: [<ffffffff811ca14e>] __wait_on_buffer+0x2e/0x30
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0448372>] journal_commit_transaction+0x872/0xf80 [jbd]
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8108847f>] ? put_prev_entity+0x2f/0x400
Sep 3 09:18:37 bp-01 kernel: [<ffffffff810b21bb>] ? try_to_del_timer_sync+0x5b/0x70
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa044bae1>] kjournald+0xf1/0x270 [jbd]
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8108eb00>] ? bit_waitqueue+0xe0/0xe0
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa044b9f0>] ? commit_timeout+0x10/0x10 [jbd]
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8107083e>] kthread+0xce/0xf0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158412c>] ret_from_fork+0x7c/0xb0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
Sep 3 09:18:37 bp-01 kernel: INFO: task xdoio:12364 blocked for more than 120 seconds.
Sep 3 09:18:37 bp-01 kernel: Tainted: G E 3.17.0-rc3 #1
Sep 3 09:18:37 bp-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 3 09:18:37 bp-01 kernel: xdoio D 0000000000000005 0 12364 12363 0x00000080
Sep 3 09:18:37 bp-01 kernel: ffff8804037af978 0000000000000082 ffff8804037af948 ffff88021726c1d0
Sep 3 09:18:37 bp-01 kernel: ffff8804037ac010 0000000000012bc0 0000000000012bc0 ffff8804063a0050
Sep 3 09:18:37 bp-01 kernel: ffff8804037af978 ffff88021fab2bc0 ffff8804063a0050 ffff8804037afa30
Sep 3 09:18:37 bp-01 kernel: Call Trace:
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158061c>] io_schedule+0x8c/0xd0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158085c>] bit_wait_io+0x2c/0x50
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580946>] __wait_on_bit_lock+0x76/0xb0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
Sep 3 09:18:37 bp-01 kernel: [<ffffffff815809f8>] out_of_line_wait_on_bit_lock+0x78/0x90
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8108eb90>] ? wake_atomic_t_function+0x40/0x40
Sep 3 09:18:37 bp-01 kernel: [<ffffffff811ca36e>] __lock_buffer+0x2e/0x30
Sep 3 09:18:37 bp-01 kernel: [<ffffffff811cb520>] __sync_dirty_buffer+0xa0/0xd0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff811cb563>] sync_dirty_buffer+0x13/0x20
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0446e12>] journal_dirty_data+0x1f2/0x260 [jbd]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0464200>] ext3_journal_dirty_data+0x20/0x50 [ext3]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0464255>] journal_dirty_data_fn+0x25/0x30 [ext3]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa04638d4>] walk_page_buffers+0x84/0xc0 [ext3]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0464230>] ? ext3_journal_dirty_data+0x50/0x50 [ext3]
Sep 3 09:18:37 bp-01 kernel: [<ffffffffa046797f>] ext3_ordered_write_end+0xaf/0x1e0 [ext3]
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81134b42>] generic_perform_write+0x112/0x1c0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137aaf>] __generic_file_write_iter+0x18f/0x390
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137cfe>] generic_file_write_iter+0x4e/0xd0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81197347>] do_iter_readv_writev+0x77/0xb0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff8122e483>] ? security_file_permission+0x23/0x90
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137cb0>] ? __generic_file_write_iter+0x390/0x390
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137cb0>] ? __generic_file_write_iter+0x390/0x390
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81198a60>] do_readv_writev+0xd0/0x320
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137cb0>] ? __generic_file_write_iter+0x390/0x390
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81197400>] ? do_sync_readv_writev+0x80/0x80
Sep 3 09:18:37 bp-01 kernel: [<ffffffff810c724f>] ? do_futex+0xaf/0x1b0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81198cf1>] vfs_writev+0x41/0x50
Sep 3 09:18:37 bp-01 kernel: [<ffffffff81198e26>] SyS_writev+0x56/0xf0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff810ec456>] ? __audit_syscall_exit+0x216/0x2c0
Sep 3 09:18:37 bp-01 kernel: [<ffffffff815841d2>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 17+ messages in thread

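A note for readers unfamiliar with the raid1 barrier code these blocked tasks are sitting in: the sketch below is a heavily simplified paraphrase of the wait_barrier()/raise_barrier() pair in drivers/md/raid1.c of this era, not the verbatim source. The real functions also track the resync window (start_next_window, current_window_requests, next_window_requests, RESYNC_DEPTH, array_frozen), all omitted here; the point is only to show the handshake that has stalled.

/*
 * Simplified sketch, assuming the 3.17-era raid1 barrier scheme.
 * Normal I/O and resync exclude each other via conf->barrier and
 * conf->nr_pending, both protected by conf->resync_lock.
 */
static void wait_barrier(struct r1conf *conf)	/* normal I/O: make_request() */
{
	spin_lock_irq(&conf->resync_lock);
	/* sleep while resync holds the barrier over this region */
	wait_event_lock_irq(conf->wait_barrier, !conf->barrier,
			    conf->resync_lock);
	conf->nr_pending++;
	spin_unlock_irq(&conf->resync_lock);
}

static void raise_barrier(struct r1conf *conf)	/* resync: sync_request() */
{
	spin_lock_irq(&conf->resync_lock);
	conf->barrier++;
	/* sleep until the normal I/O already admitted has drained */
	wait_event_lock_irq(conf->wait_barrier, conf->nr_pending == 0,
			    conf->resync_lock);
	spin_unlock_irq(&conf->resync_lock);
}

In the traces above, kcopyd and the flusher thread are inside the first wait (make_request -> wait_barrier), while the mdX_resync trace shown elsewhere in this thread is inside the second (sync_request -> raise_barrier). If the pending/window accounting ever becomes inconsistent, neither wait can complete.
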
* Re: raid1 data corruption during resync
  2014-09-03 21:39 ` Brassow Jonathan
@ 2014-09-04 5:28   ` NeilBrown
  0 siblings, 0 replies; 17+ messages in thread
From: NeilBrown @ 2014-09-04 5:28 UTC (permalink / raw)
  To: Brassow Jonathan; +Cc: Eivind Sarto, linux-raid@vger.kernel.org Raid

[-- Attachment #1: Type: text/plain, Size: 14911 bytes --]

On Wed, 3 Sep 2014 16:39:52 -0500 Brassow Jonathan <jbrassow@redhat.com> wrote:

> 
> On Sep 2, 2014, at 8:45 PM, Brassow Jonathan wrote:
> 
> > 
> > On Sep 2, 2014, at 8:31 PM, NeilBrown wrote:
> > 
> >> Thanks for the details Jon. I don't suppose you have a stack trace from an
> >> upstream kernel? Not that I doubt you have one, and I suspect it is very
> >> similar. But I've learnt not to trust "suspect" and "presume" and "no doubt".
> >> 
> >> I'll have a look anyway.
> > 
> > Indeed - me too. I'll get started on recreating it on the latest upstream kernel now. It does take several hours to reproduce on average, so I likely won't send the update until tomorrow.
> > 
> > brassow
> 
> The test being run does the following:
> 1) Writes I/O to an FS which is sitting on a RAID dev (created via LVM, but using MD personalities)
> 2) The RAID dev is undergoing its initial resync
> 3) The device in slot 0 suffers a failure. (I can't be sure at the moment if I've seen this trigger from a device failure from a different slot.)
> Sorry the test is so involved. I haven't been able to create a simple reproducer and the tests I have take quite a while (i.e. a number of iterations) to hit the problem. These backtraces came after 45min, which is actually really quick compared to what I've seen.
> 
> The kernel I used was 3.17.0-rc3. It is unmodified - I have not applied the patch for reads to potentially resync'ing areas.
> 
> brassow
> 
> Sep 3 09:18:36 bp-01 kernel: INFO: task kworker/u130:1:6600 blocked for more than 120 seconds.
> Sep 3 09:18:36 bp-01 kernel: Tainted: G E 3.17.0-rc3 #1
> Sep 3 09:18:36 bp-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 3 09:18:36 bp-01 kernel: kworker/u130:1 D 0000000000000008 0 6600 2 0x00000080
> Sep 3 09:18:36 bp-01 kernel: Workqueue: writeback bdi_writeback_workfn (flush-253:11)
> Sep 3 09:18:36 bp-01 kernel: ffff880415b3b868 0000000000000046 ffff880415b3b858 ffff88021727efc0
> Sep 3 09:18:36 bp-01 kernel: ffff880415b38010 0000000000012bc0 0000000000012bc0 ffff8804024e4150
> Sep 3 09:18:36 bp-01 kernel: ffff880415b3b838 ffff88041ea52bc0 ffff8804024e4150 ffff880415b3b920
> Sep 3 09:18:36 bp-01 kernel: Call Trace:
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8158061c>] io_schedule+0x8c/0xd0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8158085c>] bit_wait_io+0x2c/0x50
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81580946>] __wait_on_bit_lock+0x76/0xb0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81135057>] __lock_page+0x67/0x70
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8108eb90>] ? wake_atomic_t_function+0x40/0x40
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81141468>] write_cache_pages+0x318/0x510
> Sep 3 09:18:36 bp-01 kernel: [<ffffffffa01d8b5a>] ? ext4_writepages+0x5a/0x6a0 [ext4]
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8113ffc0>] ? set_page_dirty+0x60/0x60
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff811416b1>] generic_writepages+0x51/0x80
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81141715>] do_writepages+0x35/0x40
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff811bfcf9>] __writeback_single_inode+0x49/0x230
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8108eeef>] ? wake_up_bit+0x2f/0x40
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c3139>] writeback_sb_inodes+0x249/0x360
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8119a415>] ? put_super+0x25/0x40
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c32ee>] __writeback_inodes_wb+0x9e/0xd0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c351b>] wb_writeback+0x1fb/0x2c0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c3662>] wb_do_writeback+0x82/0x1f0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81078fda>] ? ttwu_do_activate.clone.0+0x4a/0x60
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff811c3840>] bdi_writeback_workfn+0x70/0x210
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8106b762>] process_one_work+0x182/0x450
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8106bb4f>] worker_thread+0x11f/0x3c0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8106ba30>] ? process_one_work+0x450/0x450
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8107083e>] kthread+0xce/0xf0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8158412c>] ret_from_fork+0x7c/0xb0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
> Sep 3 09:18:36 bp-01 kernel: INFO: task kworker/0:4:8765 blocked for more than 120 seconds.
> Sep 3 09:18:36 bp-01 kernel: Tainted: G E 3.17.0-rc3 #1
> Sep 3 09:18:36 bp-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 3 09:18:36 bp-01 kernel: kworker/0:4 D 0000000000000000 0 8765 2 0x00000080
> Sep 3 09:18:36 bp-01 kernel: Workqueue: kcopyd do_work [dm_mod]
> Sep 3 09:18:36 bp-01 kernel: ffff8802102ab898 0000000000000046 0000000000000000 ffffffff81a19480
> Sep 3 09:18:36 bp-01 kernel: ffff8802102a8010 0000000000012bc0 0000000000012bc0 ffff880215950010
> Sep 3 09:18:36 bp-01 kernel: ffff8802102ab898 ffff88040171e100 ffff88040171e178 ffff8802102ab8c8
> Sep 3 09:18:36 bp-01 kernel: Call Trace:
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70
> Sep 3 09:18:36 bp-01 kernel: [<ffffffffa0389e99>] wait_barrier+0xb9/0x260 [raid1]
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8108eb00>] ? bit_waitqueue+0xe0/0xe0
> Sep 3 09:18:36 bp-01 kernel: [<ffffffff8108ee13>] ? __wake_up+0x53/0x70
> Sep 3 09:18:36 bp-01 kernel: [<ffffffffa038caaa>] make_request+0x9a/0xc00 [raid1]
> Sep 3 09:18:37 bp-01 lvm[2109]: device-mapper: waitevent ioctl on (LVM-PSbaO1Zwzw2wYurZrkrTkI5grmcS3IQ3ot9VJPwplnllFf45AU5vBkbtjfLU0adG) failed: Interrupted system call
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137f60>] ? mempool_alloc+0x60/0x170
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81186ea8>] ? cache_alloc_refill+0x98/0x2c0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03a4018>] raid_map+0x18/0x20 [dm_raid]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000336a>] __map_bio+0x4a/0x120 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0003723>] __clone_and_map_data_bio+0x113/0x130 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa00037ac>] __split_and_process_non_flush+0x6c/0xb0 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0003991>] __split_and_process_bio+0x1a1/0x200 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0003b12>] _dm_request+0x122/0x190 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0003ba8>] dm_request+0x28/0x40 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff812693b0>] generic_make_request+0xc0/0x100
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba00c>] pending_complete+0x13c/0x2a0 [dm_snapshot]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba29e>] commit_callback+0xe/0x10 [dm_snapshot]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03bcea8>] persistent_commit_exception+0xb8/0x160 [dm_snapshot]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba1a3>] complete_exception+0x33/0x50 [dm_snapshot]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba26a>] copy_callback+0xaa/0xd0 [dm_snapshot]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa03ba1c0>] ? complete_exception+0x50/0x50 [dm_snapshot]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000b788>] run_complete_job+0x98/0xe0 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000b21b>] process_jobs+0x5b/0x100 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000b6f0>] ? dispatch_job+0x80/0x80 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa000b2ff>] do_work+0x3f/0x80 [dm_mod]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8106b762>] process_one_work+0x182/0x450
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8106bb4f>] worker_thread+0x11f/0x3c0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8106ba30>] ? process_one_work+0x450/0x450
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8107083e>] kthread+0xce/0xf0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158412c>] ret_from_fork+0x7c/0xb0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
> Sep 3 09:18:37 bp-01 kernel: INFO: task kjournald:12268 blocked for more than 120 seconds.
> Sep 3 09:18:37 bp-01 kernel: Tainted: G E 3.17.0-rc3 #1
> Sep 3 09:18:37 bp-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 3 09:18:37 bp-01 kernel: kjournald D 0000000000000000 0 12268 2 0x00000080
> Sep 3 09:18:37 bp-01 kernel: ffff88040d373c08 0000000000000046 ffff88040d373bd8 ffffffff81a19480
> Sep 3 09:18:37 bp-01 kernel: ffff88040d370010 0000000000012bc0 0000000000012bc0 ffff8804153a0e80
> Sep 3 09:18:37 bp-01 kernel: ffff88040d373c08 ffff88021fa12bc0 ffff8804153a0e80 ffffffff81580830
> Sep 3 09:18:37 bp-01 kernel: Call Trace:
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158061c>] io_schedule+0x8c/0xd0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158085c>] bit_wait_io+0x2c/0x50
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580b45>] __wait_on_bit+0x65/0x90
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580be8>] out_of_line_wait_on_bit+0x78/0x90
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8108eb90>] ? wake_atomic_t_function+0x40/0x40
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff811ca14e>] __wait_on_buffer+0x2e/0x30
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0448372>] journal_commit_transaction+0x872/0xf80 [jbd]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8108847f>] ? put_prev_entity+0x2f/0x400
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff810b21bb>] ? try_to_del_timer_sync+0x5b/0x70
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa044bae1>] kjournald+0xf1/0x270 [jbd]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8108eb00>] ? bit_waitqueue+0xe0/0xe0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa044b9f0>] ? commit_timeout+0x10/0x10 [jbd]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8107083e>] kthread+0xce/0xf0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158412c>] ret_from_fork+0x7c/0xb0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81070770>] ? kthread_freezable_should_stop+0x70/0x70
> Sep 3 09:18:37 bp-01 kernel: INFO: task xdoio:12364 blocked for more than 120 seconds.
> Sep 3 09:18:37 bp-01 kernel: Tainted: G E 3.17.0-rc3 #1
> Sep 3 09:18:37 bp-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Sep 3 09:18:37 bp-01 kernel: xdoio D 0000000000000005 0 12364 12363 0x00000080
> Sep 3 09:18:37 bp-01 kernel: ffff8804037af978 0000000000000082 ffff8804037af948 ffff88021726c1d0
> Sep 3 09:18:37 bp-01 kernel: ffff8804037ac010 0000000000012bc0 0000000000012bc0 ffff8804063a0050
> Sep 3 09:18:37 bp-01 kernel: ffff8804037af978 ffff88021fab2bc0 ffff8804063a0050 ffff8804037afa30
> Sep 3 09:18:37 bp-01 kernel: Call Trace:
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580549>] schedule+0x29/0x70
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158061c>] io_schedule+0x8c/0xd0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8158085c>] bit_wait_io+0x2c/0x50
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580946>] __wait_on_bit_lock+0x76/0xb0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81580830>] ? yield_to+0x180/0x180
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff815809f8>] out_of_line_wait_on_bit_lock+0x78/0x90
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8108eb90>] ? wake_atomic_t_function+0x40/0x40
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff811ca36e>] __lock_buffer+0x2e/0x30
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff811cb520>] __sync_dirty_buffer+0xa0/0xd0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff811cb563>] sync_dirty_buffer+0x13/0x20
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0446e12>] journal_dirty_data+0x1f2/0x260 [jbd]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0464200>] ext3_journal_dirty_data+0x20/0x50 [ext3]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0464255>] journal_dirty_data_fn+0x25/0x30 [ext3]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa04638d4>] walk_page_buffers+0x84/0xc0 [ext3]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa0464230>] ? ext3_journal_dirty_data+0x50/0x50 [ext3]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffffa046797f>] ext3_ordered_write_end+0xaf/0x1e0 [ext3]
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81134b42>] generic_perform_write+0x112/0x1c0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137aaf>] __generic_file_write_iter+0x18f/0x390
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137cfe>] generic_file_write_iter+0x4e/0xd0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81197347>] do_iter_readv_writev+0x77/0xb0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff8122e483>] ? security_file_permission+0x23/0x90
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137cb0>] ? __generic_file_write_iter+0x390/0x390
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137cb0>] ? __generic_file_write_iter+0x390/0x390
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81198a60>] do_readv_writev+0xd0/0x320
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81137cb0>] ? __generic_file_write_iter+0x390/0x390
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81197400>] ? do_sync_readv_writev+0x80/0x80
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff810c724f>] ? do_futex+0xaf/0x1b0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81198cf1>] vfs_writev+0x41/0x50
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff81198e26>] SyS_writev+0x56/0xf0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff810ec456>] ? __audit_syscall_exit+0x216/0x2c0
> Sep 3 09:18:37 bp-01 kernel: [<ffffffff815841d2>] system_call_fastpath+0x16/0x1b

Thanks a lot. I love having concrete data....

I had a theory. Then another. None of them have really stood up yet. But there are definitely problems with the code.

1/ When the resync aborts close_sync() may set ->start_next_window to MaxSector without merging next_window_requests into current_window_requests. So those numbers can be wrong.

2/ raise_barrier() cannot detect if there are any pending requests in the region it is about to resync. That can lead to corruption!!

Neither of these explain the hang you are seeing. I note that the "md0-resync" thread isn't listed. I don't suppose you know what it was doing (stack trace)?? Also, had "md: md0: sync done" appeared in syslog yet?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

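Neil's point 1/ is easier to follow next to the window accounting it refers to. The sketch below paraphrases what close_sync() roughly does in this era of drivers/md/raid1.c, with the merge step whose absence he is describing written out; it is illustrative only, reconstructed from his description, and is not the actual patch that was eventually applied.

static void close_sync(struct r1conf *conf)
{
	/* ... barrier handshake and r1buf_pool teardown omitted ... */

	spin_lock_irq(&conf->resync_lock);
	conf->start_next_window = MaxSector;	/* resync finished or was aborted */
	/*
	 * Requests admitted against the "next" window were counted in
	 * next_window_requests.  If they are not folded into
	 * current_window_requests at this point (Neil's point 1/), the two
	 * counters stop matching the I/O actually in flight, and a later
	 * raise_barrier() can end up waiting on numbers that never drain.
	 */
	conf->current_window_requests += conf->next_window_requests;
	conf->next_window_requests = 0;
	spin_unlock_irq(&conf->resync_lock);
}

Point 2/ is separate from the counters: raise_barrier() only knows how many normal requests are pending, not which sectors they cover, so it cannot tell whether any of them overlap the region it is about to resync.
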
end of thread, other threads:[~2014-09-09 1:08 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <D4FE2D75-4208-48C9-A4D0-432F092E5AE9@redhat.com>
2014-09-08 15:52 ` Fwd: raid1 data corruption during resync Brassow Jonathan
2014-09-09 1:08 ` NeilBrown
2014-08-29 19:29 Eivind Sarto
2014-09-02 14:10 ` Brassow Jonathan
2014-09-02 16:43 ` Eivind Sarto
2014-09-02 17:04 ` Eivind Sarto
2014-09-02 16:59 ` Brassow Jonathan
2014-09-02 19:24 ` Brassow Jonathan
2014-09-02 22:07 ` Eivind Sarto
2014-09-02 22:14 ` Brassow Jonathan
2014-09-02 23:55 ` NeilBrown
2014-09-03 0:48 ` Eivind Sarto
2014-09-03 1:18 ` Brassow Jonathan
2014-09-03 1:31 ` NeilBrown
2014-09-03 1:45 ` Brassow Jonathan
2014-09-03 21:39 ` Brassow Jonathan
2014-09-04 5:28 ` NeilBrown