Linux RAID subsystem development
 help / color / mirror / Atom feed
* Re: split scsi passthrough fields out of struct request V2
From: Jens Axboe @ 2017-01-27 16:56 UTC (permalink / raw)
  To: Bart Van Assche, hch@lst.de
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485535925.4267.1.camel@sandisk.com>

On 01/27/2017 09:52 AM, Bart Van Assche wrote:
> On Fri, 2017-01-27 at 01:04 -0700, Jens Axboe wrote:
>> The previous patch had a bug if you didn't use a scheduler, here's a
>> version that should work fine in both cases. I've also updated the
>> above mentioned branch, so feel free to pull that as well and merge to
>> master like before.
> 
> Booting time is back to normal with commit f3a8ab7d55bc merged with
> v4.10-rc5. That's a great improvement. However, running the srp-test
> software triggers now a new complaint:
> 
> [  215.600386] sd 11:0:0:0: [sdh] Attached SCSI disk
> [  215.609485] sd 11:0:0:0: alua: port group 00 state A non-preferred supports TOlUSNA
> [  215.722900] scsi 13:0:0:0: alua: Detached
> [  215.724452] general protection fault: 0000 [#1] SMP
> [  215.724484] Modules linked in: dm_service_time ib_srp scsi_transport_srp target_core_user uio target_core_pscsi target_core_file ib_srpt target_core_iblock target_core_mod brd netconsole xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm msr configfs ib_cm iw_cm mlx4_ib ib_core sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp kvm_intel hid_generic kvm usbhid irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel mlx4_core ghash_clmulni_intel iTCO_wdt d
 cdbas pcbc tg3
> [  215.724629]  iTCO_vendor_support ptp aesni_intel pps_core aes_x86_64 pcspkr crypto_simd libphy ipmi_si glue_helper cryptd ipmi_devintf tpm_tis devlink fjes ipmi_msghandler tpm_tis_core tpm mei_me lpc_ich mei mfd_core button shpchp wmi mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm sr_mod cdrom ehci_pci ehci_hcd usbcore usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua autofs4
> [  215.724719] CPU: 9 PID: 8043 Comm: multipathd Not tainted 4.10.0-rc5-dbg+ #1
> [  215.724748] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.0.2 11/17/2014
> [  215.724775] task: ffff8801717998c0 task.stack: ffffc90002a9c000
> [  215.724804] RIP: 0010:scsi_device_put+0xb/0x30
> [  215.724829] RSP: 0018:ffffc90002a9faa0 EFLAGS: 00010246
> [  215.724855] RAX: 6b6b6b6b6b6b6b6b RBX: ffff88038bf85698 RCX: 0000000000000006
> [  215.724880] RDX: 0000000000000006 RSI: ffff88017179a108 RDI: ffff88038bf85698
> [  215.724906] RBP: ffffc90002a9faa8 R08: ffff880384786008 R09: 0000000100170007
> [  215.724932] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88038bf85698
> [  215.724958] R13: ffff88038919f090 R14: dead000000000100 R15: ffff88038a41dd28
> [  215.724983] FS:  00007fbf8c6cf700(0000) GS:ffff88046f440000(0000) knlGS:0000000000000000
> [  215.725010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  215.725035] CR2: 00007f1262ef3ee0 CR3: 000000044f6cc000 CR4: 00000000001406e0
> [  215.725060] Call Trace:
> [  215.725086]  scsi_disk_put+0x2d/0x40
> [  215.725110]  sd_release+0x3d/0xb0
> [  215.725137]  __blkdev_put+0x29e/0x360
> [  215.725163]  blkdev_put+0x49/0x170
> [  215.725192]  dm_put_table_device+0x58/0xc0 [dm_mod]
> [  215.725219]  dm_put_device+0x70/0xc0 [dm_mod]
> [  215.725269]  free_priority_group+0x92/0xc0 [dm_multipath]
> [  215.725295]  free_multipath+0x70/0xc0 [dm_multipath]
> [  215.725320]  multipath_dtr+0x19/0x20 [dm_multipath]
> [  215.725348]  dm_table_destroy+0x67/0x120 [dm_mod]
> [  215.725379]  dev_suspend+0xde/0x240 [dm_mod]
> [  215.725434]  ctl_ioctl+0x1f5/0x520 [dm_mod]
> [  215.725489]  dm_ctl_ioctl+0xe/0x20 [dm_mod]
> [  215.725515]  do_vfs_ioctl+0x8f/0x700
> [  215.725589]  SyS_ioctl+0x3c/0x70
> [  215.725614]  entry_SYSCALL_64_fastpath+0x18/0xad
> [  215.725641] RIP: 0033:0x7fbf8aca0667
> [  215.725665] RSP: 002b:00007fbf8c6cd668 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  215.725692] RAX: ffffffffffffffda RBX: 0000000000000046 RCX: 00007fbf8aca0667
> [  215.725716] RDX: 00007fbf8006b940 RSI: 00000000c138fd06 RDI: 0000000000000007
> [  215.725743] RBP: 0000000000000009 R08: 00007fbf8c6cb3c0 R09: 00007fbf8b68d8d8
> [  215.725768] R10: 0000000000000075 R11: 0000000000000246 R12: 00007fbf8c6cd770
> [  215.725793] R13: 0000000000000013 R14: 00000000006168f0 R15: 0000000000f74780
> [  215.725820] Code: bc 24 b8 00 00 00 e8 55 c8 1c 00 48 83 c4 08 48 89 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 00 55 48 89 e5 53 48 8b 07 48 89 fb <48> 8b 80 a8 01 00 00 48 8b 38 e8 f6 68 c5 ff 48 8d bb 38 02 00 
> [  215.725903] RIP: scsi_device_put+0xb/0x30 RSP: ffffc90002a9faa0
> 
> (gdb) list *(scsi_device_put+0xb)
> 0xffffffff8149fc2b is in scsi_device_put (drivers/scsi/scsi.c:957).
> 952      * count of the underlying LLDD module.  The device is freed once the last
> 953      * user vanishes.
> 954      */
> 955     void scsi_device_put(struct scsi_device *sdev)
> 956     {
> 957             module_put(sdev->host->hostt->module);
> 958             put_device(&sdev->sdev_gendev);
> 959     }
> 960     EXPORT_SYMBOL(scsi_device_put);
> 961
> (gdb) disas scsi_device_put
> Dump of assembler code for function scsi_device_put:
>    0xffffffff8149fc20 <+0>:     push   %rbp
>    0xffffffff8149fc21 <+1>:     mov    %rsp,%rbp
>    0xffffffff8149fc24 <+4>:     push   %rbx
>    0xffffffff8149fc25 <+5>:     mov    (%rdi),%rax
>    0xffffffff8149fc28 <+8>:     mov    %rdi,%rbx
>    0xffffffff8149fc2b <+11>:    mov    0x1a8(%rax),%rax
>    0xffffffff8149fc32 <+18>:    mov    (%rax),%rdi
>    0xffffffff8149fc35 <+21>:    callq  0xffffffff810f6530 <module_put>
>    0xffffffff8149fc3a <+26>:    lea    0x238(%rbx),%rdi
>    0xffffffff8149fc41 <+33>:    callq  0xffffffff814714b0 <put_device>
>    0xffffffff8149fc46 <+38>:    pop    %rbx
>    0xffffffff8149fc47 <+39>:    pop    %rbp
>    0xffffffff8149fc48 <+40>:    retq    
> End of assembler dump.
> (gdb) print &((struct Scsi_Host *)0)->hostt  
> $2 = (struct scsi_host_template **) 0x1a8 <irq_stack_union+424>
> 
> Apparently scsi_device_put() was called for a SCSI device that was already
> freed (memory poisoning was enabled in my test). This is something I had
> not yet seen before.

I have no idea what this is, I haven't messed with life time or devices
or queues at all in that branch.

-- 
Jens Axboe

^ permalink raw reply

* Re: split scsi passthrough fields out of struct request V2
From: Jens Axboe @ 2017-01-27 16:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-raid, Mike Snitzer, linux-scsi, linux-block, dm-devel,
	Junichi Nomura
In-Reply-To: <20170127164233.GA17438@lst.de>

On 01/27/2017 09:42 AM, Christoph Hellwig wrote:
> On Fri, Jan 27, 2017 at 09:38:40AM -0700, Jens Axboe wrote:
>>> Ok, I'll repost what I have right now, which is on top of a merge
>>> of your block/for-4.11/next and your for-next from this morning
>>> my time.
>>
>> Perfect.
> 
> At least I tried, looks like the mail server is overloaded and crapped
> out three mails into it.  For now there is a git tree here:
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-pc-refactor

I grabbed it all from there. for-4.11/rq-refactor has been rebased to v3.
Basic testing looks fine here, at least on v2. I'll repeat the same and
then merge it into for-next as well.

-- 
Jens Axboe

^ permalink raw reply

* Re: [dm-devel] split scsi passthrough fields out of struct request V2
From: Bart Van Assche @ 2017-01-27 17:02 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-scsi@vger.kernel.org, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, linux-block@vger.kernel.org,
	snitzer@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <ecca734c-7a9c-f060-9335-c2de089849c2@fb.com>

On Thu, 2017-01-26 at 18:22 -0700, Jens Axboe wrote:
> What's your boot device? I've been booting this on a variety of setups,
> no problems observed. It's booting my laptop, and on SCSI and SATA as
> well. What is your root drive? What is the queue depth of it?
> Controller?

The boot device in my test setup is a SATA hard disk:

# cat /proc/cmdline  
BOOT_IMAGE=/boot/vmlinuz-4.10.0-rc5-dbg+ root=UUID=60a4b064-b3ef-4d28-96d3-3c13ecbec43e resume=/dev/sda2 showopts
# ls -l /dev/disk/by-uuid/60a4b064-b3ef-4d28-96d3-3c13ecbec43e
lrwxrwxrwx 1 root root 10 Jan 27 08:43 /dev/disk/by-uuid/60a4b064-b3ef-4d28-96d3-3c13ecbec43e -> ../../sda1
# cat /sys/block/sda/queue/nr_requests  
31
# lsscsi | grep sda
[0:0:0:0]    disk    ATA      ST1000NM0033-9ZM GA67  /dev/sda
# hdparm -i /dev/sda

/dev/sda:
 Model=ST1000NM0033-9ZM173, FwRev=GA67, SerialNo=Z1W2HM75
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4  
 DMA modes:  mdma0 mdma1 mdma2  
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6  
 AdvancedPM=no WriteCache=disabled
 Drive conforms to: Unspecified:  ATA/ATAPI-4,5,6,7

 * signifies the current active mode

Bart.

^ permalink raw reply

* Re: split scsi passthrough fields out of struct request V2
From: Bart Van Assche @ 2017-01-27 17:03 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <2c696943-2a44-4f36-f0f8-0bebceb95a4a@fb.com>

On Fri, 2017-01-27 at 09:56 -0700, Jens Axboe wrote:
> I have no idea what this is, I haven't messed with life time or devices
> or queues at all in that branch.

The srp-test software passes with kernel v4.9. Something must have changed.
I'll see whether I can find some time to look further into this.

Bart.

^ permalink raw reply

* Re: [PATCH 05/18] block: allow specifying size for extra command data
From: Bart Van Assche @ 2017-01-27 17:21 UTC (permalink / raw)
  To: hch@lst.de, martin.petersen@oracle.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, axboe@fb.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <20170127161254.GA16557@lst.de>

On Fri, 2017-01-27 at 17:12 +0100, Christoph Hellwig wrote:
> On Wed, Jan 25, 2017 at 10:15:55PM -0500, Martin K. Petersen wrote:
> > +static void *alloc_request_size(gfp_t gfp_mask, void *data)
> > 
> > I like alloc_request_simple() but alloc_request_size() seems a bit
> > contrived. _reserve? _extra? _special? Don't have any good suggestions,
> > I'm afraid.
> 
> Not that I'm a fan of _size, but I like the other suggestions even less.

Hello Christoph and Martin,

How about using the function names alloc_full_request() / free_full_request()
together with a comment that mentions that cmd_size is set by the LLD?

Bart.

^ permalink raw reply

* Re: [PATCH 05/18] block: allow specifying size for extra command data
From: Jens Axboe @ 2017-01-27 17:26 UTC (permalink / raw)
  To: Bart Van Assche, hch@lst.de, martin.petersen@oracle.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485537668.4267.7.camel@sandisk.com>

On 01/27/2017 10:21 AM, Bart Van Assche wrote:
> On Fri, 2017-01-27 at 17:12 +0100, Christoph Hellwig wrote:
>> On Wed, Jan 25, 2017 at 10:15:55PM -0500, Martin K. Petersen wrote:
>>> +static void *alloc_request_size(gfp_t gfp_mask, void *data)
>>>
>>> I like alloc_request_simple() but alloc_request_size() seems a bit
>>> contrived. _reserve? _extra? _special? Don't have any good suggestions,
>>> I'm afraid.
>>
>> Not that I'm a fan of _size, but I like the other suggestions even less.
> 
> Hello Christoph and Martin,
> 
> How about using the function names alloc_full_request() / free_full_request()
> together with a comment that mentions that cmd_size is set by the LLD?

Since we use pdu in other places, how about alloc_request_pdu() or
alloc_request_with_pdu()?

And since it's all queued up, any bike shedding changes will have to be
incremental.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH v2] md linear: fix a race between linear_add() and linear_congested()
From: colyli @ 2017-01-27 17:30 UTC (permalink / raw)
  To: linux-raid; +Cc: Coly Li, Shaohua Li, Neil Brown, stable

Recently I receie a report that on Linux v3.0 based kerenl, hot add disk
to a md linear device causes kernel crash at linear_congested(). From the
crash image analysis, I find in linear_congested(), mddev->raid_disks
contains value N, but conf->disks[] only has N-1 pointers available. Then
a pointer deference to a NULL pointer crashes the kernel.

There is a race between linear_add() and linear_congested(), RCU stuffs
used in these two functions cannot avoid the race. Since Linuv v4.0
RCU code is replaced by introducing mddev_suspend().  After checking the
upstream code, it seems linear_congested() is not called in
generic_make_request() code patch, so mddev_suspend() cannot provent it
from being called. The possible race still exists.

Here I explain how the race still exists in current code.  For a machine
has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
md linear device; at the same time on other CPU, linear_congested() is
called to detect whether this md linear device is congested before issuing
an I/O request onto it.

Now I use a possible code execution time sequence to demo how the possible
race happens,

seq    linear_add()                linear_congested()
 0                                 conf=mddev->private
 1   oldconf=mddev->private
 2   mddev->raid_disks++
 3                              for (i=0; i<mddev->raid_disks;i++)
 4                                bdev_get_queue(conf->disks[i].rdev->bdev)
 5   mddev->private=newconf

In linear_add() mddev->raid_disks is increased in time seq 2, and on
another CPU in linear_congested() the for-loop iterates conf->disks[i] by
the increased mddev->raid_disks in time seq 3,4. But conf with one more
element (which is a pointer to struct dev_info type) to conf->disks[] is
not updated yet, accessing its structure member in time seq 4 will cause a
NULL pointer deference fault.

The fix includes 2 parts of modification,
 1) In linear_add(), update mddev->private with new value before
    increasing mddev->raid_disks, and to make sure on other CPUs their are
    seen to be updated in same order as linear_add() does (otherwise the
    race may still happen), a smp_mb() is necessary.
 2) RCU stuffs are back, to make sure in linear_add() the oldconf won't be
    destoried when it is still referenced in linear_congested().

A question is, by this fix, if mddev->private is update to new value in
linear_add(), but in linear_congested() the for-loop still tests old value
of mddev->raid_disks, then the iteration will miss the last element of
conf->disks[]. My answer is don't worry it, it's OK. the reasons are,
 - When updating mddev->private, the md linear device is suspend, no I/O
   may happen, it is safe to missing congestion status of the last
   new-added hard disk.
 - In the worst case linear_congested() returns 0 and I/O sent to this md
   linear device, but the new added hard disk is congested, then the I/O
   request will be blocked for a while if it just happenly hits the new
   added hard disk. linear_congested() is in code path of wb_congested(),
   which is quite hot in write back code path. Comparing to add locking
   code in linear_congested(), the cost of the worst case is acceptable.

The bug is reported on Linux v3.0 based kernel, it can and should be
applied to all kernels since Linux v3.0. I see linear_add() is merged into
mainline since Linux v2.6.18, maybe stable kernel maintainers after this
version may consider to pick this fix as well.

Changelog:
 - v2: add RCU stuffs by suggestion from Shaohua and Neil.
 - v1: initial effort.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.com>
Cc: stable@vger.kernel.org
---
 drivers/md/linear.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index 5975c99..4f1690c 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -58,13 +58,15 @@ static int linear_congested(struct mddev *mddev, int bits)
 	struct linear_conf *conf;
 	int i, ret = 0;
 
-	conf = mddev->private;
+	rcu_read_lock();
+	conf = rcu_dereference(mddev->private);
 
 	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
 		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
 		ret |= bdi_congested(&q->backing_dev_info, bits);
 	}
 
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -173,6 +175,13 @@ static int linear_run (struct mddev *mddev)
 	return ret;
 }
 
+static void free_conf(struct rcu_head *head)
+{
+	struct linear_conf *conf =
+			container_of(head, struct linear_conf, rcu);
+	kfree(conf);
+}
+
 static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
 {
 	/* Adding a drive to a linear array allows the array to grow.
@@ -196,15 +205,27 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
 	if (!newconf)
 		return -ENOMEM;
 
+	/* In linear_congested(), mddev->raid_disks and mddev->private
+	 * are accessed without protection by mddev_suspend(). If on
+	 * another CPU,  in linear_congested() mddev->private is still seen
+	 * to contains old value but mddev->raid_disks is seen to have the
+	 * increased value, the last iteration to conf->disks[i].rdev will
+	 * trigger a NULL pointer deference. To avoid this race, here
+	 * mddev->private must be updated before increasing
+	 * mddev->raid_disks, and a smp_mb() is required between them. Then
+	 * in linear_congested(), we are sure the updated mddev->private is
+	 * seen when iterating conf->disks[i].
+	 */
 	mddev_suspend(mddev);
-	oldconf = mddev->private;
+	oldconf = rcu_dereference(mddev->private);
+	rcu_assign_pointer(mddev->private, newconf);
+	smp_mb();
 	mddev->raid_disks++;
-	mddev->private = newconf;
 	md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
 	set_capacity(mddev->gendisk, mddev->array_sectors);
 	mddev_resume(mddev);
 	revalidate_disk(mddev->gendisk);
-	kfree(oldconf);
+	call_rcu(&oldconf->rcu, free_conf);
 	return 0;
 }
 

^ permalink raw reply related

* Re: [PATCH 05/18] block: allow specifying size for extra command data
From: Bart Van Assche @ 2017-01-27 17:30 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com, martin.petersen@oracle.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <05d94df5-0ed8-5c3a-d6a6-7e9418366bb1@fb.com>

On Fri, 2017-01-27 at 10:26 -0700, Jens Axboe wrote:
> On 01/27/2017 10:21 AM, Bart Van Assche wrote:
> > On Fri, 2017-01-27 at 17:12 +0100, Christoph Hellwig wrote:
> > > On Wed, Jan 25, 2017 at 10:15:55PM -0500, Martin K. Petersen wrote:
> > > > +static void *alloc_request_size(gfp_t gfp_mask, void *data)
> > > > 
> > > > I like alloc_request_simple() but alloc_request_size() seems a bit
> > > > contrived. _reserve? _extra? _special? Don't have any good suggestions,
> > > > I'm afraid.
> > > 
> > > Not that I'm a fan of _size, but I like the other suggestions even less.
> > 
> > Hello Christoph and Martin,
> > 
> > How about using the function names alloc_full_request() / free_full_request()
> > together with a comment that mentions that cmd_size is set by the LLD?
> 
> Since we use pdu in other places, how about alloc_request_pdu() or
> alloc_request_with_pdu()?
> 
> And since it's all queued up, any bike shedding changes will have to be
> incremental.

Hello Jens,

Other Linux subsystems use the term "private data" instead of PDU. How about
modifying the block layer such that it uses the same terminology? I'm
referring to function names like blk_mq_rq_from_pdu() and blk_mq_rq_to_pdu()

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH] md linear: fix a race between linear_add() and linear_congested()
From: Coly Li @ 2017-01-27 17:32 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Neil Brown, stable
In-Reply-To: <20170125180258.fil35zckwtebysqr@kernel.org>

On 2017/1/26 上午2:02, Shaohua Li wrote:
> On Wed, Jan 25, 2017 at 07:15:43PM +0800, colyli@suse.de wrote:
>> Recently I receie a report that on Linux v3.0 based kerenl, hot add disk
>> to a md linear device causes kernel crash at linear_congested(). From the
>> crash image analysis, I find in linear_congested(), mddev->raid_disks
>> contains value N, but conf->disks[] only has N-1 pointers available. Then
>> a pointer deference to a NULL pointer crashes the kernel.
>>
>> There is a race between linear_add() and linear_congested(), RCU stuffs
>> used in these two functions cannot avoid the race. Since Linuv v4.0
>> RCU code is replaced by introducing mddev_suspend().  After checking the
>> upstream code, it seems linear_congested() is not called in
>> generic_make_request() code patch, so mddev_suspend() cannot provent it
>> from being called. The possible race still exists.
>>
>> Here I explain how the race still exists in current code.  For a machine
>> has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
>> md linear device; at the same time on other CPU, linear_congested() is
>> called to detect whether this md linear device is congested before issuing
>> an I/O request onto it.
>>
>> Now I use a possible code execution time sequence to demo how the possible
>> race happens, 
>>
>> seq    linear_add()                linear_congested()
>>  0                                 conf=mddev->private
>>  1   oldconf=mddev->private
>>  2   mddev->raid_disks++
>>  3                              for (i=0; i<mddev->raid_disks;i++)
>>  4                                bdev_get_queue(conf->disks[i].rdev->bdev)
>>  5   mddev->private=newconf
> 
> Good catch, this makes a lot of sense. However, this looks like an incomplete
> fix. step 0 will get the old conf, after step 5, linear_add will free the old
> conf. So it's possible linear_congested() will use the freed old conf. I think
> this is more likely to happen. The easist fix maybe put rcu_lock in
> linear_congested and free the old conf in a rcu callback.

Yes, RCU is still necessary here, I just compose and send out the second
version.

Thanks for pointing out this :-)

Coly

^ permalink raw reply

* Re: [PATCH 06/18] dm: remove incomple BLOCK_PC support
From: Bart Van Assche @ 2017-01-27 17:32 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485365126-23210-7-git-send-email-hch@lst.de>

On Wed, 2017-01-25 at 18:25 +0100, Christoph Hellwig wrote:
> DM tries to copy a few fields around for BLOCK_PC requests, but given
> that no dm-target ever wires up scsi_cmd_ioctl BLOCK_PC can't actually
> be sent to dm.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply

* Re: [PATCH 05/18] block: allow specifying size for extra command data
From: Jens Axboe @ 2017-01-27 17:33 UTC (permalink / raw)
  To: Bart Van Assche, hch@lst.de, martin.petersen@oracle.com
  Cc: linux-scsi@vger.kernel.org, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, linux-block@vger.kernel.org,
	snitzer@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485538224.4267.9.camel@sandisk.com>

On 01/27/2017 10:30 AM, Bart Van Assche wrote:
> On Fri, 2017-01-27 at 10:26 -0700, Jens Axboe wrote:
>> On 01/27/2017 10:21 AM, Bart Van Assche wrote:
>>> On Fri, 2017-01-27 at 17:12 +0100, Christoph Hellwig wrote:
>>>> On Wed, Jan 25, 2017 at 10:15:55PM -0500, Martin K. Petersen wrote:
>>>>> +static void *alloc_request_size(gfp_t gfp_mask, void *data)
>>>>>
>>>>> I like alloc_request_simple() but alloc_request_size() seems a bit
>>>>> contrived. _reserve? _extra? _special? Don't have any good suggestions,
>>>>> I'm afraid.
>>>>
>>>> Not that I'm a fan of _size, but I like the other suggestions even less.
>>>
>>> Hello Christoph and Martin,
>>>
>>> How about using the function names alloc_full_request() / free_full_request()
>>> together with a comment that mentions that cmd_size is set by the LLD?
>>
>> Since we use pdu in other places, how about alloc_request_pdu() or
>> alloc_request_with_pdu()?
>>
>> And since it's all queued up, any bike shedding changes will have to be
>> incremental.
> 
> Hello Jens,
> 
> Other Linux subsystems use the term "private data" instead of PDU. How about
> modifying the block layer such that it uses the same terminology? I'm
> referring to function names like blk_mq_rq_from_pdu() and blk_mq_rq_to_pdu()

It's been pdu since it was introduced in 3.13, I really don't see a good
reason to change it. At least pdu or payload means something, where as
private is just... Well, not a big fan.

-- 
Jens Axboe


^ permalink raw reply

* Re: [PATCH 11/18] scsi: remove gfp_flags member in scsi_host_cmd_pool
From: Bart Van Assche @ 2017-01-27 17:38 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485365126-23210-12-git-send-email-hch@lst.de>

On Wed, 2017-01-25 at 18:25 +0100, Christoph Hellwig wrote:
> When using the slab allocator we already decide at cache creation time if
> an allocation comes from a GFP_DMA pool using the SLAB_CACHE_DMA flag,
> and there is no point passing the kmalloc-family only GFP_DMA flag to
> kmem_cache_alloc.  Drop all the infrastructure for doing so.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply

* Re: [PATCH 12/18] scsi: respect unchecked_isa_dma for blk-mq
From: Bart Van Assche @ 2017-01-27 17:45 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-scsi@vger.kernel.org, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, linux-block@vger.kernel.org,
	snitzer@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485365126-23210-13-git-send-email-hch@lst.de>

On Wed, 2017-01-25 at 18:25 +0100, Christoph Hellwig wrote:
> Currently blk-mq always allocates the sense buffer using normal GFP_KERNEL
> allocation.  Refactor the cmd pool code to split the cmd and sense allocation
> and share the code to allocate the sense buffers as well as the sense buffer
> slab caches between the legacy and blk-mq path.
> 
> Note that this switches to lazy allocation of the sense slab caches - the
> slab caches (not the actual allocations) won't be destroy until the scsi
> module is unloaded instead of keeping track of hosts using them.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply

* Re: [PATCH] md linear: fix a race between linear_add() and linear_congested()
From: Coly Li @ 2017-01-27 17:45 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, Shaohua Li, stable
In-Reply-To: <8737g6vg7r.fsf@notabene.neil.brown.name>

On 2017/1/26 上午8:04, NeilBrown wrote:
> On Wed, Jan 25 2017, Shaohua Li wrote:
> 
>> On Wed, Jan 25, 2017 at 07:15:43PM +0800, colyli@suse.de wrote:
>>> Recently I receie a report that on Linux v3.0 based kerenl, hot
>>> add disk to a md linear device causes kernel crash at
>>> linear_congested(). From the crash image analysis, I find in
>>> linear_congested(), mddev->raid_disks contains value N, but
>>> conf->disks[] only has N-1 pointers available. Then a pointer
>>> deference to a NULL pointer crashes the kernel.
>>> 
>>> There is a race between linear_add() and linear_congested(),
>>> RCU stuffs used in these two functions cannot avoid the race.
>>> Since Linuv v4.0 RCU code is replaced by introducing
>>> mddev_suspend().  After checking the upstream code, it seems
>>> linear_congested() is not called in generic_make_request() code
>>> patch, so mddev_suspend() cannot provent it from being called.
>>> The possible race still exists.
>>> 
>>> Here I explain how the race still exists in current code.  For
>>> a machine has many CPUs, on one CPU, linear_add() is called to
>>> add a hard disk to a md linear device; at the same time on
>>> other CPU, linear_congested() is called to detect whether this
>>> md linear device is congested before issuing an I/O request
>>> onto it.
>>> 
>>> Now I use a possible code execution time sequence to demo how
>>> the possible race happens,
>>> 
>>> seq    linear_add()                linear_congested() 0
>>> conf=mddev->private 1   oldconf=mddev->private 2
>>> mddev->raid_disks++ 3                              for (i=0;
>>> i<mddev->raid_disks;i++) 4
>>> bdev_get_queue(conf->disks[i].rdev->bdev) 5
>>> mddev->private=newconf
>> 
>> Good catch, this makes a lot of sense. However, this looks like
>> an incomplete fix. step 0 will get the old conf, after step 5,
>> linear_add will free the old conf. So it's possible
>> linear_congested() will use the freed old conf. I think this is
>> more likely to happen. The easist fix maybe put rcu_lock in 
>> linear_congested and free the old conf in a rcu callback.
> 
> We used to use kfree_rcu() but removed it in
> 
> Commit: 3be260cc18f8 ("md/linear: remove rcu protections in favour
> of suspend/resume")
> 
> when we changed to suspend/resume the device.  That stops all IO,
> but doesn't stop the ->congested call.
> 
> So we probably should re-introduce kfree_rcu() to free oldconf. It
> might also be good to store a copy of raid_disks in linear_conf,
> like we do with r5conf, the ensure we never us inconsistent 
> ->raid_disks and ->disks[]

Hi Neil,

I just send out v2 patch which adds RCU stuffs back. I test it on my
small server, it survives.

Once thing I want to confirm here is the memory barrier in linear_add().

219         mddev_suspend(mddev);
220         oldconf = rcu_dereference(mddev->private);
221         rcu_assign_pointer(mddev->private, newconf);
222         smp_mb();
223         mddev->raid_disks++;
224         md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
225         set_capacity(mddev->gendisk, mddev->array_sectors);
226         mddev_resume(mddev);
227         revalidate_disk(mddev->gendisk);
228         call_rcu(&oldconf->rcu, free_conf);

At LINE 222, I add a smp_mb(), from Documentations/memory-barrier.txt,
my understand is here I need a smp_wmb() or smp_mb(). I see other
places all use smp_mb() so I choose the stronger one -- smp_mb().

But from Documentation/whatisRCU.txt, it says about
rcu_assign_pointer(): "This function returns he new value, and also
executes any memory-barrier instructions required for a given CPU
architecture." So it seems smp_mb() at LINE 222 is unnecessary.

In v2 patch, I keep smp_mb() although I think it is unnecessary. I
will remove it if you or Shaohua may confirm it is unncessary as I think.


Another question is, I try to look at the code about r5conf, but I
still have no idea how to store a copy of r5conf. Could you please to
give me more hint ?

Thanks.

Coly

^ permalink raw reply

* Re: [PATCH 13/18] scsi: remove scsi_cmd_dma_pool
From: Bart Van Assche @ 2017-01-27 17:51 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-scsi@vger.kernel.org, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, linux-block@vger.kernel.org,
	snitzer@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485365126-23210-14-git-send-email-hch@lst.de>

On Wed, 2017-01-25 at 18:25 +0100, Christoph Hellwig wrote:
> There is no need for GFP_DMA allocations of the scsi_cmnd structures
> themselves, all that might be DMAed to or from is the actual payload,
> or the sense buffers.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply

* Re: [PATCH 14/18] scsi: remove __scsi_alloc_queue
From: Bart Van Assche @ 2017-01-27 17:58 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485365126-23210-15-git-send-email-hch@lst.de>

On Wed, 2017-01-25 at 18:25 +0100, Christoph Hellwig wrote:
> diff --git a/include/scsi/scsi_transport.h b/include/scsi/scsi_transport.h
> index 8129239..b6e07b5 100644
> --- a/include/scsi/scsi_transport.h
> +++ b/include/scsi/scsi_transport.h
> @@ -119,4 +119,6 @@ scsi_transport_device_data(struct scsi_device *sdev)
>  		+ shost->transportt->device_private_offset;
>  }
>  
> +void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q);
> +
>  #endif /* SCSI_TRANSPORT_H */

Hello Christoph,

Since __scsi_init_queue() modifies data in the Scsi_Host structure, have you
considered to add the declaration for this function to <scsi/scsi_host.h>?
If you want to keep this declaration in <scsi/scsi_transport.h> please add a
direct include of that header file to drivers/scsi/scsi_lib.c such that the
declaration remains visible to the compiler if someone would minimize the
number of #include directives in SCSI header files.

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH 15/18] scsi: allocate scsi_cmnd structures as part of struct request
From: Bart Van Assche @ 2017-01-27 18:39 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485365126-23210-16-git-send-email-hch@lst.de>

On Wed, 2017-01-25 at 18:25 +0100, Christoph Hellwig wrote:
> -unsigned char *scsi_alloc_sense_buffer(struct Scsi_Host *shost, gfp_t gfp_mask,
> -               int numa_node)
> +static unsigned char *scsi_alloc_sense_buffer(struct Scsi_Host *shost,
> +       gfp_t gfp_mask, int numa_node)
>  {
>         return kmem_cache_alloc_node(scsi_select_sense_cache(shost), gfp_mask,
>                         numa_node);
> @@ -697,14 +696,13 @@ static bool scsi_end_request(struct request *req, int error,
>  
>                 if (bidi_bytes)
>                         scsi_release_bidi_buffers(cmd);
> +               scsi_release_buffers(cmd);
> +               scsi_put_command(cmd);
>  
>                 spin_lock_irqsave(q->queue_lock, flags);
>                 blk_finish_request(req, error);
>                 spin_unlock_irqrestore(q->queue_lock, flags);
>  
> -               scsi_release_buffers(cmd);
> -
> -               scsi_put_command(cmd);
>                 scsi_run_queue(q);
>         }

Hello Christoph,

Why have the scsi_release_buffers() and scsi_put_command(cmd) calls been
moved up? I haven't found an explanation for this change in the patch
description.

Please also consider to remove the cmd->request->special = NULL assignments
via this patch. Since this patch makes the lifetime of struct scsi_cmnd and
struct request identical these assignments are no longer needed.

This patch introduces the function scsi_exit_rq(). Having two functions
for the single-queue path that release resources (scsi_release_buffers()
and scsi_exit_rq()) is confusing. Since every scsi_release_buffers() call
is followed by a blk_unprep_request() call, have you considered to move
the scsi_release_buffers() call into scsi_unprep_fn() via an additional
patch?

Thanks,

Bart.

^ permalink raw reply

* Re: [PATCH 16/18] block/bsg: move queue creation into bsg_setup_queue
From: Bart Van Assche @ 2017-01-27 18:48 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485365126-23210-17-git-send-email-hch@lst.de>

On Wed, 2017-01-25 at 18:25 +0100, Christoph Hellwig wrote:
> Simply the boilerplate code needed for bsg nodes a bit.

Did you perhaps mean "Simplify"? Anyway, nice work!

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

^ permalink raw reply

* Re: split scsi passthrough fields out of struct request V3
From: Bart Van Assche @ 2017-01-27 18:58 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485534918-18239-1-git-send-email-hch@lst.de>

On Fri, 2017-01-27 at 17:34 +0100, Christoph Hellwig wrote:
> this series splits the support for SCSI passthrough commands from the
> main struct request used all over the block layer into a separate
> scsi_request structure that drivers that want to support SCSI passthough
> need to embedded as the first thing into their request-private data,
> similar to how we handle NVMe passthrough commands.
> 
> To support this I've added support for that the private data after
> request structure to the legacy request path instead, so that it can
> be treated the same way as the blk-mq path.  Compare to the current
> scsi_cmnd allocator that actually is a major simplification.
> 
> Changes since V2:
>  - remove req->cmd tracing
>  - minor spelling fixes
> 
> Changes since V1:
>  - fix handling of a NULL sense pointer in __scsi_execute
>  - clean up handling of the flush flags in the block layer and MD
>  - additional small cleanup in dm-rq

Hello Christoph,

Version 3 of the patch with title "block: split scsi_request out of
struct request" (commit 3c30af6ebe12) differs significantly from v2
of that patch that has been posted on several mailing lists. E.g. v2
moves __cmd[], cmd and cmd_len from struct request into struct
scsi_request but v3 not. Which version do you want us to review?

Thanks,

Bart.

^ permalink raw reply

* Re: Errorneous detection of degraded array
From: Luke Pyzowski @ 2017-01-27 19:44 UTC (permalink / raw)
  To: 'Andrei Borzenkov',
	'systemd-devel@lists.freedesktop.org',
	linux-raid@vger.kernel.org
In-Reply-To: <b842c2ce-7d32-a392-e9d9-9f330fa6a7cf@gmail.com>

I've modified a number of settings to try to resolve this, so far no success.
I've created an explicit mount file for the RAID array: /etc/systemd/system/share.mount
Inside there I've experimented with TimeoutSec=

In /etc/systemd/system/mdadm-last-resort@.timer I've worked with
OnActiveSec=

I've also tried (without an explicit mount file) to add x-systemd.device-timeout to /etc/fstab for the mount. 

Here's a few more system logs showing perhaps more detail. I've edited them to show only relevant details, full pastebin of logs: http://pastebin.com/sL8nKt7j
These logs were generated with TimeoutSec=120 in /etc/systemd/system/share.mount the description of the mount in the logs is: "Mount /share RAID partition explicitly"
And OnActiveSec=30 in /etc/systemd/system/mdadm-last-resort@.timer
From blkid:
/dev/md0: UUID="2b9114be-3d5a-41d7-8d4b-e5047d223129" TYPE="ext4"
/dev/md0 is the /share partition.

From /etc/mdadm.conf:
ARRAY /dev/md/0  metadata=1.2 UUID=97566d2f:ae7a169b:966f5840:3e8267f9 name=lnxnfs01:0

Boot begins at Jan 27 11:33:10
+4 seconds from boot:
Jan 27 11:33:14 lnxnfs01 systemd[1]: Found device /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959.
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdc operational as raid disk 2
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdx operational as raid disk 23
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdu operational as raid disk 20
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdt operational as raid disk 19
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdo operational as raid disk 14
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdn operational as raid disk 13
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdd operational as raid disk 3
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdv operational as raid disk 21
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sda operational as raid disk 0
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdf operational as raid disk 5
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdm operational as raid disk 12
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sde operational as raid disk 4
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdp operational as raid disk 15
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdi operational as raid disk 8
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdl operational as raid disk 11
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdk operational as raid disk 10
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sds operational as raid disk 18
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdb operational as raid disk 1
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdj operational as raid disk 9
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdg operational as raid disk 6
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdr operational as raid disk 17
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdh operational as raid disk 7
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdq operational as raid disk 16
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: device sdw operational as raid disk 22
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: allocated 25534kB
Jan 27 11:33:14 lnxnfs01 kernel: md/raid:md0: raid level 6 active with 24 out of 24 devices, algorithm 2
Jan 27 11:33:14 lnxnfs01 kernel: RAID conf printout:
Jan 27 11:33:14 lnxnfs01 kernel:  --- level:6 rd:24 wd:24
Jan 27 11:33:14 lnxnfs01 kernel:  disk 0, o:1, dev:sda
Jan 27 11:33:14 lnxnfs01 kernel:  disk 1, o:1, dev:sdb
Jan 27 11:33:14 lnxnfs01 kernel:  disk 2, o:1, dev:sdc
Jan 27 11:33:14 lnxnfs01 kernel:  disk 3, o:1, dev:sdd
Jan 27 11:33:14 lnxnfs01 kernel:  disk 4, o:1, dev:sde
Jan 27 11:33:14 lnxnfs01 kernel:  disk 5, o:1, dev:sdf
Jan 27 11:33:14 lnxnfs01 kernel:  disk 6, o:1, dev:sdg
Jan 27 11:33:14 lnxnfs01 kernel:  disk 7, o:1, dev:sdh
Jan 27 11:33:14 lnxnfs01 kernel:  disk 8, o:1, dev:sdi
Jan 27 11:33:14 lnxnfs01 kernel:  disk 9, o:1, dev:sdj
Jan 27 11:33:14 lnxnfs01 kernel:  disk 10, o:1, dev:sdk
Jan 27 11:33:14 lnxnfs01 kernel:  disk 11, o:1, dev:sdl
Jan 27 11:33:14 lnxnfs01 kernel:  disk 12, o:1, dev:sdm
Jan 27 11:33:14 lnxnfs01 kernel:  disk 13, o:1, dev:sdn
Jan 27 11:33:14 lnxnfs01 kernel:  disk 14, o:1, dev:sdo
Jan 27 11:33:14 lnxnfs01 kernel:  disk 15, o:1, dev:sdp
Jan 27 11:33:14 lnxnfs01 kernel:  disk 16, o:1, dev:sdq
Jan 27 11:33:14 lnxnfs01 kernel:  disk 17, o:1, dev:sdr
Jan 27 11:33:14 lnxnfs01 kernel:  disk 18, o:1, dev:sds
Jan 27 11:33:14 lnxnfs01 kernel:  disk 19, o:1, dev:sdt
Jan 27 11:33:14 lnxnfs01 kernel:  disk 20, o:1, dev:sdu
Jan 27 11:33:14 lnxnfs01 kernel:  disk 21, o:1, dev:sdv
Jan 27 11:33:14 lnxnfs01 kernel:  disk 22, o:1, dev:sdw
Jan 27 11:33:14 lnxnfs01 kernel:  disk 23, o:1, dev:sdx
Jan 27 11:33:14 lnxnfs01 kernel: md0: detected capacity change from 0 to 45062020923392
Jan 27 11:33:14 lnxnfs01 systemd[1]: Found device /dev/disk/by-uuid/2b9114be-3d5a-41d7-8d4b-e5047d223129.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started udev Wait for Complete Device Initialization.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started Timer to wait for more drives before activating degraded array..
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Timer to wait for more drives before activating degraded array..
Jan 27 11:33:14 lnxnfs01 systemd[1]: Created slice system-lvm2\x2dpvscan.slice.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting system-lvm2\x2dpvscan.slice.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting LVM2 PV scan on device 9:127...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Activation of DM RAID sets...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounting Mount /share RAID partition explicitly...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting File System Check on /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Activating swap /dev/mapper/vg_root-swap...
Jan 27 11:33:14 lnxnfs01 lvm[1494]: 2 logical volume(s) in volume group "vg_root" now active
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started LVM2 PV scan on device 9:127.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Activated swap /dev/mapper/vg_root-swap.
Jan 27 11:33:14 lnxnfs01 kernel: Adding 32763900k swap on /dev/mapper/vg_root-swap.  Priority:-1 extents:1 across:32763900k FS
Jan 27 11:33:14 lnxnfs01 systemd-fsck[1499]: /dev/md126: clean, 345/64128 files, 47930/256240 blocks
Jan 27 11:33:14 lnxnfs01 systemd[1]: Reached target Swap.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Starting Swap.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Started File System Check on /dev/disk/by-uuid/283669e9-f32c-498d-b848-c6f91738c959.
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounting /boot...
Jan 27 11:33:14 lnxnfs01 systemd[1]: Mounted /boot.
Jan 27 11:33:14 lnxnfs01 kernel: EXT4-fs (md126): mounted filesystem with ordered data mode. Opts: (null)
Jan 27 11:33:21 lnxnfs01 systemd[1]: Started Activation of DM RAID sets.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Reached target Encrypted Volumes.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Starting Encrypted Volumes.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Reached target Local File Systems.
Jan 27 11:33:21 lnxnfs01 systemd[1]: Starting Local File Systems.
...
Jan 27 11:33:21 lnxnfs01 kernel: EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)
Jan 27 11:33:21 lnxnfs01 systemd[1]: Mounted Mount /share RAID partition explicitly.

... + 31 seconds from disk initialization, expiration of 30 second timer from mdadm-last-resort@.timer

Jan 27 11:33:45 lnxnfs01 systemd[1]: Created slice system-mdadm\x2dlast\x2dresort.slice.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Starting system-mdadm\x2dlast\x2dresort.slice.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopped target Local File Systems.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopping Local File Systems.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Unmounting Mount /share RAID partition explicitly...
Jan 27 11:33:45 lnxnfs01 systemd[1]: Starting Activate md array even though degraded...
Jan 27 11:33:45 lnxnfs01 systemd[1]: Stopped (with error) /dev/md0.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Started Activate md array even though degraded.
Jan 27 11:33:45 lnxnfs01 systemd[1]: Unmounted Mount /share RAID partition explicitly.

... + 121 seconds from disk initialization, expiration of 120 second timer from TimeoutSec=120 in /etc/systemd/system/share.mount

Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-md-0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/md/0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-md-0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2dname\x2dlnxnfs01:0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/disk/by-id/md-name-lnxnfs01:0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2dname\x2dlnxnfs01:0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job sys-devices-virtual-block-md0.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /sys/devices/virtual/block/md0.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job sys-devices-virtual-block-md0.device/stop failed with result 'timeout'.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2duuid\x2d97566d2f:ae7a169b:966f5840:3e8267f9.device/stop timed out.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Timed out stoppping /dev/disk/by-id/md-uuid-97566d2f:ae7a169b:966f5840:3e8267f9.
Jan 27 11:35:15 lnxnfs01 systemd[1]: Job dev-disk-by\x2did-md\x2duuid\x2d97566d2f:ae7a169b:966f5840:3e8267f9.device/stop failed with result 'timeout'.


In the logs above, the timer is started to wait for devices, it does not stop immediately because it sees a degraded array. It continues for 30 seconds and then kicks out my mount believing the raid is problematic. 

As an experiment, I've increased the values for the mdadm-last-resort@.timer to OnActiveSec=300 and TimeoutSec=320. This give me enough time to log in to the system. During that time, I can view the RAID and everything appears proper, yet 300 seconds later the mdadm-last-resort@.timer expires with an error on /dev/md0?

Perhaps systemd is working normally, but then the question is why is the raid not being assembled properly - which triggers /usr/lib/systemd/system/mdadm-last-resort@.service?

From your suggestion I need to next move to full udev debug logs? Can I delay the assembly of the RAID by the kernel if this is a race condition, as it appears that might be the case? Should that delay be early in the systemd startup process or elsewhere?

Thanks again,
Luke Pyzowski
_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel

^ permalink raw reply

* Re: split scsi passthrough fields out of struct request V2
From: Bart Van Assche @ 2017-01-27 21:27 UTC (permalink / raw)
  To: hch@lst.de, axboe@fb.com
  Cc: linux-block@vger.kernel.org, linux-scsi@vger.kernel.org,
	snitzer@redhat.com, linux-raid@vger.kernel.org,
	dm-devel@redhat.com, j-nomura@ce.jp.nec.com
In-Reply-To: <1485365126-23210-1-git-send-email-hch@lst.de>

On Wed, 2017-01-25 at 18:25 +0100, Christoph Hellwig wrote:
> this series splits the support for SCSI passthrough commands from the
> main struct request used all over the block layer into a separate
> scsi_request structure that drivers that want to support SCSI passthough
> need to embedded as the first thing into their request-private data,
> similar to how we handle NVMe passthrough commands.
> 
> To support this I've added support for that the private data after
> request structure to the legacy request path instead, so that it can
> be treated the same way as the blk-mq path.  Compare to the current
> scsi_cmnd allocator that actually is a major simplification.
> 
> Changes since V1:
>  - fix handling of a NULL sense pointer in __scsi_execute
>  - clean up handling of the flush flags in the block layer and MD
>  - additional small cleanup in dm-rq

Hello Christoph,

A general comment: patch "block: allow specifying size for extra
command data" is a very welcome improvement but unfortunately also
introduces an inconsistency among block drivers. This patch series
namely creates two kinds of block drivers:
- Block drivers that use the block layer core to allocate
  request-private data. These block drivers set request.cmd_size
  to a non-zero value and do not need request.special.
- Block drivers that allocate request-private data themselves.
  These block drivers set request.cmd_size to zero and use
  request.special to translate a request pointer into the private
  data pointer.

Have you considered to convert all block drivers to the new
approach and to get rid of request.special? If so, do you already
have plans to start working on this? I'm namely wondering wheter I
should start working on this myself.

Thanks,

Bart.

^ permalink raw reply

* [GIT PULL] MD update for 4.10-rc6
From: Shaohua Li @ 2017-01-27 21:34 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel, linux-raid, neilb

Hi,

Please pull MD fixes. This pull fixes several corner cases for raid5 cache,
which is merged into this cycle.

Thanks,
Shaohua

The following changes since commit 557ed56cc75e0a33c15ba438734a280bac23bd32:

  Merge tag 'sound-4.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound (2017-01-12 14:45:59 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shli/md.git tags/md/4.10-rc6

for you to fetch changes up to 2e38a37f23c98d7fad87ff022670060b8a0e2bf5:

  md/r5cache: disable write back for degraded array (2017-01-24 11:26:06 -0800)

----------------------------------------------------------------
Shaohua Li (1):
      md/raid5-cache: delete meaningless code

Song Liu (5):
      md/r5cache: read data into orig_page for prexor of cached data
      md/raid5: move comment of fetch_block to right location
      md/r5cache: flush data only stripes in r5l_recovery_log()
      md/r5cache: shift complex rmw from read path to write path
      md/r5cache: disable write back for degraded array

 drivers/md/md.c          |   5 ++
 drivers/md/raid5-cache.c | 106 ++++++++++++++++++++++++++++++++++-------
 drivers/md/raid5.c       | 121 ++++++++++++++++++++++++++++++++++++-----------
 drivers/md/raid5.h       |   7 +++
 4 files changed, 194 insertions(+), 45 deletions(-)

^ permalink raw reply

* Re: [PATCH v2] md linear: fix a race between linear_add() and linear_congested()
From: Shaohua Li @ 2017-01-27 21:44 UTC (permalink / raw)
  To: colyli; +Cc: linux-raid, Shaohua Li, Neil Brown, stable
In-Reply-To: <1485538209-17201-1-git-send-email-colyli@suse.de>

On Sat, Jan 28, 2017 at 01:30:09AM +0800, colyli@suse.de wrote:
> Recently I receie a report that on Linux v3.0 based kerenl, hot add disk
> to a md linear device causes kernel crash at linear_congested(). From the
> crash image analysis, I find in linear_congested(), mddev->raid_disks
> contains value N, but conf->disks[] only has N-1 pointers available. Then
> a pointer deference to a NULL pointer crashes the kernel.
> 
> There is a race between linear_add() and linear_congested(), RCU stuffs
> used in these two functions cannot avoid the race. Since Linuv v4.0
> RCU code is replaced by introducing mddev_suspend().  After checking the
> upstream code, it seems linear_congested() is not called in
> generic_make_request() code patch, so mddev_suspend() cannot provent it
> from being called. The possible race still exists.
> 
> Here I explain how the race still exists in current code.  For a machine
> has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
> md linear device; at the same time on other CPU, linear_congested() is
> called to detect whether this md linear device is congested before issuing
> an I/O request onto it.
> 
> Now I use a possible code execution time sequence to demo how the possible
> race happens,
> 
> seq    linear_add()                linear_congested()
>  0                                 conf=mddev->private
>  1   oldconf=mddev->private
>  2   mddev->raid_disks++
>  3                              for (i=0; i<mddev->raid_disks;i++)
>  4                                bdev_get_queue(conf->disks[i].rdev->bdev)
>  5   mddev->private=newconf
> 
> In linear_add() mddev->raid_disks is increased in time seq 2, and on
> another CPU in linear_congested() the for-loop iterates conf->disks[i] by
> the increased mddev->raid_disks in time seq 3,4. But conf with one more
> element (which is a pointer to struct dev_info type) to conf->disks[] is
> not updated yet, accessing its structure member in time seq 4 will cause a
> NULL pointer deference fault.
> 
> The fix includes 2 parts of modification,
>  1) In linear_add(), update mddev->private with new value before
>     increasing mddev->raid_disks, and to make sure on other CPUs their are
>     seen to be updated in same order as linear_add() does (otherwise the
>     race may still happen), a smp_mb() is necessary.
>  2) RCU stuffs are back, to make sure in linear_add() the oldconf won't be
>     destoried when it is still referenced in linear_congested().
> 
> A question is, by this fix, if mddev->private is update to new value in
> linear_add(), but in linear_congested() the for-loop still tests old value
> of mddev->raid_disks, then the iteration will miss the last element of
> conf->disks[]. My answer is don't worry it, it's OK. the reasons are,
>  - When updating mddev->private, the md linear device is suspend, no I/O
>    may happen, it is safe to missing congestion status of the last
>    new-added hard disk.
>  - In the worst case linear_congested() returns 0 and I/O sent to this md
>    linear device, but the new added hard disk is congested, then the I/O
>    request will be blocked for a while if it just happenly hits the new
>    added hard disk. linear_congested() is in code path of wb_congested(),
>    which is quite hot in write back code path. Comparing to add locking
>    code in linear_congested(), the cost of the worst case is acceptable.
> 
> The bug is reported on Linux v3.0 based kernel, it can and should be
> applied to all kernels since Linux v3.0. I see linear_add() is merged into
> mainline since Linux v2.6.18, maybe stable kernel maintainers after this
> version may consider to pick this fix as well.
> 
> Changelog:
>  - v2: add RCU stuffs by suggestion from Shaohua and Neil.
>  - v1: initial effort.

Neil's idea is to store raid_disks in 'struct linear_conf'. In this way, we
never need to worry about the raid_disks and conf aren't consistent. So the
barrier in linear_add is unncessary.
 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Neil Brown <neilb@suse.com>
> Cc: stable@vger.kernel.org
> ---
>  drivers/md/linear.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/linear.c b/drivers/md/linear.c
> index 5975c99..4f1690c 100644
> --- a/drivers/md/linear.c
> +++ b/drivers/md/linear.c
> @@ -58,13 +58,15 @@ static int linear_congested(struct mddev *mddev, int bits)
>  	struct linear_conf *conf;
>  	int i, ret = 0;
>  
> -	conf = mddev->private;
> +	rcu_read_lock();
> +	conf = rcu_dereference(mddev->private);
>  
>  	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
>  		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);
>  		ret |= bdi_congested(&q->backing_dev_info, bits);
>  	}
>  
> +	rcu_read_unlock();
>  	return ret;
>  }
>  
> @@ -173,6 +175,13 @@ static int linear_run (struct mddev *mddev)
>  	return ret;
>  }
>  
> +static void free_conf(struct rcu_head *head)
> +{
> +	struct linear_conf *conf =
> +			container_of(head, struct linear_conf, rcu);
> +	kfree(conf);
> +}
> +
>  static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
>  {
>  	/* Adding a drive to a linear array allows the array to grow.
> @@ -196,15 +205,27 @@ static int linear_add(struct mddev *mddev, struct md_rdev *rdev)
>  	if (!newconf)
>  		return -ENOMEM;
>  
> +	/* In linear_congested(), mddev->raid_disks and mddev->private
> +	 * are accessed without protection by mddev_suspend(). If on
> +	 * another CPU,  in linear_congested() mddev->private is still seen
> +	 * to contains old value but mddev->raid_disks is seen to have the
> +	 * increased value, the last iteration to conf->disks[i].rdev will
> +	 * trigger a NULL pointer deference. To avoid this race, here
> +	 * mddev->private must be updated before increasing
> +	 * mddev->raid_disks, and a smp_mb() is required between them. Then
> +	 * in linear_congested(), we are sure the updated mddev->private is
> +	 * seen when iterating conf->disks[i].
> +	 */
>  	mddev_suspend(mddev);
> -	oldconf = mddev->private;
> +	oldconf = rcu_dereference(mddev->private);
> +	rcu_assign_pointer(mddev->private, newconf);
> +	smp_mb();
>  	mddev->raid_disks++;
> -	mddev->private = newconf;
>  	md_set_array_sectors(mddev, linear_size(mddev, 0, 0));
>  	set_capacity(mddev->gendisk, mddev->array_sectors);
>  	mddev_resume(mddev);
>  	revalidate_disk(mddev->gendisk);
> -	kfree(oldconf);
> +	call_rcu(&oldconf->rcu, free_conf);

we have a handy kfree_rcu just for this.

Thanks,
Shaohua

^ permalink raw reply

* Re: [PATCH 14/18] scsi: remove __scsi_alloc_queue
From: hch @ 2017-01-28  8:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch@lst.de, axboe@fb.com, linux-scsi@vger.kernel.org,
	linux-raid@vger.kernel.org, dm-devel@redhat.com,
	linux-block@vger.kernel.org, snitzer@redhat.com,
	j-nomura@ce.jp.nec.com
In-Reply-To: <1485539862.4267.17.camel@sandisk.com>

On Fri, Jan 27, 2017 at 05:58:02PM +0000, Bart Van Assche wrote:
> Since __scsi_init_queue() modifies data in the Scsi_Host structure, have you
> considered to add the declaration for this function to <scsi/scsi_host.h>?
> If you want to keep this declaration in <scsi/scsi_transport.h> please add a
> direct include of that header file to drivers/scsi/scsi_lib.c such that the
> declaration remains visible to the compiler if someone would minimize the
> number of #include directives in SCSI header files.

Feel free to send an incremental patch either way.  In the long run
I'd really like to kill off __scsi_init_queue and remove the transport
BSG queue abuse of SCSI internals, though.

^ permalink raw reply

* Re: [PATCH 15/18] scsi: allocate scsi_cmnd structures as part of struct request
From: hch @ 2017-01-28  8:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-raid@vger.kernel.org, snitzer@redhat.com,
	linux-scsi@vger.kernel.org, axboe@fb.com, dm-devel@redhat.com,
	linux-block@vger.kernel.org, j-nomura@ce.jp.nec.com, hch@lst.de
In-Reply-To: <1485542367.4267.19.camel@sandisk.com>

On Fri, Jan 27, 2017 at 06:39:46PM +0000, Bart Van Assche wrote:
> Why have the scsi_release_buffers() and scsi_put_command(cmd) calls been
> moved up? I haven't found an explanation for this change in the patch
> description.

Because they reference the scsi_cmnd, which are now part of the request
and thus freed by blk_finish_request.  And yes, I should have mentioned
it in the changelog, sorry.

> Please also consider to remove the cmd->request->special = NULL assignments
> via this patch. Since this patch makes the lifetime of struct scsi_cmnd and
> struct request identical these assignments are no longer needed.

True.  If I had to resend again I would have fixed it up, but it's probably
not worth the churn now.

> This patch introduces the function scsi_exit_rq(). Having two functions
> for the single-queue path that release resources (scsi_release_buffers()
> and scsi_exit_rq()) is confusing. Since every scsi_release_buffers() call
> is followed by a blk_unprep_request() call, have you considered to move
> the scsi_release_buffers() call into scsi_unprep_fn() via an additional
> patch?

We could have done that.  But it's just more change for a code path
that I hope won't survive this calendar year.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox