Linux block layer

Linux block layer
 help / color / mirror / Atom feed

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Minchan Kim @ 2017-03-07  8:55 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Johannes Thumshirn, Jens Axboe, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, yizhan, Linux Block Layer Mailinglist,
	Linux Kernel Mailinglist
In-Reply-To: <ed4d83a1-9bdd-9e24-7768-ba5e85429110@suse.de>

On Tue, Mar 07, 2017 at 08:48:06AM +0100, Hannes Reinecke wrote:
> On 03/07/2017 08:23 AM, Minchan Kim wrote:
> > Hi Hannes,
> > 
> > On Tue, Mar 7, 2017 at 4:00 PM, Hannes Reinecke <hare@suse.de> wrote:
> >> On 03/07/2017 06:22 AM, Minchan Kim wrote:
> >>> Hello Johannes,
> >>>
> >>> On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
> >>>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
> >>>> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
> >>>> pages attached to the bio's bvec this results in a kernel panic because of
> >>>> array out of bounds accesses in zram_decompress_page().
> >>>
> >>> First of all, thanks for the report and fix up!
> >>> Unfortunately, I'm not familiar with that interface of block layer.
> >>>
> >>> It seems this is a material for stable so I want to understand it clear.
> >>> Could you say more specific things to educate me?
> >>>
> >>> What scenario/When/How it is problem?  It will help for me to understand!
> >>>
> > 
> > Thanks for the quick response!
> > 
> >> The problem is that zram as it currently stands can only handle bios
> >> where each bvec contains a single page (or, to be precise, a chunk of
> >> data with a length of a page).
> > 
> > Right.
> > 
> >>
> >> This is not an automatic guarantee from the block layer (who is free to
> >> send us bios with arbitrary-sized bvecs), so we need to set the queue
> >> limits to ensure that.
> > 
> > What does it mean "bios with arbitrary-sized bvecs"?
> > What kinds of scenario is it used/useful?
> > 
> Each bio contains a list of bvecs, each of which points to a specific
> memory area:
> 
> struct bio_vec {
> 	struct page	*bv_page;
> 	unsigned int	bv_len;
> 	unsigned int	bv_offset;
> };
> 
> The trick now is that while 'bv_page' does point to a page, the memory
> area pointed to might in fact be contiguous (if several pages are
> adjacent). Hence we might be getting a bio_vec where bv_len is _larger_
> than a page.

Thanks for detail, Hannes!

If I understand it correctly, it seems to be related to bid_add_page
with high-order page. Right?

If so, I really wonder why I don't see such problem because several
places have used it and I expected some of them might do IO with
contiguous pages intentionally or by chance. Hmm,

IIUC, it's not a nvme specific problme but general problem which
can trigger normal FSes if they uses contiguos pages?

> 
> Hence the check for 'is_partial_io' in zram_drv.c (which just does a
> test 'if bv_len != PAGE_SIZE) is in fact wrong, as it would trigger for
> partial I/O (ie if the overall length of the bio_vec is _smaller_ than a
> page), but also for multipage bvecs (where the length of the bio_vec is
> _larger_ than a page).

Right. I need to look into that. Thanks for the pointing out!

> 
> So rather than fixing the bio scanning loop in zram it's easier to set
> the queue limits correctly so that 'is_partial_io' does the correct
> thing and the overall logic in zram doesn't need to be altered.


Isn't that approach require new bio allocation through blk_queue_split?
Maybe, it wouldn't make severe regression in zram-FS workload but need
to test.

Is there any ways to trigger the problem without real nvme device?
It would really help to test/measure zram.

Anyway, to me, it's really subtle at this moment so I doubt it should
be stable material. :(

^ permalink raw reply

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Johannes Thumshirn @ 2017-03-07  9:51 UTC (permalink / raw)
  To: Minchan Kim, Hannes Reinecke
  Cc: Jens Axboe, Nitin Gupta, Christoph Hellwig, Sergey Senozhatsky,
	yizhan, Linux Block Layer Mailinglist, Linux Kernel Mailinglist
In-Reply-To: <20170307085545.GA538@bbox>

On 03/07/2017 09:55 AM, Minchan Kim wrote:
> On Tue, Mar 07, 2017 at 08:48:06AM +0100, Hannes Reinecke wrote:
>> On 03/07/2017 08:23 AM, Minchan Kim wrote:
>>> Hi Hannes,
>>>
>>> On Tue, Mar 7, 2017 at 4:00 PM, Hannes Reinecke <hare@suse.de> wrote:
>>>> On 03/07/2017 06:22 AM, Minchan Kim wrote:
>>>>> Hello Johannes,
>>>>>
>>>>> On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
>>>>>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
>>>>>> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
>>>>>> pages attached to the bio's bvec this results in a kernel panic because of
>>>>>> array out of bounds accesses in zram_decompress_page().
>>>>>
>>>>> First of all, thanks for the report and fix up!
>>>>> Unfortunately, I'm not familiar with that interface of block layer.
>>>>>
>>>>> It seems this is a material for stable so I want to understand it clear.
>>>>> Could you say more specific things to educate me?
>>>>>
>>>>> What scenario/When/How it is problem?  It will help for me to understand!
>>>>>
>>>
>>> Thanks for the quick response!
>>>
>>>> The problem is that zram as it currently stands can only handle bios
>>>> where each bvec contains a single page (or, to be precise, a chunk of
>>>> data with a length of a page).
>>>
>>> Right.
>>>
>>>>
>>>> This is not an automatic guarantee from the block layer (who is free to
>>>> send us bios with arbitrary-sized bvecs), so we need to set the queue
>>>> limits to ensure that.
>>>
>>> What does it mean "bios with arbitrary-sized bvecs"?
>>> What kinds of scenario is it used/useful?
>>>
>> Each bio contains a list of bvecs, each of which points to a specific
>> memory area:
>>
>> struct bio_vec {
>> 	struct page	*bv_page;
>> 	unsigned int	bv_len;
>> 	unsigned int	bv_offset;
>> };
>>
>> The trick now is that while 'bv_page' does point to a page, the memory
>> area pointed to might in fact be contiguous (if several pages are
>> adjacent). Hence we might be getting a bio_vec where bv_len is _larger_
>> than a page.
> 
> Thanks for detail, Hannes!
> 
> If I understand it correctly, it seems to be related to bid_add_page
> with high-order page. Right?
> 
> If so, I really wonder why I don't see such problem because several
> places have used it and I expected some of them might do IO with
> contiguous pages intentionally or by chance. Hmm,
> 
> IIUC, it's not a nvme specific problme but general problem which
> can trigger normal FSes if they uses contiguos pages?
> 

I'm not a FS expert, but a quick grep shows that non of the file-systems
does the

for_each_sg()
	while(bio_add_page()))

trick NVMe does.

>>
>> Hence the check for 'is_partial_io' in zram_drv.c (which just does a
>> test 'if bv_len != PAGE_SIZE) is in fact wrong, as it would trigger for
>> partial I/O (ie if the overall length of the bio_vec is _smaller_ than a
>> page), but also for multipage bvecs (where the length of the bio_vec is
>> _larger_ than a page).
> 
> Right. I need to look into that. Thanks for the pointing out!
> 
>>
>> So rather than fixing the bio scanning loop in zram it's easier to set
>> the queue limits correctly so that 'is_partial_io' does the correct
>> thing and the overall logic in zram doesn't need to be altered.
> 
> 
> Isn't that approach require new bio allocation through blk_queue_split?
> Maybe, it wouldn't make severe regression in zram-FS workload but need
> to test.

Yes, but blk_queue_split() needs information how big the bvecs can be,
hence the patch.

For details have a look into blk_bio_segment_split() in block/blk-merge.c

It get's the max_sectors from blk_max_size_offset() which is
q->limits.max_sectors when q->limits.chunk_sectors isn't set and
then loops over the bio's bvecs to check when to split the bio and then
calls bio_split() when appropriate.

> 
> Is there any ways to trigger the problem without real nvme device?
> It would really help to test/measure zram.

It isn't a /real/ device but the fabrics loopback target. If you want a
fast reproducible test-case, have a look at:

https://github.com/ddiss/rapido/
the cut_nvme_local.sh script set's up the correct VM for this test. Then
a simple mkfs.xfs /dev/nvme0n1 will oops.

> 
> Anyway, to me, it's really subtle at this moment so I doubt it should
> be stable material. :(

I'm not quite sure, it's at least 4.11 material. See above.

Thanks,
	Johannes


-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: Felix Imendï¿½rffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nï¿½rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply

* LOOP_CLR_FD on partition removes mapping for whole device
From: Mirko Vogt @ 2017-03-07 10:34 UTC (permalink / raw)
  To: linux-block

Hello,

I'm experiencing unexpected behaviour with the ioctl LOOP_CLR_FD on a
*partition* of a loop device: it disassociates the whole (parent) loop
*device*, rather than (trying to) remove the mapping of the addressed
partition.

== Scenario / Environment / Use

In my scenario, I `partprobe` a loop device for having convenient access
to its partitions (I think it uses ioctl(BLKRRPART), but I'm not totally
sure..).

When now calling ioctl(LOOP_CLR_FD) on the loop *partition*, the mapping
of its parent - the overall loop *device* - is removed.

A `strace` reveals, that umount indeed opens the loop partition, so the
ioctl is called on the right inode (I first assumed it being a bug in
umount calling the parent device).

I would not expect ioctl(LOOP_CLR_FD) being called on a loop partition
would result in disassociation of the overall device holding that partition.

Long story short (userland is busybox which behaves slightly different
than the typical tools from util-linux, e.g. umount *by default* gets
rid of the loop device mapping when being called on a dir mounting a
loop device[1]):

root@LEDE:/# loop=$(losetup -f)
root@LEDE:/# echo ${loop}
/dev/loop2
root@LEDE:/# losetup ${loop} /IMAGE
root@LEDE:/# ls -l ${loop}*
brw-------  1 root root     7,   2 Mar  6 20:09 /dev/loop2
root@LEDE:/# partprobe ${loop}
root@LEDE:/# ls -l ${loop}*
brw-------  1 root  root    7,   2 Mar  6 20:09 /dev/loop2
brw-------  1 root  root  259,   8 Mar  6 21:59 /dev/loop2p1
brw-------  1 root  root  259,   9 Mar  6 21:59 /dev/loop2p2
brw-------  1 root  root  259,  10 Mar  6 21:59 /dev/loop2p3
brw-------  1 root  root  259,  11 Mar  6 21:59 /dev/loop2p4
brw-------  1 root  root  259,  12 Mar  6 21:59 /dev/loop2p5
brw-------  1 root  root  259,  13 Mar  6 21:59 /dev/loop2p6
brw-------  1 root  root  259,  14 Mar  6 21:59 /dev/loop2p7
brw-------  1 root  root  259,  15 Mar  6 21:59 /dev/loop2p8
root@LEDE:/# mount ${loop}p8 /MOUNT       # mount loop partition
root@LEDE:/# losetup -a | grep $loop      # loop dev mapping still there
/dev/loop2: 0 /mnt/IMAGE
root@LEDE:/# strace umount /MOUNT 2> /log # unmount loop partition
root@LEDE:/# losetup -a | grep ${loop}    # loop device mapping is gone
root@LEDE:/# grep -i loop /log
open("/dev/loop2p7", O_RDONLY|O_LARGEFILE) = 3
ioctl(3, LOOP_CLR_FD)                   = 0
root@LEDE:/#

Is this intended behaviour or a bug?

Cheers

  mirko

http://lists.busybox.net/pipermail/busybox/2012-April/077619.html

^ permalink raw reply

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
From: Jan Kara @ 2017-03-07 11:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown,
	Omar Sandoval
In-Reply-To: <20170306220459.GJ26127@htj.duckdns.org>

On Mon 06-03-17 17:04:59, Tejun Heo wrote:
> Hello,
> 
> On Mon, Mar 06, 2017 at 05:33:55PM +0100, Jan Kara wrote:
> > +	disk->flags &= ~GENHD_FL_UP;
> > +	/*
> > +	 * Make sure __blkdev_open() sees the disk is going away before
> > +	 * starting to unhash bdev inodes.
> > +	 */
> > +	smp_wmb();
> 
> But which rmb is this paired with?  Without paring memory barriers
> don't do anything.

It is paired with the fact that the test (disk->flags & GENHD_FL_UP) in
__blkdev_get() cannot be reordered before the bd_acquire() call in
blkdev_open() (or however the bdev is looked up). I was thinking so long
about how and where to sanely comment on that that I did not write anything
in the end.

I was also thinking about adding an explicit barrier before that test just
to make things explicit since as you say below the first blkdev open is not
that hot path. Maybe that would be a better option?

> Given that this isn't a super hot path, I think it'd be far better to
> stick to a simpler synchronization mechanism.

I'd be happy to do so however struct gendisk does not have any lock in it
and introducing a special lock just for that seems awkward.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Hannes Reinecke @ 2017-03-07  7:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Thumshirn, Jens Axboe, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, yizhan, Linux Block Layer Mailinglist,
	Linux Kernel Mailinglist
In-Reply-To: <CAEwNFnAb4x+sm6=vWd1W4Xj7Snp1V_LtRRx9ORFKpm1b-58kyw@mail.gmail.com>

On 03/07/2017 08:23 AM, Minchan Kim wrote:
> Hi Hannes,
> 
> On Tue, Mar 7, 2017 at 4:00 PM, Hannes Reinecke <hare@suse.de> wrote:
>> On 03/07/2017 06:22 AM, Minchan Kim wrote:
>>> Hello Johannes,
>>>
>>> On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
>>>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
>>>> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
>>>> pages attached to the bio's bvec this results in a kernel panic because of
>>>> array out of bounds accesses in zram_decompress_page().
>>>
>>> First of all, thanks for the report and fix up!
>>> Unfortunately, I'm not familiar with that interface of block layer.
>>>
>>> It seems this is a material for stable so I want to understand it clear.
>>> Could you say more specific things to educate me?
>>>
>>> What scenario/When/How it is problem?  It will help for me to understand!
>>>
> 
> Thanks for the quick response!
> 
>> The problem is that zram as it currently stands can only handle bios
>> where each bvec contains a single page (or, to be precise, a chunk of
>> data with a length of a page).
> 
> Right.
> 
>>
>> This is not an automatic guarantee from the block layer (who is free to
>> send us bios with arbitrary-sized bvecs), so we need to set the queue
>> limits to ensure that.
> 
> What does it mean "bios with arbitrary-sized bvecs"?
> What kinds of scenario is it used/useful?
> 
Each bio contains a list of bvecs, each of which points to a specific
memory area:

struct bio_vec {
	struct page	*bv_page;
	unsigned int	bv_len;
	unsigned int	bv_offset;
};

The trick now is that while 'bv_page' does point to a page, the memory
area pointed to might in fact be contiguous (if several pages are
adjacent). Hence we might be getting a bio_vec where bv_len is _larger_
than a page.

Hence the check for 'is_partial_io' in zram_drv.c (which just does a
test 'if bv_len != PAGE_SIZE) is in fact wrong, as it would trigger for
partial I/O (ie if the overall length of the bio_vec is _smaller_ than a
page), but also for multipage bvecs (where the length of the bio_vec is
_larger_ than a page).

So rather than fixing the bio scanning loop in zram it's easier to set
the queue limits correctly so that 'is_partial_io' does the correct
thing and the overall logic in zram doesn't need to be altered.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply

* Re: [PATCH 0/13 v2] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-07 13:57 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo,
	NeilBrown
In-Reply-To: <20170306203818.GC16893@vader.DHCP.thefacebook.com>

On Mon 06-03-17 12:38:18, Omar Sandoval wrote:
> On Wed, Mar 01, 2017 at 04:11:09PM +0100, Jan Kara wrote:
> > On Tue 28-02-17 23:26:53, Omar Sandoval wrote:
> > > On Tue, Feb 28, 2017 at 11:25:28PM -0800, Omar Sandoval wrote:
> > > > On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:
> > > > > On Tue 21-02-17 10:19:28, Jens Axboe wrote:
> > > > > > On 02/21/2017 10:09 AM, Jan Kara wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > this is a second revision of the patch set to fix several different races and
> > > > > > > issues I've found when testing device shutdown and reuse. The first three
> > > > > > > patches are fixes to problems in my previous series fixing BDI lifetime issues.
> > > > > > > Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I cannot
> > > > > > > reproduce the BDI name reuse issues using Omar's stress test using scsi_debug
> > > > > > > so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops that
> > > > > > > is triggered by __blkdev_put() calling inode_detach_wb() too early (the problem
> > > > > > > reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
> > > > > > > where get_gendisk() can return already freed gendisk structure (again triggered
> > > > > > > by Omar's stress test).
> > > > > > > 
> > > > > > > People, please have a look at patches. They are mostly simple however the
> > > > > > > interactions are rather complex so I may have missed something. Also I'm
> > > > > > > happy for any additional testing these patches can get - I've stressed them
> > > > > > > with Omar's script, tested memcg writeback, tested static (not udev managed)
> > > > > > > device inodes.
> > > > > > > 
> > > > > > > Jens, I think at least patches 1-3 should go in together with my fixes you
> > > > > > > already have in your tree (or shortly after them). It is up to you whether
> > > > > > > you decide to delay my first fixes or pick these up quickly. Patch 4 is
> > > > > > > (IMHO a cleaner) replacement of Dan's patches so consider whether you want
> > > > > > > to use it instead of those patches.
> > > > > > 
> > > > > > I have applied 1-3 to my for-linus branch, which will go in after
> > > > > > the initial pull request has been pulled by Linus. Consider fixing up
> > > > > > #4 so it applies, I like it.
> > > > > 
> > > > > OK, attached is patch 4 rebased on top of Linus' tree from today which
> > > > > already has linux-block changes pulled in. I've left put_disk_devt() in
> > > > > blk_cleanup_queue() to maintain the logic in the original patch (now commit
> > > > > 0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
> > > > > The bdi_unregister() call that is moved to del_gendisk() by this patch is
> > > > > now protected by the gendisk reference instead of the request_queue one
> > > > > so it still maintains the property that devt reference protects bdi
> > > > > registration-unregistration lifetime (as much as that is not needed anymore
> > > > > after this patch).
> > > > > 
> > > > > I have also updated the comment in the code and the changelog - they were
> > > > > somewhat stale after changes to the whole series Tejun suggested.
> > > > > 
> > > > > 								Honza
> > > > 
> > > > Hey, Jan, I just tested this out when I was seeing similar crashes with
> > > > sr instead of sd, and this fixed it.
> > > > 
> > > > Tested-by: Omar Sandoval <osandov@fb.com>
> > > 
> > > Just realized it wasn't clear, I'm talking about patch 4 specifically.
> > 
> > Thanks for confirmation!
> > 
> > 								Honza
> 
> Unfortunately, this patch actually seems to have introduced a
> regression. Reverting the patch fixes it.
> 
> I'm running a QEMU kvm vm with a virtio-scsi device and I get oopses
> like this:

What workload do you run? Or is it enough to just boot the kvm with
virtio-scsi enabled? Can you send me the kvm setup so that I can reproduce
this?

At least the KASAN error could be a result of the situation where someone
called bdi_register() but didn't call bdi_unregister() before dropping the
last bdi reference (which would make sense given I've moved
bdi_unregister() call). However I'd expect a warning from bdi_exit() in that
case and I don't see that in your kernel log. So I'll try to reproduce the
issue and debug more.

								Honza

> 
> [    6.741773] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
> [    6.744401] IP: virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi]
> [    6.744985] PGD 0
> [    6.744985]
> [    6.744985] Oops: 0002 [#1] PREEMPT SMP
> [    6.744985] Modules linked in: virtio_scsi scsi_mod virtio_net
> [    6.744985] CPU: 3 PID: 5 Comm: kworker/u8:0 Not tainted 4.11.0-rc1 #36
> [    6.744985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
> [    6.744985] Workqueue: events_unbound async_run_entry_fn
> [    6.744985] task: ffff88007c8672c0 task.stack: ffffc9000033c000
> [    6.750038] RIP: 0010:virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi]
> [    6.750038] RSP: 0018:ffffc9000033f8b0 EFLAGS: 00010246
> [    6.750038] RAX: ffff88007c98d000 RBX: ffff880078c814c8 RCX: 0000000000000000
> [    6.750038] RDX: ffff880078c814c8 RSI: ffff880078c814c8 RDI: ffff88007c98d000
> [    6.750038] RBP: ffffc9000033f8b0 R08: 0000000000000000 R09: 0000000000000400
> [    6.750038] R10: ffff88007b1fe748 R11: 0000000000000024 R12: ffff880078c814c8
> [    6.750038] R13: ffff88007c98d000 R14: ffff880078c81380 R15: ffff88007b532000
> [    6.750038] FS:  0000000000000000(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
> [    6.750038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    6.750038] CR2: 0000000000000004 CR3: 0000000001c09000 CR4: 00000000000006e0
> [    6.750038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [    6.750038] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [    6.750038] Call Trace:
> [    6.750038]  scsi_dispatch_cmd+0xa3/0x1d0 [scsi_mod]
> [    6.750038]  scsi_queue_rq+0x586/0x740 [scsi_mod]
> [    6.750038]  blk_mq_dispatch_rq_list+0x22a/0x420
> [    6.750038]  blk_mq_sched_dispatch_requests+0x142/0x1b0
> [    6.750038]  __blk_mq_run_hw_queue+0x94/0xb0
> [    6.750038]  blk_mq_run_hw_queue+0x82/0xa0
> [    6.750038]  blk_mq_sched_insert_request+0x89/0x1a0
> [    6.750038]  ? blk_execute_rq+0xc0/0xc0
> [    6.750038]  blk_execute_rq_nowait+0x9a/0x140
> [    6.750038]  ? bio_phys_segments+0x19/0x20
> [    6.750038]  blk_execute_rq+0x56/0xc0
> [    6.750038]  scsi_execute+0x128/0x270 [scsi_mod]
> [    6.750038]  scsi_probe_and_add_lun+0x247/0xc60 [scsi_mod]
> [    6.750038]  ? __pm_runtime_resume+0x4c/0x60
> [    6.750038]  __scsi_scan_target+0x102/0x520 [scsi_mod]
> [    6.750038]  ? account_entity_dequeue+0xab/0xe0
> [    6.750038]  scsi_scan_channel+0x81/0xa0 [scsi_mod]
> [    6.750038]  scsi_scan_host_selected+0xcc/0x100 [scsi_mod]
> [    6.750038]  do_scsi_scan_host+0x8d/0x90 [scsi_mod]
> [    6.750038]  do_scan_async+0x1c/0x1a0 [scsi_mod]
> [    6.750038]  async_run_entry_fn+0x4a/0x170
> [    6.750038]  process_one_work+0x165/0x430
> [    6.750038]  worker_thread+0x4e/0x490
> [    6.750038]  kthread+0x101/0x140
> [    6.750038]  ? process_one_work+0x430/0x430
> [    6.750038]  ? kthread_create_on_node+0x60/0x60
> [    6.750038]  ret_from_fork+0x2c/0x40
> [    6.750038] Code: ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 4e 30 48 89 f8 48 89 f2 48 89 e5 48 8b 89 88 01 00 00 48 8b 89 08 03 00 00 <f0> ff 41 04 48 8d bf c0 07 00 00 48 8d b0 c8 09 00 00 e8 08 fd
> [    6.750038] RIP: virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi] RSP: ffffc9000033f8b0
> [    6.750038] CR2: 0000000000000004
> [    6.750038] ---[ end trace cee5df55872abfa0 ]---
> [    6.750038] note: kworker/u8:0[5] exited with preempt_count 1
> 
> Arch Linux 4.11.0-rc1 (ttyS0)
> 
> silver login: [   27.370267] scsi 0:0:53:0: tag#766 abort
> [   27.374501] scsi 0:0:53:0: device reset
> [   27.378539] scsi 0:0:53:0: Device offlined - not ready after error recovery
> 
> That crash is because tgt is NULL here:
> 
> ----
> static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
> 					struct scsi_cmnd *sc)
> {
> 	struct virtio_scsi *vscsi = shost_priv(sh);
> 	struct virtio_scsi_target_state *tgt =
> 				scsi_target(sc->device)->hostdata;
> 
> ======> atomic_inc(&tgt->reqs);
> 	return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
> }
> ----
> 
> Applying the rest of this patch set didn't fix it. Additionally,
> enabling KASAN gives this use-after-free spew:
> ----
> [    7.789280] SCSI subsystem initialized
> [    7.809991] virtio_net virtio0 ens2: renamed from eth0
> [    7.828301] scsi host0: Virtio SCSI HBA
> [    7.916214] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
> [    7.946713] ==================================================================
> [    7.947609] BUG: KASAN: use-after-free in rb_erase+0x13a2/0x1a20 at addr ffff88006a915e30
> [    7.948583] Write of size 8 by task dhcpcd-run-hook/146
> [    7.949202] CPU: 0 PID: 146 Comm: dhcpcd-run-hook Not tainted 4.11.0-rc1-00011-gc35ccd2d43e9 #42
> [    7.950021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
> [    7.950021] Call Trace:
> [    7.950021]  <IRQ>
> [    7.950021]  dump_stack+0x63/0x86
> [    7.950021]  kasan_object_err+0x21/0x70
> [    7.950021]  kasan_report.part.1+0x23a/0x520
> [    7.950021]  ? rb_erase+0x13a2/0x1a20
> [    7.950021]  ? swake_up_locked+0x94/0x1c0
> [    7.950021]  __asan_report_store8_noabort+0x31/0x40
> [    7.950021]  rb_erase+0x13a2/0x1a20
> [    7.950021]  wb_congested_put+0x6a/0xd0
> [    7.950021]  __blkg_release_rcu+0x16b/0x230
> [    7.950021]  rcu_process_callbacks+0x602/0xfc0
> [    7.950021]  __do_softirq+0x14e/0x5b3
> [    7.950021]  irq_exit+0x14e/0x180
> [    7.950021]  smp_apic_timer_interrupt+0x7b/0xa0
> [    7.950021]  apic_timer_interrupt+0x89/0x90
> [    7.950021] RIP: 0010:copy_strings.isra.7+0x312/0x6b0
> [    7.950021] RSP: 0018:ffff880063bffcd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
> [    7.950021] RAX: 0000000000002000 RBX: 0000000000000fe3 RCX: 0000000000000002
> [    7.950021] RDX: ffff88006a734880 RSI: 0000000000000002 RDI: ffff880069750808
> [    7.950021] RBP: ffff880063bffdd0 R08: 8000000063ec3067 R09: 0000000000000040
> [    7.950021] R10: ffffed000c7d8600 R11: ffffffff82afa6a0 R12: 00007fffffffefe3
> [    7.950021] R13: 0000000000000015 R14: ffff880069750780 R15: dffffc0000000000
> [    7.950021]  </IRQ>
> [    7.950021]  ? set_binfmt+0x120/0x120
> [    7.950021]  ? insert_vm_struct+0x148/0x2e0
> [    7.950021]  ? kasan_slab_alloc+0x12/0x20
> [    7.950021]  ? count.isra.6.constprop.16+0x52/0x100
> [    7.950021]  do_execveat_common.isra.14+0xef1/0x1b80
> [    7.950021]  ? prepare_bprm_creds+0x100/0x100
> [    7.950021]  ? getname_flags+0x90/0x3f0
> [    7.950021]  ? __do_page_fault+0x5cc/0xbc0
> [    7.950021]  SyS_execve+0x3a/0x50
> [    7.950021]  ? ptregs_sys_vfork+0x10/0x10
> [    7.950021]  do_syscall_64+0x180/0x380
> [    7.950021]  entry_SYSCALL64_slow_path+0x25/0x25
> [    7.950021] RIP: 0033:0x7f85de145477
> [    7.950021] RSP: 002b:00007ffdf2da3568 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
> [    7.950021] RAX: ffffffffffffffda RBX: 0000000000d771c0 RCX: 00007f85de145477
> [    7.950021] RDX: 0000000000d35fb0 RSI: 0000000000d697e0 RDI: 0000000000d771c0
> [    7.950021] RBP: 0000000000000000 R08: 00007ffdf2da3560 R09: 0000000000000030
> [    7.950021] R10: 000000000000059a R11: 0000000000000202 R12: 0000000000d771c0
> [    7.950021] R13: 0000000000d697e0 R14: 0000000000d35fb0 R15: 0000000000000000
> [    7.950021] Object at ffff88006a915b00, in cache kmalloc-1024 size: 1024
> [    7.950021] Allocated:
> [    7.950021] PID = 72
> [    7.950021]  save_stack_trace+0x1b/0x20
> [    7.950021]  kasan_kmalloc+0xee/0x190
> [    7.950021]  kmem_cache_alloc_node_trace+0x138/0x200
> [    7.950021]  bdi_alloc_node+0x4c/0xa0
> [    7.950021]  blk_alloc_queue_node+0xdd/0x870
> [    7.950021]  blk_mq_init_queue+0x41/0x90
> [    7.950021]  scsi_mq_alloc_queue+0x41/0x130 [scsi_mod]
> [    7.950021]  scsi_alloc_sdev+0x90e/0xd00 [scsi_mod]
> [    7.950021]  scsi_probe_and_add_lun+0x14b3/0x2250 [scsi_mod]
> [    7.950021]  __scsi_scan_target+0x23d/0xb60 [scsi_mod]
> [    7.950021]  scsi_scan_channel+0x107/0x160 [scsi_mod]
> [    7.950021]  scsi_scan_host_selected+0x1bf/0x250 [scsi_mod]
> [    7.950021]  do_scsi_scan_host+0x1bc/0x250 [scsi_mod]
> [    7.950021]  do_scan_async+0x41/0x4b0 [scsi_mod]
> [    7.950021]  async_run_entry_fn+0xc3/0x630
> [    7.950021]  process_one_work+0x531/0xf40
> [    7.950021]  worker_thread+0xe4/0x10d0
> [    7.950021]  kthread+0x298/0x390
> [    7.950021]  ret_from_fork+0x2c/0x40
> [    7.950021] Freed:
> [    7.950021] PID = 72
> [    7.950021]  save_stack_trace+0x1b/0x20
> [    7.950021]  kasan_slab_free+0xb0/0x180
> [    7.950021]  kfree+0x9f/0x1d0
> [    7.950021]  bdi_put+0x2a/0x30
> [    7.950021]  blk_release_queue+0x51/0x320
> [    7.950021]  kobject_put+0x154/0x430
> [    7.950021]  blk_put_queue+0x15/0x20
> [    7.950021]  scsi_device_dev_release_usercontext+0x59b/0x880 [scsi_mod]
> [    7.950021]  execute_in_process_context+0xd9/0x130
> [    7.950021]  scsi_device_dev_release+0x1c/0x20 [scsi_mod]
> [    7.950021]  device_release+0x76/0x1e0
> [    7.950021]  kobject_put+0x154/0x430
> [    7.950021]  put_device+0x17/0x20
> [    7.950021]  __scsi_remove_device+0x18d/0x250 [scsi_mod]
> [    7.950021]  scsi_probe_and_add_lun+0x14d6/0x2250 [scsi_mod]
> [    7.950021]  __scsi_scan_target+0x23d/0xb60 [scsi_mod]
> [    7.950021]  scsi_scan_channel+0x107/0x160 [scsi_mod]
> [    7.950021]  scsi_scan_host_selected+0x1bf/0x250 [scsi_mod]
> [    7.950021]  do_scsi_scan_host+0x1bc/0x250 [scsi_mod]
> [    7.950021]  do_scan_async+0x41/0x4b0 [scsi_mod]
> [    7.950021]  async_run_entry_fn+0xc3/0x630
> [    7.950021]  process_one_work+0x531/0xf40
> [    7.950021]  worker_thread+0xe4/0x10d0
> [    7.950021]  kthread+0x298/0x390
> [    7.950021]  ret_from_fork+0x2c/0x40
> [    7.950021] Memory state around the buggy address:
> [    7.950021]  ffff88006a915d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [    7.950021]  ffff88006a915d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [    7.950021] >ffff88006a915e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [    7.950021]                                      ^
> [    7.950021]  ffff88006a915e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [    7.950021]  ffff88006a915f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> ----
> 
> Any ideas Jan?
> 
> Thanks.
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.
From: Jan Kara @ 2017-03-07 14:41 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Fengguang Wu, Jens Axboe, linux-block, linux-kernel,
	LKP, Martin K. Petersen, linux-scsi
In-Reply-To: <1488821142.25848.27.camel@linux.vnet.ibm.com>

On Mon 06-03-17 09:25:42, James Bottomley wrote:
> On Mon, 2017-03-06 at 17:13 +0100, Jan Kara wrote:
> > On Mon 06-03-17 07:44:55, James Bottomley wrote:
...
> > > > Sure. The call trace is:
> > > > 
> > > > [   41.919244] ------------[ cut here ]------------
> > > > [   41.919263] WARNING: CPU: 4 PID: 2335 at
> > > > drivers/scsi/sd.c:3332
> > > > sd_shutdown+0x2f/0x100
> > > > [   41.919268] Modules linked in: scsi_debug(+) netconsole loop
> > > > btrfs
> > > > raid6_pq zlib_deflate lzo_compress xor
> > > > [   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 4.11.0
> > > > -rc1
> > > > -xen+ #49
> > > > [   41.919325] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > > [   41.919331] Call Trace:
> > > > [   41.919343]  dump_stack+0x8e/0xf0
> > > > [   41.919354]  __warn+0x116/0x120
> > > > [   41.919361]  warn_slowpath_null+0x1d/0x20
> > > > [   41.919368]  sd_shutdown+0x2f/0x100
> > > > [   41.919374]  sd_remove+0x70/0xd0
> > > > 
> > > > *** Here is the unexpected step I guess...
> > > > 
> > > > [   41.919383]  driver_probe_device+0xe0/0x4c0
> > > 
> > > Exactly.  It's this, I think
> > > 
> > > 	bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE)
> > > &&
> > > 			   !drv->suppress_bind_attrs;
> > > 
> > > You have that config option set.
> > 
> > Yes - or better said 0-day testing has it set. Maybe that is not a 
> > sane default for 0-day tests? The option is explicitely marked as
> > "unstable"... Fengguang?
> > 
> > > So the drivers base layer is calling ->remove after probe and
> > > triggering the destruction of the queue.
> > > 
> > > What to do about this (apart from nuke such a stupid option) is
> > > somewhat more problematic.
> > 
> > I guess this is between you and Greg :).
> 
> Nice try, sport ... you qualify just behind Dan in the "not my problem"
> olympic hurdles event.  I'm annoyed because there's no indication in
> the log that the driver add behaviour is radically altered, so I've
> been staring at the wrong code for weeks.  However, the unbind/rebind
> should work, so the real issue is that your bdi changes (or perhaps
> something else in block) have induced a regression in the unbinding of
> upper layer drivers.  If you're going to release the bdi in
> del_gendisk, you have to have some mechanism for re-acquiring it on re
> -probe (likely with the same name) otherwise rebind is broken for every
> block driver.

So my patch does not release bdi in del_gendisk(). Bdi has two
initialization / destruction phases (similarly to request queue). You
allocate and initialize bdi through bdi_init(), then you call
bdi_register() to register it (which happens in device_add_disk()). On
shutdown you have to first call bdi_unregister() (used to be called from
blk_cleanup_queue(), my patch moved it to del_gendisk()). After that the
last reference to bdi may be dropped which does final bdi destruction.

So do I understand correctly that SCSI may call device_add_disk(),
del_gendisk() repeatedly for the same request queue? If yes, then indeed I
have a bug to fix... But gendisk seems to get allocated from scratch on
each probe so we don't call device_add_disk(), del_gendisk() more times on
the same disk, right?

> The fact that the second rebind tried with a different name indicates
> that sd_devt_release wasn't called, so some vestige of the devt remains
> on the subsequent rebind.

Yep, I guess that's caused by Dan's patch (commit 0dba1314d4f8 now) which
calls put_disk_devt() only in blk_cleanup_queue() which, if I understood
you correctly, does not get called during unbind-bind cycle? In fact Dan's
patch would end up leaking devt's because of repeated device_add_disk()
calls for the same request queue...

> Here's the problem: the queue belongs to SCSI (the lower layer), so it's
> not going to change because it doesn't see the release.  The gendisk and
> its allied stuff belongs to sd so it gets freed and re-created for the
> same queue.  Something in block is very confused when this happens.

Yep, I think the binding of request queue to different gendisks is
something I or Dan did not expect.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.
From: James Bottomley @ 2017-03-07 16:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Jens Axboe, linux-block, linux-kernel, LKP,
	Martin K. Petersen, linux-scsi
In-Reply-To: <20170307144109.GI2578@quack2.suse.cz>

On Tue, 2017-03-07 at 15:41 +0100, Jan Kara wrote:
> On Mon 06-03-17 09:25:42, James Bottomley wrote:
> > On Mon, 2017-03-06 at 17:13 +0100, Jan Kara wrote:
> > > On Mon 06-03-17 07:44:55, James Bottomley wrote:
> ...
> > > > > Sure. The call trace is:
> > > > > 
> > > > > [   41.919244] ------------[ cut here ]------------
> > > > > [   41.919263] WARNING: CPU: 4 PID: 2335 at
> > > > > drivers/scsi/sd.c:3332
> > > > > sd_shutdown+0x2f/0x100
> > > > > [   41.919268] Modules linked in: scsi_debug(+) netconsole 
> > > > > loop btrfs raid6_pq zlib_deflate lzo_compress xor
> > > > > [   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 
> > > > > 4.11.0-rc1-xen+ #49
> > > > > [   41.919325] Hardware name: Bochs Bochs, BIOS Bochs
> > > > > 01/01/2011
> > > > > [   41.919331] Call Trace:
> > > > > [   41.919343]  dump_stack+0x8e/0xf0
> > > > > [   41.919354]  __warn+0x116/0x120
> > > > > [   41.919361]  warn_slowpath_null+0x1d/0x20
> > > > > [   41.919368]  sd_shutdown+0x2f/0x100
> > > > > [   41.919374]  sd_remove+0x70/0xd0
> > > > > 
> > > > > *** Here is the unexpected step I guess...
> > > > > 
> > > > > [   41.919383]  driver_probe_device+0xe0/0x4c0
> > > > 
> > > > Exactly.  It's this, I think
> > > > 
> > > > 	bool test_remove =
> > > > IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE)
> > > > &&
> > > > 			   !drv->suppress_bind_attrs;
> > > > 
> > > > You have that config option set.
> > > 
> > > Yes - or better said 0-day testing has it set. Maybe that is not 
> > > a sane default for 0-day tests? The option is explicitely marked 
> > > as "unstable"... Fengguang?
> > > 
> > > > So the drivers base layer is calling ->remove after probe and
> > > > triggering the destruction of the queue.
> > > > 
> > > > What to do about this (apart from nuke such a stupid option) is
> > > > somewhat more problematic.
> > > 
> > > I guess this is between you and Greg :).
> > 
> > Nice try, sport ... you qualify just behind Dan in the "not my 
> > problem" olympic hurdles event.  I'm annoyed because there's no 
> > indication in the log that the driver add behaviour is radically 
> > altered, so I've been staring at the wrong code for weeks. 
> >  However, the unbind/rebind should work, so the real issue is that 
> > your bdi changes (or perhaps something else in block) have induced 
> > a regression in the unbinding of upper layer drivers.  If you're 
> > going to release the bdi in del_gendisk, you have to have some 
> > mechanism for re-acquiring it on re-probe (likely with the same 
> > name) otherwise rebind is broken for every block driver.
> 
> So my patch does not release bdi in del_gendisk(). Bdi has two
> initialization / destruction phases (similarly to request queue). You
> allocate and initialize bdi through bdi_init(), then you call
> bdi_register() to register it (which happens in device_add_disk()). 
> On shutdown you have to first call bdi_unregister() (used to be 
> called from blk_cleanup_queue(), my patch moved it to del_gendisk()). 
> After that the last reference to bdi may be dropped which does final 
> bdi destruction.
> 
> So do I understand correctly that SCSI may call device_add_disk(),
> del_gendisk() repeatedly for the same request queue?

Yes.  The upper drivers (sd, sr, st and sg) follow a device model. 
 They can thus be bound and unbound many times during the lifetime of a
SCSI device.

>  If yes, then indeed I have a bug to fix... But gendisk seems to get 
> allocated from scratch on each probe so we don't call 
> device_add_disk(), del_gendisk() more times on the same disk, right?

Right, gendisk, being a generic representation of a disk is a property
of the upper layer drivers.  We actually cheat and use it in all of
them (including the apparent character ones, they use alloc_disk,
put_disk but not add_disk).  So it has to be freed when the driver is
unbound and reallocated when it is bound.  It's the fundamental entity
which embeds the SCSI upper layer driver lifetime.

> > The fact that the second rebind tried with a different name 
> > indicates that sd_devt_release wasn't called, so some vestige of 
> > the devt remains on the subsequent rebind.
> 
> Yep, I guess that's caused by Dan's patch (commit 0dba1314d4f8 now) 
> which calls put_disk_devt() only in blk_cleanup_queue() which, if I
> understood you correctly, does not get called during unbind-bind 
> cycle? In fact Dan's patch would end up leaking devt's because of 
> repeated device_add_disk() calls for the same request queue...

That's a bit unfortunate.

> > Here's the problem: the queue belongs to SCSI (the lower layer), so 
> > it's not going to change because it doesn't see the release.  The
> > gendisk and its allied stuff belongs to sd so it gets freed and re
> > -created for the same queue.  Something in block is very confused
> > when this happens.
> 
> Yep, I think the binding of request queue to different gendisks is
> something I or Dan did not expect.

OK, so I think we now understand why this is happening ... the question
is what we do about it?  Is it fixable or do we need to go back to
basics?  The fundamental property which got broken by these patches is
the idea of the separation of the lifecycles of the queue and the
gendisk.  The only requirement is that the queue must exist before the
gendisk is created and must be destroyed after it.  This doesn't forbid
there being many gendisk lifetimes over one queue lifetime and this is
what we rely on for the SCSI driver model.  We rarely get called on it
because unbind/rebind of a ULD is rarely done in production (it's
mostly a devtest thing).

James

^ permalink raw reply

* Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.
From: Jan Kara @ 2017-03-07 16:26 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Fengguang Wu, Jens Axboe, linux-block, linux-kernel,
	LKP, Martin K. Petersen, linux-scsi
In-Reply-To: <1488903029.2363.16.camel@linux.vnet.ibm.com>

On Tue 07-03-17 08:10:29, James Bottomley wrote:
> On Tue, 2017-03-07 at 15:41 +0100, Jan Kara wrote:
> > On Mon 06-03-17 09:25:42, James Bottomley wrote:
> > > On Mon, 2017-03-06 at 17:13 +0100, Jan Kara wrote:
> > > > On Mon 06-03-17 07:44:55, James Bottomley wrote:
> > ...
> > > > > > Sure. The call trace is:
> > > > > > 
> > > > > > [   41.919244] ------------[ cut here ]------------
> > > > > > [   41.919263] WARNING: CPU: 4 PID: 2335 at
> > > > > > drivers/scsi/sd.c:3332
> > > > > > sd_shutdown+0x2f/0x100
> > > > > > [   41.919268] Modules linked in: scsi_debug(+) netconsole 
> > > > > > loop btrfs raid6_pq zlib_deflate lzo_compress xor
> > > > > > [   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 
> > > > > > 4.11.0-rc1-xen+ #49
> > > > > > [   41.919325] Hardware name: Bochs Bochs, BIOS Bochs
> > > > > > 01/01/2011
> > > > > > [   41.919331] Call Trace:
> > > > > > [   41.919343]  dump_stack+0x8e/0xf0
> > > > > > [   41.919354]  __warn+0x116/0x120
> > > > > > [   41.919361]  warn_slowpath_null+0x1d/0x20
> > > > > > [   41.919368]  sd_shutdown+0x2f/0x100
> > > > > > [   41.919374]  sd_remove+0x70/0xd0
> > > > > > 
> > > > > > *** Here is the unexpected step I guess...
> > > > > > 
> > > > > > [   41.919383]  driver_probe_device+0xe0/0x4c0
> > > > > 
> > > > > Exactly.  It's this, I think
> > > > > 
> > > > > 	bool test_remove =
> > > > > IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE)
> > > > > &&
> > > > > 			   !drv->suppress_bind_attrs;
> > > > > 
> > > > > You have that config option set.
> > > > 
> > > > Yes - or better said 0-day testing has it set. Maybe that is not 
> > > > a sane default for 0-day tests? The option is explicitely marked 
> > > > as "unstable"... Fengguang?
> > > > 
> > > > > So the drivers base layer is calling ->remove after probe and
> > > > > triggering the destruction of the queue.
> > > > > 
> > > > > What to do about this (apart from nuke such a stupid option) is
> > > > > somewhat more problematic.
> > > > 
> > > > I guess this is between you and Greg :).
> > > 
> > > Nice try, sport ... you qualify just behind Dan in the "not my 
> > > problem" olympic hurdles event.  I'm annoyed because there's no 
> > > indication in the log that the driver add behaviour is radically 
> > > altered, so I've been staring at the wrong code for weeks. 
> > >  However, the unbind/rebind should work, so the real issue is that 
> > > your bdi changes (or perhaps something else in block) have induced 
> > > a regression in the unbinding of upper layer drivers.  If you're 
> > > going to release the bdi in del_gendisk, you have to have some 
> > > mechanism for re-acquiring it on re-probe (likely with the same 
> > > name) otherwise rebind is broken for every block driver.
> > 
> > So my patch does not release bdi in del_gendisk(). Bdi has two
> > initialization / destruction phases (similarly to request queue). You
> > allocate and initialize bdi through bdi_init(), then you call
> > bdi_register() to register it (which happens in device_add_disk()). 
> > On shutdown you have to first call bdi_unregister() (used to be 
> > called from blk_cleanup_queue(), my patch moved it to del_gendisk()). 
> > After that the last reference to bdi may be dropped which does final 
> > bdi destruction.
> > 
> > So do I understand correctly that SCSI may call device_add_disk(),
> > del_gendisk() repeatedly for the same request queue?
> 
> Yes.  The upper drivers (sd, sr, st and sg) follow a device model. 
>  They can thus be bound and unbound many times during the lifetime of a
> SCSI device.
> 
> >  If yes, then indeed I have a bug to fix... But gendisk seems to get 
> > allocated from scratch on each probe so we don't call 
> > device_add_disk(), del_gendisk() more times on the same disk, right?
> 
> Right, gendisk, being a generic representation of a disk is a property
> of the upper layer drivers.  We actually cheat and use it in all of
> them (including the apparent character ones, they use alloc_disk,
> put_disk but not add_disk).  So it has to be freed when the driver is
> unbound and reallocated when it is bound.  It's the fundamental entity
> which embeds the SCSI upper layer driver lifetime.
> 
> > > The fact that the second rebind tried with a different name 
> > > indicates that sd_devt_release wasn't called, so some vestige of 
> > > the devt remains on the subsequent rebind.
> > 
> > Yep, I guess that's caused by Dan's patch (commit 0dba1314d4f8 now) 
> > which calls put_disk_devt() only in blk_cleanup_queue() which, if I
> > understood you correctly, does not get called during unbind-bind 
> > cycle? In fact Dan's patch would end up leaking devt's because of 
> > repeated device_add_disk() calls for the same request queue...
> 
> That's a bit unfortunate.
> 
> > > Here's the problem: the queue belongs to SCSI (the lower layer), so 
> > > it's not going to change because it doesn't see the release.  The
> > > gendisk and its allied stuff belongs to sd so it gets freed and re
> > > -created for the same queue.  Something in block is very confused
> > > when this happens.
> > 
> > Yep, I think the binding of request queue to different gendisks is
> > something I or Dan did not expect.
> 
> OK, so I think we now understand why this is happening ... the question
> is what we do about it?  Is it fixable or do we need to go back to
> basics?  The fundamental property which got broken by these patches is
> the idea of the separation of the lifecycles of the queue and the
> gendisk.  The only requirement is that the queue must exist before the
> gendisk is created and must be destroyed after it.  This doesn't forbid
> there being many gendisk lifetimes over one queue lifetime and this is
> what we rely on for the SCSI driver model.  We rarely get called on it
> because unbind/rebind of a ULD is rarely done in production (it's
> mostly a devtest thing).

For now I think it's fixable. Let's see what I'll be able to come up with.
Let me also note that before my patch moving bdi_unregister() to
del_gendisk() (commit 165a5e22fafb), the SCSI driver model was buggy in a
different way - because bdi_unregister() was called only from
blk_cleanup_queue(), we were leaking bdi->owner reference to gendisk by
repeated calls to bdi_register_owner() from device_add_disk() and
thus gendisks were never freed after unbind/bind (I've actually verified
this bug exists with debug messages). Also as a result request queue use
count was steadily increasing with each new gendisk and request queue could
never get freed after unbind / rebind.

My patch silently fixed this leakage but introduced a different set of
problems which I'll try to fix.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 0/13 v2] block: Fix block device shutdown related races
From: Omar Sandoval @ 2017-03-07 16:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo, NeilBrown
In-Reply-To: <20170307135730.GG2578@quack2.suse.cz>

On Tue, Mar 07, 2017 at 02:57:30PM +0100, Jan Kara wrote:
> On Mon 06-03-17 12:38:18, Omar Sandoval wrote:
> > Unfortunately, this patch actually seems to have introduced a
> > regression. Reverting the patch fixes it.
> > 
> > I'm running a QEMU kvm vm with a virtio-scsi device and I get oopses
> > like this:
> 
> What workload do you run? Or is it enough to just boot the kvm with
> virtio-scsi enabled? Can you send me the kvm setup so that I can reproduce
> this?

This is just while booting. In fact, you don't even need a virtio-scsi
disk attached, it's enough to have the virtio-scsi-pci controller. This
is my exact command line:

qemu-system-x86_64 -nodefaults -nographic -serial mon:stdio -cpu kvm64 \
	-enable-kvm -smp 1 -m 2G -watchdog i6300esb \
	-netdev user,id=vlan0,hostfwd=tcp:127.0.0.1:2222-:22 \
	-device virtio-net,netdev=vlan0 \
	-drive file=Silver/Silver.qcow2,index=0,media=disk,if=virtio,cache=none \
	-device virtio-scsi-pci \
	-kernel /home/osandov/linux/builds/virtio-scsi-crash/arch/x86/boot/bzImage \
	-virtfs local,path=/home/osandov/linux/builds/virtio-scsi-crash,security_model=none,readonly,mount_tag=modules \
	-append 'root=/dev/vda1 console=ttyS0,115200 scsi_mod.use_blk_mq=y'

My kernel config is here:
https://github.com/osandov/osandov-linux/blob/master/configs/qemu.config

> At least the KASAN error could be a result of the situation where someone
> called bdi_register() but didn't call bdi_unregister() before dropping the
> last bdi reference (which would make sense given I've moved
> bdi_unregister() call). However I'd expect a warning from bdi_exit() in that
> case and I don't see that in your kernel log. So I'll try to reproduce the
> issue and debug more.
> 
> 								Honza

Thanks for taking a look!

^ permalink raw reply

* Re: [PATCH RFC 10/14] block, bfq: add Early Queue Merge (EQM)
From: Jens Axboe @ 2017-03-07 17:44 UTC (permalink / raw)
  To: Paolo Valente, Tejun Heo
  Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
	ulf.hansson, linus.walleij, broonie, Mauro Andreolini
In-Reply-To: <20170304160131.57366-11-paolo.valente@linaro.org>

On 03/04/2017 09:01 AM, Paolo Valente wrote:
> @@ -560,6 +600,15 @@ struct bfq_data {
>  	struct bfq_io_cq *bio_bic;
>  	/* bfqq associated with the task issuing current bio for merging */
>  	struct bfq_queue *bio_bfqq;
> +
> +	/*
> +	 * io context to put right after bfqd->lock is released. This
> +	 * filed is used to perform put_io_context, when needed, to
> +	 * after the scheduler lock has been released, and thus
> +	 * prevent an ioc->lock from being possibly taken while the
> +	 * scheduler lock is being held.
> +	 */
> +	struct io_context *ioc_to_put;
>  };

The logic around this is nasty, effectively you end up having locking
around sections of code instea of structures, which is never a good
idea.

The helper functions for unlocking and dropping the ioc add to the mess
as well.

Can't we simply pass back a pointer to an ioc to free? That should be
possible, given that we must have grabbed the bfqd lock ourselves
further up in the call chain. So we _know_ that we'll drop it later on.
If that wasn't the case, the existing logic wouldn't work.

-- 
Jens Axboe

^ permalink raw reply

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
From: Tejun Heo @ 2017-03-07 19:43 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Dan Williams,
	Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown, Omar Sandoval
In-Reply-To: <20170307111420.GF2578@quack2.suse.cz>

Hello,

On Tue, Mar 07, 2017 at 12:14:20PM +0100, Jan Kara wrote:
> > Given that this isn't a super hot path, I think it'd be far better to
> > stick to a simpler synchronization mechanism.
> 
> I'd be happy to do so however struct gendisk does not have any lock in it
> and introducing a special lock just for that seems awkward.

I think even a awkward shared lock is better than memory barriers.
Barriers are the trickiest locking construct to get right and we
really shouldn't using that unless absolutely necessary.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler
From: Jens Axboe @ 2017-03-07 23:22 UTC (permalink / raw)
  To: Paolo Valente, Tejun Heo
  Cc: Fabio Checconi, Arianna Avanzini, linux-block, linux-kernel,
	ulf.hansson, linus.walleij, broonie
In-Reply-To: <20170304160131.57366-2-paolo.valente@linaro.org>

> +/**
> + * bfq_entity_of - get an entity from a node.
> + * @node: the node field of the entity.
> + *
> + * Convert a node pointer to the relative entity.  This is used only
> + * to simplify the logic of some functions and not as the generic
> + * conversion mechanism because, e.g., in the tree walking functions,
> + * the check for a %NULL value would be redundant.
> + */
> +static struct bfq_entity *bfq_entity_of(struct rb_node *node)
> +{
> +	struct bfq_entity *entity = NULL;
> +
> +	if (node)
> +		entity = rb_entry(node, struct bfq_entity, rb_node);
> +
> +	return entity;
> +}

Get rid of pointless wrappers like this, just use rb_entry() in the
caller. It's harmful to the readability of the code to have to lookup
things like this, if it's a list_entry or rb_entry() in the caller you
know exactly what is going on immediately.

-- 
Jens Axboe

^ permalink raw reply

* [PATCH] don't forget to call pd_online_fn when activate policy
From: Zhou Chengming @ 2017-03-08  4:09 UTC (permalink / raw)
  To: axboe; +Cc: linux-block, linux-kernel, miaoxie, zhouchengming1

From: z00354408 <z00354408@notesmail.huawei.com>

Signed-off-by: z00354408 <z00354408@notesmail.huawei.com>
---
 block/blk-cgroup.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 8ba0af7..0dd9e76 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1254,6 +1254,12 @@ int blkcg_activate_policy(struct request_queue *q,
 		pd->plid = pol->plid;
 		if (pol->pd_init_fn)
 			pol->pd_init_fn(pd);
+
+		if (pol->pd_online_fn) {
+			spin_lock(blkg->blkcg->lock);
+			pol->pd_online_fn(pd);
+			spin_unlock(blkg->blkcg->lock);
+		}
 	}
 
 	__set_bit(pol->plid, q->blkcg_pols);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Minchan Kim @ 2017-03-08  5:11 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Hannes Reinecke, Jens Axboe, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, yizhan, Linux Block Layer Mailinglist,
	Linux Kernel Mailinglist
In-Reply-To: <10a2335c-0ed0-43de-1cbd-625845301aef@suse.de>

Hi Johannes,

On Tue, Mar 07, 2017 at 10:51:45AM +0100, Johannes Thumshirn wrote:
> On 03/07/2017 09:55 AM, Minchan Kim wrote:
> > On Tue, Mar 07, 2017 at 08:48:06AM +0100, Hannes Reinecke wrote:
> >> On 03/07/2017 08:23 AM, Minchan Kim wrote:
> >>> Hi Hannes,
> >>>
> >>> On Tue, Mar 7, 2017 at 4:00 PM, Hannes Reinecke <hare@suse.de> wrote:
> >>>> On 03/07/2017 06:22 AM, Minchan Kim wrote:
> >>>>> Hello Johannes,
> >>>>>
> >>>>> On Mon, Mar 06, 2017 at 11:23:35AM +0100, Johannes Thumshirn wrote:
> >>>>>> zram can handle at most SECTORS_PER_PAGE sectors in a bio's bvec. When using
> >>>>>> the NVMe over Fabrics loopback target which potentially sends a huge bulk of
> >>>>>> pages attached to the bio's bvec this results in a kernel panic because of
> >>>>>> array out of bounds accesses in zram_decompress_page().
> >>>>>
> >>>>> First of all, thanks for the report and fix up!
> >>>>> Unfortunately, I'm not familiar with that interface of block layer.
> >>>>>
> >>>>> It seems this is a material for stable so I want to understand it clear.
> >>>>> Could you say more specific things to educate me?
> >>>>>
> >>>>> What scenario/When/How it is problem?  It will help for me to understand!
> >>>>>
> >>>
> >>> Thanks for the quick response!
> >>>
> >>>> The problem is that zram as it currently stands can only handle bios
> >>>> where each bvec contains a single page (or, to be precise, a chunk of
> >>>> data with a length of a page).
> >>>
> >>> Right.
> >>>
> >>>>
> >>>> This is not an automatic guarantee from the block layer (who is free to
> >>>> send us bios with arbitrary-sized bvecs), so we need to set the queue
> >>>> limits to ensure that.
> >>>
> >>> What does it mean "bios with arbitrary-sized bvecs"?
> >>> What kinds of scenario is it used/useful?
> >>>
> >> Each bio contains a list of bvecs, each of which points to a specific
> >> memory area:
> >>
> >> struct bio_vec {
> >> 	struct page	*bv_page;
> >> 	unsigned int	bv_len;
> >> 	unsigned int	bv_offset;
> >> };
> >>
> >> The trick now is that while 'bv_page' does point to a page, the memory
> >> area pointed to might in fact be contiguous (if several pages are
> >> adjacent). Hence we might be getting a bio_vec where bv_len is _larger_
> >> than a page.
> > 
> > Thanks for detail, Hannes!
> > 
> > If I understand it correctly, it seems to be related to bid_add_page
> > with high-order page. Right?
> > 
> > If so, I really wonder why I don't see such problem because several
> > places have used it and I expected some of them might do IO with
> > contiguous pages intentionally or by chance. Hmm,
> > 
> > IIUC, it's not a nvme specific problme but general problem which
> > can trigger normal FSes if they uses contiguos pages?
> > 
> 
> I'm not a FS expert, but a quick grep shows that non of the file-systems
> does the
> 
> for_each_sg()
> 	while(bio_add_page()))
> 
> trick NVMe does.

Aah, I see.

> 
> >>
> >> Hence the check for 'is_partial_io' in zram_drv.c (which just does a
> >> test 'if bv_len != PAGE_SIZE) is in fact wrong, as it would trigger for
> >> partial I/O (ie if the overall length of the bio_vec is _smaller_ than a
> >> page), but also for multipage bvecs (where the length of the bio_vec is
> >> _larger_ than a page).
> > 
> > Right. I need to look into that. Thanks for the pointing out!
> > 
> >>
> >> So rather than fixing the bio scanning loop in zram it's easier to set
> >> the queue limits correctly so that 'is_partial_io' does the correct
> >> thing and the overall logic in zram doesn't need to be altered.
> > 
> > 
> > Isn't that approach require new bio allocation through blk_queue_split?
> > Maybe, it wouldn't make severe regression in zram-FS workload but need
> > to test.
> 
> Yes, but blk_queue_split() needs information how big the bvecs can be,
> hence the patch.
> 
> For details have a look into blk_bio_segment_split() in block/blk-merge.c
> 
> It get's the max_sectors from blk_max_size_offset() which is
> q->limits.max_sectors when q->limits.chunk_sectors isn't set and
> then loops over the bio's bvecs to check when to split the bio and then
> calls bio_split() when appropriate.

Yeb so it causes split bio which means new bio allocations which was
not needed before.

> 
> > 
> > Is there any ways to trigger the problem without real nvme device?
> > It would really help to test/measure zram.
> 
> It isn't a /real/ device but the fabrics loopback target. If you want a
> fast reproducible test-case, have a look at:
> 
> https://github.com/ddiss/rapido/
> the cut_nvme_local.sh script set's up the correct VM for this test. Then
> a simple mkfs.xfs /dev/nvme0n1 will oops.

Thanks! I will look into that.

And could you test this patch? It avoids split bio so no need new bio
allocations and makes zram code simple.

>From f778d7564d5cd772f25bb181329362c29548a257 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Wed, 8 Mar 2017 13:35:29 +0900
Subject: [PATCH] fix

Not-yet-Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 40 ++++++++++++++--------------------------
 1 file changed, 14 insertions(+), 26 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bcb03bacdded..516c3bd97a28 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -147,8 +147,7 @@ static inline bool valid_io_request(struct zram *zram,
 
 static void update_position(u32 *index, int *offset, struct bio_vec *bvec)
 {
-	if (*offset + bvec->bv_len >= PAGE_SIZE)
-		(*index)++;
+	*index  += (*offset + bvec->bv_len) / PAGE_SIZE;
 	*offset = (*offset + bvec->bv_len) % PAGE_SIZE;
 }
 
@@ -886,7 +885,7 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
 {
 	int offset;
 	u32 index;
-	struct bio_vec bvec;
+	struct bio_vec bvec, bv;
 	struct bvec_iter iter;
 
 	index = bio->bi_iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
@@ -900,34 +899,23 @@ static void __zram_make_request(struct zram *zram, struct bio *bio)
 	}
 
 	bio_for_each_segment(bvec, bio, iter) {
-		int max_transfer_size = PAGE_SIZE - offset;
+		int remained_size = bvec.bv_len;
+		int transfer_size;
 
-		if (bvec.bv_len > max_transfer_size) {
-			/*
-			 * zram_bvec_rw() can only make operation on a single
-			 * zram page. Split the bio vector.
-			 */
-			struct bio_vec bv;
-
-			bv.bv_page = bvec.bv_page;
-			bv.bv_len = max_transfer_size;
-			bv.bv_offset = bvec.bv_offset;
+		bv.bv_page = bvec.bv_page;
+		bv.bv_offset = bvec.bv_offset;
+		do {
+			transfer_size = min_t(int, PAGE_SIZE, remained_size);
+			bv.bv_len = transfer_size;
 
 			if (zram_bvec_rw(zram, &bv, index, offset,
-					 op_is_write(bio_op(bio))) < 0)
-				goto out;
-
-			bv.bv_len = bvec.bv_len - max_transfer_size;
-			bv.bv_offset += max_transfer_size;
-			if (zram_bvec_rw(zram, &bv, index + 1, 0,
-					 op_is_write(bio_op(bio))) < 0)
-				goto out;
-		} else
-			if (zram_bvec_rw(zram, &bvec, index, offset,
-					 op_is_write(bio_op(bio))) < 0)
+					op_is_write(bio_op(bio))) < 0)
 				goto out;
 
-		update_position(&index, &offset, &bvec);
+			bv.bv_offset += transfer_size;
+			update_position(&index, &offset, &bv);
+			remained_size -= transfer_size;
+		} while (remained_size);
 	}
 
 	bio_endio(bio);
-- 
2.7.4


> 
> > 
> > Anyway, to me, it's really subtle at this moment so I doubt it should
> > be stable material. :(
> 
> I'm not quite sure, it's at least 4.11 material. See above.

Absolutely agree that it should be 4.11 material but I don't want to
backport it to the stable because it would make regression due to
split bio works.

Anyway, if my patch I attached works for you, I will resend this
with modified descriptions includes more detail.

Thanks for the help!!

^ permalink raw reply related

* Re: [PATCH 5/8] nowait aio: return on congested block device
From: Sagi Grimberg @ 2017-03-08  7:03 UTC (permalink / raw)
  To: Goldwyn Rodrigues, jack
  Cc: hch, linux-fsdevel, linux-block, linux-btrfs, linux-ext4,
	linux-xfs, Goldwyn Rodrigues
In-Reply-To: <20170228233610.25456-6-rgoldwyn@suse.de>


> -		if (likely(blk_queue_enter(q, false) == 0)) {
> +		if (likely(blk_queue_enter(q, bio_flagged(bio, BIO_NOWAIT)) == 0)) {
>  			ret = q->make_request_fn(q, bio);

I think that for ->make_request to not block we'd need to set
BLK_MQ_REQ_NOWAIT in blk_mq_alloc_data to avoid blocking on a tag
allocation.

Something like the untested addition below:
--
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 09af8ff18719..40e78b57fb44 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -119,6 +119,9 @@ struct request *blk_mq_sched_get_request(struct 
request_queue *q,
         if (likely(!data->hctx))
                 data->hctx = blk_mq_map_queue(q, data->ctx->cpu);

+       if (likely(bio) && bio_flagged(bio, BIO_NOWAIT))
+               data->flags |= BLK_MQ_REQ_NOWAIT;
+
         if (e) {
                 data->flags |= BLK_MQ_REQ_INTERNAL;
--

^ permalink raw reply related

* Re: [PATCH] zram: set physical queue limits to avoid array out of bounds accesses
From: Johannes Thumshirn @ 2017-03-08  7:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Hannes Reinecke, Jens Axboe, Nitin Gupta, Christoph Hellwig,
	Sergey Senozhatsky, yizhan, Linux Block Layer Mailinglist,
	Linux Kernel Mailinglist
In-Reply-To: <20170308051118.GA11206@bbox>

On 03/08/2017 06:11 AM, Minchan Kim wrote:
> And could you test this patch? It avoids split bio so no need new bio
> allocations and makes zram code simple.
> 
> From f778d7564d5cd772f25bb181329362c29548a257 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Wed, 8 Mar 2017 13:35:29 +0900
> Subject: [PATCH] fix
> 
> Not-yet-Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---

[...]

Yup, this works here.

I did a mkfs.xfs /dev/nvme0n1
dd if=/dev/urandom of=/test.bin bs=1M count=128
sha256sum test.bin
mount /dev/nvme0n1 /dir
mv test.bin /dir/
sha256sum /dir/test.bin

No panics and sha256sum of the 128MB test file still matches

Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

Now that you removed the one page limit in zram_bvec_rw() you can also
add this hunk to remove the queue splitting:

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 85f4df8..27b168f6 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -868,8 +868,6 @@ static blk_qc_t zram_make_request(struct
request_queue *queue, struct bio *bio)
 {
        struct zram *zram = queue->queuedata;

-       blk_queue_split(queue, &bio, queue->bio_split);
-
        if (!valid_io_request(zram, bio->bi_iter.bi_sector,
                                        bio->bi_iter.bi_size)) {
                atomic64_inc(&zram->stats.invalid_io);

Byte,
	Johannes

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nï¿½rnberg
GF: Felix Imendï¿½rffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nï¿½rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply related

* BIO request larger than our storage device supports in linux kernel 4.x
From: Umesh Patel @ 2017-03-08  9:34 UTC (permalink / raw)
  To: linux-block@vger.kernel.org; +Cc: Umesh Patel
In-Reply-To: <CY1PR0401MB13538108F3DFD6738AFC8A34E12E0@CY1PR0401MB1353.namprd04.prod.outlook.com>

Hello,

We are registering BIO size of our storage device through linux kernel API =
blk_queue_max_hw_sectors worth of 104K. So max size of BIO that our block d=
evice can handle is 104 KB but kernel block layer is sending 256 KB worth o=
f BIO which is more than we are supporting.
This issue we are observing in 4.4 and also in latest 4.10.1 kernel release=
. Until this we had not seen this issue.

>From the kernel source code history I came to below things.
4.1 kernel version onwards=20
"nr_pages =3D min(sdio->pages_in_io, bio_get_nr_vecs(map_bh->b_bdev))"
has been removed which (bio_get_nr_vecs) was considering max_sectors_kb of =
queue which was 104 KB for our device.

Presently new code is something like
"nr_pages =3D min(sdio->pages_in_io, BIO_MAX_PAGES)"
which is sending 256 KB (BIO_MAX_PAGES) of BIO size to our device which we =
are not supporting.

So from some documentation I found out that 256 KB of bio size is the fix (=
from 4.x kernel) and that has to support by any drive.=20
Please let me know is there any way to change BIO size worth of 104 KB from=
 256 KB or any other way to register our BIO size with kernel or any other =
area to look in to ?.=20

-Umesh

^ permalink raw reply

* Re: [bdi_unregister] 165a5e22fa INFO: task swapper:1 blocked for more than 120 seconds.
From: Jan Kara @ 2017-03-08  9:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jan Kara, Fengguang Wu, Jens Axboe, linux-block, linux-kernel,
	LKP, Martin K. Petersen, linux-scsi
In-Reply-To: <20170307162656.GB11837@quack2.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 7412 bytes --]

On Tue 07-03-17 17:26:56, Jan Kara wrote:
> On Tue 07-03-17 08:10:29, James Bottomley wrote:
> > On Tue, 2017-03-07 at 15:41 +0100, Jan Kara wrote:
> > > On Mon 06-03-17 09:25:42, James Bottomley wrote:
> > > > On Mon, 2017-03-06 at 17:13 +0100, Jan Kara wrote:
> > > > > On Mon 06-03-17 07:44:55, James Bottomley wrote:
> > > ...
> > > > > > > Sure. The call trace is:
> > > > > > > 
> > > > > > > [   41.919244] ------------[ cut here ]------------
> > > > > > > [   41.919263] WARNING: CPU: 4 PID: 2335 at
> > > > > > > drivers/scsi/sd.c:3332
> > > > > > > sd_shutdown+0x2f/0x100
> > > > > > > [   41.919268] Modules linked in: scsi_debug(+) netconsole 
> > > > > > > loop btrfs raid6_pq zlib_deflate lzo_compress xor
> > > > > > > [   41.919319] CPU: 4 PID: 2335 Comm: modprobe Not tainted 
> > > > > > > 4.11.0-rc1-xen+ #49
> > > > > > > [   41.919325] Hardware name: Bochs Bochs, BIOS Bochs
> > > > > > > 01/01/2011
> > > > > > > [   41.919331] Call Trace:
> > > > > > > [   41.919343]  dump_stack+0x8e/0xf0
> > > > > > > [   41.919354]  __warn+0x116/0x120
> > > > > > > [   41.919361]  warn_slowpath_null+0x1d/0x20
> > > > > > > [   41.919368]  sd_shutdown+0x2f/0x100
> > > > > > > [   41.919374]  sd_remove+0x70/0xd0
> > > > > > > 
> > > > > > > *** Here is the unexpected step I guess...
> > > > > > > 
> > > > > > > [   41.919383]  driver_probe_device+0xe0/0x4c0
> > > > > > 
> > > > > > Exactly.  It's this, I think
> > > > > > 
> > > > > > 	bool test_remove =
> > > > > > IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE)
> > > > > > &&
> > > > > > 			   !drv->suppress_bind_attrs;
> > > > > > 
> > > > > > You have that config option set.
> > > > > 
> > > > > Yes - or better said 0-day testing has it set. Maybe that is not 
> > > > > a sane default for 0-day tests? The option is explicitely marked 
> > > > > as "unstable"... Fengguang?
> > > > > 
> > > > > > So the drivers base layer is calling ->remove after probe and
> > > > > > triggering the destruction of the queue.
> > > > > > 
> > > > > > What to do about this (apart from nuke such a stupid option) is
> > > > > > somewhat more problematic.
> > > > > 
> > > > > I guess this is between you and Greg :).
> > > > 
> > > > Nice try, sport ... you qualify just behind Dan in the "not my 
> > > > problem" olympic hurdles event.  I'm annoyed because there's no 
> > > > indication in the log that the driver add behaviour is radically 
> > > > altered, so I've been staring at the wrong code for weeks. 
> > > >  However, the unbind/rebind should work, so the real issue is that 
> > > > your bdi changes (or perhaps something else in block) have induced 
> > > > a regression in the unbinding of upper layer drivers.  If you're 
> > > > going to release the bdi in del_gendisk, you have to have some 
> > > > mechanism for re-acquiring it on re-probe (likely with the same 
> > > > name) otherwise rebind is broken for every block driver.
> > > 
> > > So my patch does not release bdi in del_gendisk(). Bdi has two
> > > initialization / destruction phases (similarly to request queue). You
> > > allocate and initialize bdi through bdi_init(), then you call
> > > bdi_register() to register it (which happens in device_add_disk()). 
> > > On shutdown you have to first call bdi_unregister() (used to be 
> > > called from blk_cleanup_queue(), my patch moved it to del_gendisk()). 
> > > After that the last reference to bdi may be dropped which does final 
> > > bdi destruction.
> > > 
> > > So do I understand correctly that SCSI may call device_add_disk(),
> > > del_gendisk() repeatedly for the same request queue?
> > 
> > Yes.  The upper drivers (sd, sr, st and sg) follow a device model. 
> >  They can thus be bound and unbound many times during the lifetime of a
> > SCSI device.
> > 
> > >  If yes, then indeed I have a bug to fix... But gendisk seems to get 
> > > allocated from scratch on each probe so we don't call 
> > > device_add_disk(), del_gendisk() more times on the same disk, right?
> > 
> > Right, gendisk, being a generic representation of a disk is a property
> > of the upper layer drivers.  We actually cheat and use it in all of
> > them (including the apparent character ones, they use alloc_disk,
> > put_disk but not add_disk).  So it has to be freed when the driver is
> > unbound and reallocated when it is bound.  It's the fundamental entity
> > which embeds the SCSI upper layer driver lifetime.
> > 
> > > > The fact that the second rebind tried with a different name 
> > > > indicates that sd_devt_release wasn't called, so some vestige of 
> > > > the devt remains on the subsequent rebind.
> > > 
> > > Yep, I guess that's caused by Dan's patch (commit 0dba1314d4f8 now) 
> > > which calls put_disk_devt() only in blk_cleanup_queue() which, if I
> > > understood you correctly, does not get called during unbind-bind 
> > > cycle? In fact Dan's patch would end up leaking devt's because of 
> > > repeated device_add_disk() calls for the same request queue...
> > 
> > That's a bit unfortunate.
> > 
> > > > Here's the problem: the queue belongs to SCSI (the lower layer), so 
> > > > it's not going to change because it doesn't see the release.  The
> > > > gendisk and its allied stuff belongs to sd so it gets freed and re
> > > > -created for the same queue.  Something in block is very confused
> > > > when this happens.
> > > 
> > > Yep, I think the binding of request queue to different gendisks is
> > > something I or Dan did not expect.
> > 
> > OK, so I think we now understand why this is happening ... the question
> > is what we do about it?  Is it fixable or do we need to go back to
> > basics?  The fundamental property which got broken by these patches is
> > the idea of the separation of the lifecycles of the queue and the
> > gendisk.  The only requirement is that the queue must exist before the
> > gendisk is created and must be destroyed after it.  This doesn't forbid
> > there being many gendisk lifetimes over one queue lifetime and this is
> > what we rely on for the SCSI driver model.  We rarely get called on it
> > because unbind/rebind of a ULD is rarely done in production (it's
> > mostly a devtest thing).
> 
> For now I think it's fixable. Let's see what I'll be able to come up with.
> Let me also note that before my patch moving bdi_unregister() to
> del_gendisk() (commit 165a5e22fafb), the SCSI driver model was buggy in a
> different way - because bdi_unregister() was called only from
> blk_cleanup_queue(), we were leaking bdi->owner reference to gendisk by
> repeated calls to bdi_register_owner() from device_add_disk() and
> thus gendisks were never freed after unbind/bind (I've actually verified
> this bug exists with debug messages). Also as a result request queue use
> count was steadily increasing with each new gendisk and request queue could
> never get freed after unbind / rebind.
> 
> My patch silently fixed this leakage but introduced a different set of
> problems which I'll try to fix.

Attached patch fixes the problem for me. If I also revert commit
0dba1314d4f81115dce711292ec7981d17231064 "scsi, block: fix duplicate bdi
name registration crashes" which is not needed with my
165a5e22fafb127ecb5914e12e8c32a1f0d3f820 "block: Move bdi_unregister() to
del_gendisk()", the device names also do not get incremented and I end up
with 'sda' device (otherwise I end up with sdc).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-block-Allow-bdi-re-registration.patch --]
[-- Type: text/x-patch, Size: 1507 bytes --]

>From bfbccf28853d55d4418401d8eeb6b91c3050e316 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 7 Mar 2017 18:01:51 +0100
Subject: [PATCH] block: Allow bdi re-registration

SCSI can call device_add_disk() several times for one request queue when
a device in unbound and bound, creating new gendisk each time. This will
lead to bdi being repeatedly registered and unregistered. Make sure
bdi_unregister() cleans up bdi in such a way that it is prepared for
the following bdi register call.

An easy way to provoke this behavior is to enable
CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
scsi disk which immediately hangs without this fix.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6d861d090e9f..6ac932210f56 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -710,6 +710,11 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 	 */
 	atomic_dec(&bdi->usage_cnt);
 	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
+	/*
+	 * Grab back our reference so that we hold it when @bdi gets
+	 * re-registered.
+	 */
+	atomic_inc(&bdi->usage_cnt);
 }
 
 /**
@@ -857,6 +862,8 @@ int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
 			MINOR(owner->devt));
 	if (rc)
 		return rc;
+	/* Leaking owner reference... */
+	WARN_ON(bdi->owner);
 	bdi->owner = owner;
 	get_device(owner);
 	return 0;
-- 
2.10.2


^ permalink raw reply related

* Re: [Nbd] [PATCH 6/6] nbd: add a basic netlink interface
From: Wouter Verhelst @ 2017-03-08 10:07 UTC (permalink / raw)
  To: Josef Bacik; +Cc: axboe, nbd-general, linux-block, kernel-team
In-Reply-To: <1488301031-3199-7-git-send-email-jbacik@fb.com>

On Tue, Feb 28, 2017 at 11:57:11AM -0500, Josef Bacik wrote:
> The existing ioctl interface for configuring NBD devices is a bit
> cumbersome and hard to extend.  The other problem is we leave a
> userspace app sitting in it's syscall until the device disconnects,
> which is less than ideal.

True.

On the other hand, it has the advantage that you leave a userspace app
sitting around until the device disconnects, which allows for some form
of recovery in case you're doing root-on-NBD and the server has a
hiccup. Don't underestimate the advantage of that.

(of course, that requires that the return value of NBD_DO_IT makes a
difference between "unexpected connection drop" and "we sent
NBD_CMD_DISC", but that's a different matter entirely)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
       people in the world who think they really understand all of its rules,
       and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

^ permalink raw reply

* [PATCH v2] cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode
From: Hou Tao @ 2017-03-08 12:16 UTC (permalink / raw)
  To: axboe, linux-block; +Cc: jack, jmoyer, stable, vgoyal

When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
as the delay of cfq_group's vdisktime if there have been other cfq_groups
already.

When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
from jiffies to nanoseconds") could result in a large iops delay and
lead to an abnormal io schedule delay for the added cfq_group. To fix
it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
when iops mode is enabled.

Despite having the same value, the delay of a cfq_queue in idle class
and the delay of cfq_queue are different things, so I define two new
macros for the delay of a cfq_group under time-slice mode and IOPs mode.

Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
Cc: <stable@vger.kernel.org> # 4.8+
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 block/cfq-iosched.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

v2:
- use constant instead of "nsecs_to_jiffies64(CFQ_IDLE_DELAY)"
  as suggested by Jan Kara

v1:
https://www.spinics.net/lists/stable/msg160580.html

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 440b95e..69754e8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -38,9 +38,13 @@ static const u64 cfq_target_latency = (u64)NSEC_PER_SEC * 3/10; /* 300 ms */
 static const int cfq_hist_divisor = 4;
 
 /*
- * offset from end of service tree
+ * offset from end of queue service tree for idle class
  */
 #define CFQ_IDLE_DELAY		(NSEC_PER_SEC / 5)
+/* offset from end of group service tree under time slice mode */
+#define CFQ_SLICE_MODE_GROUP_DELAY (NSEC_PER_SEC / 5)
+/* offset from end of group service under IOPS mode */
+#define CFQ_IOPS_MODE_GROUP_DELAY (200)
 
 /*
  * below this threshold, we consider thinktime immediate
@@ -1362,6 +1366,14 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
 	cfqg->vfraction = max_t(unsigned, vfr, 1);
 }
 
+static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
+{
+	if (!iops_mode(cfqd))
+		return CFQ_SLICE_MODE_GROUP_DELAY;
+	else
+		return CFQ_IOPS_MODE_GROUP_DELAY;
+}
+
 static void
 cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
 {
@@ -1381,7 +1393,8 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
 	n = rb_last(&st->rb);
 	if (n) {
 		__cfqg = rb_entry_cfqg(n);
-		cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
+		cfqg->vdisktime = __cfqg->vdisktime +
+			cfq_get_cfqg_vdisktime_delay(cfqd);
 	} else
 		cfqg->vdisktime = st->min_vdisktime;
 	cfq_group_service_tree_add(st, cfqg);
-- 
2.5.0

^ permalink raw reply related

* Re: [PATCH 0/13 v2] block: Fix block device shutdown related races
From: Jan Kara @ 2017-03-08 13:21 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, Tejun Heo,
	NeilBrown
In-Reply-To: <20170307162823.GA20922@vader>

[-- Attachment #1: Type: text/plain, Size: 1540 bytes --]

On Tue 07-03-17 08:28:23, Omar Sandoval wrote:
> On Tue, Mar 07, 2017 at 02:57:30PM +0100, Jan Kara wrote:
> > On Mon 06-03-17 12:38:18, Omar Sandoval wrote:
> > > Unfortunately, this patch actually seems to have introduced a
> > > regression. Reverting the patch fixes it.
> > > 
> > > I'm running a QEMU kvm vm with a virtio-scsi device and I get oopses
> > > like this:
> > 
> > What workload do you run? Or is it enough to just boot the kvm with
> > virtio-scsi enabled? Can you send me the kvm setup so that I can reproduce
> > this?
> 
> This is just while booting. In fact, you don't even need a virtio-scsi
> disk attached, it's enough to have the virtio-scsi-pci controller. This
> is my exact command line:
> 
> qemu-system-x86_64 -nodefaults -nographic -serial mon:stdio -cpu kvm64 \
> 	-enable-kvm -smp 1 -m 2G -watchdog i6300esb \
> 	-netdev user,id=vlan0,hostfwd=tcp:127.0.0.1:2222-:22 \
> 	-device virtio-net,netdev=vlan0 \
> 	-drive file=Silver/Silver.qcow2,index=0,media=disk,if=virtio,cache=none \
> 	-device virtio-scsi-pci \
> 	-kernel /home/osandov/linux/builds/virtio-scsi-crash/arch/x86/boot/bzImage \
> 	-virtfs local,path=/home/osandov/linux/builds/virtio-scsi-crash,security_model=none,readonly,mount_tag=modules \
> 	-append 'root=/dev/vda1 console=ttyS0,115200 scsi_mod.use_blk_mq=y'
> 
> My kernel config is here:
> https://github.com/osandov/osandov-linux/blob/master/configs/qemu.config

Thanks. I was able to reproduce. Do attached two patches help?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-block-Allow-bdi-re-registration.patch --]
[-- Type: text/x-patch, Size: 2047 bytes --]

>From 60623ea9de6309f7717fc3536e3594b079735af0 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 7 Mar 2017 18:01:51 +0100
Subject: [PATCH 1/2] block: Allow bdi re-registration

SCSI can call device_add_disk() several times for one request queue when
a device in unbound and bound, creating new gendisk each time. This will
lead to bdi being repeatedly registered and unregistered. This was not a
big problem until commit 165a5e22fafb "block: Move bdi_unregister() to
del_gendisk()" since bdi was only registered repeatedly (bdi_register()
handles repeated calls fine, only we ended up leaking reference to
gendisk due to overwriting bdi->owner) but unregistered only in
blk_cleanup_queue() which didn't get called repeatedly. After
165a5e22fafb we were doing correct bdi_register() - bdi_unregister()
cycles however bdi_unregister() is not prepared for it. So make sure
bdi_unregister() cleans up bdi in such a way that it is prepared for
a possible following bdi_register() call.

An easy way to provoke this behavior is to enable
CONFIG_DEBUG_TEST_DRIVER_REMOVE and use scsi_debug driver to create a
scsi disk which immediately hangs without this fix.

Fixes: 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6d861d090e9f..6ac932210f56 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -710,6 +710,11 @@ static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 	 */
 	atomic_dec(&bdi->usage_cnt);
 	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
+	/*
+	 * Grab back our reference so that we hold it when @bdi gets
+	 * re-registered.
+	 */
+	atomic_inc(&bdi->usage_cnt);
 }
 
 /**
@@ -857,6 +862,8 @@ int bdi_register_owner(struct backing_dev_info *bdi, struct device *owner)
 			MINOR(owner->devt));
 	if (rc)
 		return rc;
+	/* Leaking owner reference... */
+	WARN_ON(bdi->owner);
 	bdi->owner = owner;
 	get_device(owner);
 	return 0;
-- 
2.10.2


[-- Attachment #3: 0002-bdi-Fix-use-after-free-in-wb_congested_put.patch --]
[-- Type: text/x-patch, Size: 3877 bytes --]

>From b9a6424cabd0eb4a663a1faeb11afef268faebc6 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 8 Mar 2017 13:24:47 +0100
Subject: [PATCH 2/2] bdi: Fix use-after-free in wb_congested_put()

bdi_writeback_congested structures get created for each blkcg and bdi
regardless whether bdi is registered or not. When they are created in
unregistered bdi and the request queue (and thus bdi) is then destroyed
while blkg still holds reference to bdi_writeback_congested structure,
this structure will be referencing freed bdi and last wb_congested_put()
will try to remove the structure from already freed bdi.

With commit 165a5e22fafb "block: Move bdi_unregister() to
del_gendisk()", SCSI started to destroy bdis without calling
bdi_unregister() first (previously it was calling bdi_unregister() even
for unregistered bdis) and thus the code detaching
bdi_writeback_congested in cgwb_bdi_destroy() was not triggered and we
started hitting this use-after-free bug. It is enough to boot a KVM
instance with virtio-scsi device to trigger this behavior.

Fix the problem by detaching bdi_writeback_congested structures in
bdi_exit() instead of bdi_unregister(). This is also more logical as
they can get attached to bdi regardless whether it ever got registered
or not.

Fixes: 165a5e22fafb127ecb5914e12e8c32a1f0d3f820
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/backing-dev.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6ac932210f56..c6f2a37028c2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -683,30 +683,18 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
 {
 	struct radix_tree_iter iter;
-	struct rb_node *rbn;
 	void **slot;
 
 	WARN_ON(test_bit(WB_registered, &bdi->wb.state));
 
 	spin_lock_irq(&cgwb_lock);
-
 	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
 		cgwb_kill(*slot);
-
-	while ((rbn = rb_first(&bdi->cgwb_congested_tree))) {
-		struct bdi_writeback_congested *congested =
-			rb_entry(rbn, struct bdi_writeback_congested, rb_node);
-
-		rb_erase(rbn, &bdi->cgwb_congested_tree);
-		congested->bdi = NULL;	/* mark @congested unlinked */
-	}
-
 	spin_unlock_irq(&cgwb_lock);
 
 	/*
-	 * All cgwb's and their congested states must be shutdown and
-	 * released before returning.  Drain the usage counter to wait for
-	 * all cgwb's and cgwb_congested's ever created on @bdi.
+	 * All cgwb's must be shutdown and released before returning.  Drain
+	 * the usage counter to wait for all cgwb's ever created on @bdi.
 	 */
 	atomic_dec(&bdi->usage_cnt);
 	wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
@@ -754,6 +742,21 @@ void wb_blkcg_offline(struct blkcg *blkcg)
 	spin_unlock_irq(&cgwb_lock);
 }
 
+static void cgwb_bdi_exit(struct backing_dev_info *bdi)
+{
+	struct rb_node *rbn;
+
+	spin_lock_irq(&cgwb_lock);
+	while ((rbn = rb_first(&bdi->cgwb_congested_tree))) {
+		struct bdi_writeback_congested *congested =
+			rb_entry(rbn, struct bdi_writeback_congested, rb_node);
+
+		rb_erase(rbn, &bdi->cgwb_congested_tree);
+		congested->bdi = NULL;	/* mark @congested unlinked */
+	}
+	spin_unlock_irq(&cgwb_lock);
+}
+
 #else	/* CONFIG_CGROUP_WRITEBACK */
 
 static int cgwb_bdi_init(struct backing_dev_info *bdi)
@@ -774,7 +777,9 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 	return 0;
 }
 
-static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+
+static void cgwb_bdi_exit(struct backing_dev_info *bdi)
 {
 	wb_congested_put(bdi->wb_congested);
 }
@@ -905,6 +910,7 @@ static void bdi_exit(struct backing_dev_info *bdi)
 {
 	WARN_ON_ONCE(bdi->dev);
 	wb_exit(&bdi->wb);
+	cgwb_bdi_exit(bdi);
 }
 
 static void release_bdi(struct kref *ref)
-- 
2.10.2


^ permalink raw reply related

* Re: [PATCH v2] cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode
From: Jan Kara @ 2017-03-08 14:05 UTC (permalink / raw)
  To: Hou Tao; +Cc: axboe, linux-block, jack, jmoyer, stable, vgoyal
In-Reply-To: <1488975415-40417-1-git-send-email-houtao1@huawei.com>

On Wed 08-03-17 20:16:55, Hou Tao wrote:
> When adding a cfq_group into the cfq service tree, we use CFQ_IDLE_DELAY
> as the delay of cfq_group's vdisktime if there have been other cfq_groups
> already.
> 
> When cfq is under iops mode, commit 9a7f38c42c2b ("cfq-iosched: Convert
> from jiffies to nanoseconds") could result in a large iops delay and
> lead to an abnormal io schedule delay for the added cfq_group. To fix
> it, we just need to revert to the old CFQ_IDLE_DELAY value: HZ / 5
> when iops mode is enabled.
> 
> Despite having the same value, the delay of a cfq_queue in idle class
> and the delay of cfq_queue are different things, so I define two new
> macros for the delay of a cfq_group under time-slice mode and IOPs mode.
> 
> Fixes: 9a7f38c42c2b92391d9dabaf9f51df7cfe5608e4
> Cc: <stable@vger.kernel.org> # 4.8+
> Signed-off-by: Hou Tao <houtao1@huawei.com>

OK, the number 200 is somewhat arbitrary but so is HZ/5 so I guess we are
OK. You can add:

Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  block/cfq-iosched.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> v2:
> - use constant instead of "nsecs_to_jiffies64(CFQ_IDLE_DELAY)"
>   as suggested by Jan Kara
> 
> v1:
> https://www.spinics.net/lists/stable/msg160580.html
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 440b95e..69754e8 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -38,9 +38,13 @@ static const u64 cfq_target_latency = (u64)NSEC_PER_SEC * 3/10; /* 300 ms */
>  static const int cfq_hist_divisor = 4;
>  
>  /*
> - * offset from end of service tree
> + * offset from end of queue service tree for idle class
>   */
>  #define CFQ_IDLE_DELAY		(NSEC_PER_SEC / 5)
> +/* offset from end of group service tree under time slice mode */
> +#define CFQ_SLICE_MODE_GROUP_DELAY (NSEC_PER_SEC / 5)
> +/* offset from end of group service under IOPS mode */
> +#define CFQ_IOPS_MODE_GROUP_DELAY (200)
>  
>  /*
>   * below this threshold, we consider thinktime immediate
> @@ -1362,6 +1366,14 @@ cfq_group_service_tree_add(struct cfq_rb_root *st, struct cfq_group *cfqg)
>  	cfqg->vfraction = max_t(unsigned, vfr, 1);
>  }
>  
> +static inline u64 cfq_get_cfqg_vdisktime_delay(struct cfq_data *cfqd)
> +{
> +	if (!iops_mode(cfqd))
> +		return CFQ_SLICE_MODE_GROUP_DELAY;
> +	else
> +		return CFQ_IOPS_MODE_GROUP_DELAY;
> +}
> +
>  static void
>  cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>  {
> @@ -1381,7 +1393,8 @@ cfq_group_notify_queue_add(struct cfq_data *cfqd, struct cfq_group *cfqg)
>  	n = rb_last(&st->rb);
>  	if (n) {
>  		__cfqg = rb_entry_cfqg(n);
> -		cfqg->vdisktime = __cfqg->vdisktime + CFQ_IDLE_DELAY;
> +		cfqg->vdisktime = __cfqg->vdisktime +
> +			cfq_get_cfqg_vdisktime_delay(cfqd);
>  	} else
>  		cfqg->vdisktime = st->min_vdisktime;
>  	cfq_group_service_tree_add(st, cfqg);
> -- 
> 2.5.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: cfq-iosched: two questions about the hrtimer version of CFQ
From: Jan Kara @ 2017-03-08 14:24 UTC (permalink / raw)
  To: Hou Tao
  Cc: Jan Kara, linux-block, Jens Axboe, linux-kernel, jmoyer,
	Vivek Goyal
In-Reply-To: <a718858f-95b1-32f6-eb8d-ba389b92a937@huawei.com>

Hi,

On Tue 07-03-17 08:11:44, Hou Tao wrote:
> When testing the hrtimer version of CFQ, we found a performance degradation
> problem which seems to be caused by commit 0b31c10 ("cfq-iosched: Charge at
> least 1 jiffie instead of 1 ns").
> 
> The following is the test process:
> 
> * filesystem and block device
> 	* XFS + /dev/sda mounted on /tmp/sda
> * CFQ configuration
> 	* default configuration
> * run "fio ./cfq.job"
> * fio job configuration cfq.job
> 	[global]
> 	bs=4k
> 	ioengine=psync
> 	iodepth=1
> 	direct=1
> 	rw=randwrite
> 	time_based
> 	runtime=15
> 	cgroup_nodelete=1
> 	group_reporting=1
> 
> 	[cfq_a]
> 	filename=/tmp/sda/cfq_a.dat
> 	size=2G
> 	cgroup_weight=500
> 	cgroup=cfq_a
> 	thread=1
> 	numjobs=2
> 
> 	[cfq_b]
> 	new_group
> 	filename=/tmp/sda/cfq_b.dat
> 	size=2G
> 	rate=4m
> 	cgroup_weight=500
> 	cgroup=cfq_b
> 	thread=1
> 	numjobs=2
> 
> The following is the test result:
> * with 0b31c10:
> 	* fio report
> 		cfq_a: bw=5312.6KB/s, iops=1328
> 		cfq_b: bw=8192.6KB/s, iops=2048
> 
> 	* blkcg debug files
> 		./cfq_a/blkio.group_wait_time:8:0 12062571233
> 		./cfq_b/blkio.group_wait_time:8:0 155841600
> 		./cfq_a/blkio.io_serviced:Total 19922
> 		./cfq_b/blkio.io_serviced:Total 30722
> 		./cfq_a/blkio.time:8:0 19406083246
> 		./cfq_b/blkio.time:8:0 19417146869
> 
> * without 0b31c10:
> 	* fio report
> 		cfq_a: bw=21670KB/s, iops=5417
> 		cfq_b: bw=8191.2KB/s, iops=2047
> 
> 	* blkcg debug files
> 		./cfq_a/blkio.group_wait_time:8:0 5798452504
> 		./cfq_b/blkio.group_wait_time:8:0 5131844007
> 		./cfq_a/blkio.io_serviced:8:0 Write 81261
> 		./cfq_b/blkio.io_serviced:8:0 Write 30722
> 		./cfq_a/blkio.time:8:0 5642608173
> 		./cfq_b/blkio.time:8:0 5849949812
> 
> We want to known the reason why you revert the minimal used slice to 1 jiffy
> when the slice has not been allocated. Doest it lead to some performance
> regressions or something similar ? If not, I think we could revert the minimal
> slice to 1 ns again.

So I reverted to accounting 1 jiffie because it was that way before my
commit 9a7f38c42c2b "cfq-iosched: Convert from jiffies to nanoseconds". I
am not aware of any particular issue caused by charging only 1 ns however
it is certainly underestimating the time used by cfqq for that one request
and could be possibly abused by malicious cgroups. How much should be
accounted to cfqq in case no request has completed yet is questionable and
frankly I don't have a good answer for that.

> Another problem is about the time comparison in CFQ code. In no-hrtimer
> version of CFQ, it uses time_after or time_before when possible, Why the
> hrtimer version doesn't use the equivalent time_after64/time_before64 ?
> Can ktime_get_ns() ensure there will be no wrapping problem ?

time_after64() and friends is for 64-bit jiffie values. CFQ is now working
in nanoseconds and not jiffies so you cannot use those functions. WRT
wrapping: 2^64 ns is ~584 years so I'm not concerned about wrapping.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown
From: Jan Kara @ 2017-03-08 14:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jan Kara, Jens Axboe, linux-block, Christoph Hellwig,
	Dan Williams, Thiago Jung Bauermann, Lekshmi Pillai, NeilBrown,
	Omar Sandoval
In-Reply-To: <20170307194349.GF31179@htj.duckdns.org>

On Tue 07-03-17 14:43:49, Tejun Heo wrote:
> Hello,
> 
> On Tue, Mar 07, 2017 at 12:14:20PM +0100, Jan Kara wrote:
> > > Given that this isn't a super hot path, I think it'd be far better to
> > > stick to a simpler synchronization mechanism.
> > 
> > I'd be happy to do so however struct gendisk does not have any lock in it
> > and introducing a special lock just for that seems awkward.
> 
> I think even a awkward shared lock is better than memory barriers.
> Barriers are the trickiest locking construct to get right and we
> really shouldn't using that unless absolutely necessary.

OK, I'll try to come up with something.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox