Linux virtualization list

Linux virtualization list
 help / color / mirror / Atom feed

* Re: [PATCH] virtio_ring: fix num_free handling in error case
From: Cornelia Huck @ 2018-02-23 12:47 UTC (permalink / raw)
  To: Tiwei Bie; +Cc: mst, linux-kernel, stable, virtualization, Andy Lutomirski
In-Reply-To: <20180223114130.16332-1-tiwei.bie@intel.com>

On Fri, 23 Feb 2018 19:41:30 +0800
Tiwei Bie <tiwei.bie@intel.com> wrote:

> The vq->vq.num_free hasn't been changed when error happens,
> so it shouldn't be changed when handling the error.
> 
> Fixes: 780bc7903a32 ("virtio_ring: Support DMA APIs")
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> ---
>  drivers/virtio/virtio_ring.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index eb30f3e09a47..71458f493cf8 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -428,8 +428,6 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  		i = virtio16_to_cpu(_vq->vdev, vq->vring.desc[i].next);
>  	}
>  
> -	vq->vq.num_free += total_sg;
> -
>  	if (indirect)
>  		kfree(desc);
>  

Reviewed-by: Cornelia Huck <cohuck@redhat.com>

^ permalink raw reply

* Re: [PATCH 2/6] pci: Scan all functions when probing while running over Jailhouse
From: Andy Shevchenko @ 2018-02-23 13:23 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: jailhouse-dev, Benedikt Spranger, linux-pci, x86,
	Linux Kernel Mailing List, virtualization, Ingo Molnar,
	H . Peter Anvin, Bjorn Helgaas, Thomas Gleixner
In-Reply-To: <66bfcfc6dc1832baa3fbd8e4879764d36aa9c1e7.1516601570.git.jan.kiszka@siemens.com>

On Mon, Jan 22, 2018 at 8:12 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:

>  #include <linux/export.h>
>  #include <linux/pci.h>
>  #include <asm/pci_x86.h>
> +#include <asm/jailhouse_para.h>

Keep it in order?


>  #include <linux/acpi.h>
>  #include <linux/irqdomain.h>
>  #include <linux/pm_runtime.h>
> +#include <linux/hypervisor.h>

Ditto.

-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context
From: Mark Rutland @ 2018-02-23 15:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jens Axboe, Theodore Ts'o, Michael S. Tsirkin, virtualization,
	linux-block, Jan Kara, linux-ext4

Hi all,

While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
number of splats in the block layer:

* inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
  jbd2_trans_will_send_data_barrier

* BUG: sleeping function called from invalid context at mm/mempool.c:320

* WARNING: CPU: 0 PID: 0 at block/blk.h:297 generic_make_request_checks+0x670/0x750

... I've included the full splats at the end of the mail.

These all happen in the context of the virtio block IRQ handler, so I
wonder if this calls something that doesn't expect to be called from IRQ
context. Is it valid to call blk_mq_complete_request() or
blk_mq_end_request() from an IRQ handler?

Syzkaller came up with a minimized reproducer, but it's a bit wacky (the
fcntl and bpf calls should have no practical effect), and I haven't
managed to come up with a C reproducer.

Any ideas?

Thanks,
Mark.


Syzkaller reproducer:
# {Threaded:true Collide:true Repeat:false Procs:1 Sandbox:setuid Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:false Debug:false Repro:false}
mmap(&(0x7f0000000000/0x24000)=nil, 0x24000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
r0 = openat(0xffffffffffffff9c, &(0x7f0000019000-0x8)='./file0\x00', 0x42, 0x0)
fcntl$setstatus(r0, 0x4, 0x10000)
ftruncate(r0, 0x400)
io_setup(0x1f, &(0x7f0000018000)=<r1=>0x0)
io_submit(r1, 0x1, &(0x7f000001d000-0x28)=[&(0x7f000001b000)={0x0, 0x0, 0x0, 0x1, 0x0, r0, &(0x7f0000022000-0x1000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x200, 0x0, 0x0, 0x0, 0x0}])
bpf$BPF_PROG_ATTACH(0xffffffff, &(0x7f000001b000)={0x0, 0x0, 0x3, 0x2}, 0x40000000)


Full splat:
[  162.337073] ================================
[  162.338055] WARNING: inconsistent lock state
[  162.339017] 4.16.0-rc2 #1 Not tainted
[  162.339797] --------------------------------
[  162.340725] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage.
[  162.342030] swapper/0/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
[  162.343061]  (&journal->j_state_lock){+?++}, at: [<000000003b9c3e4b>] jbd2_trans_will_send_data_barrier+0x44/0xc8
[  162.353187] {HARDIRQ-ON-W} state was registered at:
[  162.354433]   lock_acquire+0x48/0x68
[  162.358640]   _raw_write_lock+0x3c/0x50
[  162.360716]   ext4_init_journal_params.isra.6+0x40/0xa0
[  162.363445]   ext4_fill_super+0x25cc/0x2e88
[  162.364481]   mount_bdev+0x19c/0x1d8
[  162.365345]   ext4_mount+0x14/0x20
[  162.366130]   mount_fs+0x34/0x160
[  162.366790]   vfs_kern_mount.part.8+0x54/0x160
[  162.367874]   do_mount+0x540/0xd40
[  162.373776]   SyS_mount+0x68/0x100
[  162.374467]   mount_block_root+0x11c/0x28c
[  162.376558]   mount_root+0x130/0x164
[  162.380753]   prepare_namespace+0x138/0x180
[  162.381729]   kernel_init_freeable+0x25c/0x280
[  162.382625]   kernel_init+0x10/0x100
[  162.383337]   ret_from_fork+0x10/0x18
[  162.384072] irq event stamp: 3670810
[  162.384787] hardirqs last  enabled at (3670805): [<00000000fa98454e>] arch_cpu_idle+0x14/0x28
[  162.386505] hardirqs last disabled at (3670806): [<00000000341112e2>] el1_irq+0x74/0x130
[  162.388107] softirqs last  enabled at (3670810): [<00000000d00bd211>] _local_bh_enable+0x20/0x40
[  162.389880] softirqs last disabled at (3670809): [<00000000e929005c>] irq_enter+0x54/0x70
[  162.391443]
[  162.391443] other info that might help us debug this:
[  162.392680]  Possible unsafe locking scenario:
[  162.392680]
[  162.405967]        CPU0
[  162.406513]        ----
[  162.407055]   lock(&journal->j_state_lock);
[  162.407880]   <Interrupt>
[  162.408400]     lock(&journal->j_state_lock);
[  162.409287]
[  162.409287]  *** DEADLOCK ***
[  162.409287]
[  162.410447] 2 locks held by swapper/0/0:
[  162.411248]  #0:  (&(&vblk->vqs[i].lock)->rlock){-.-.}, at: [<00000000b75553ae>] virtblk_done+0x50/0xf8
[  162.413101]  #1:  (rcu_read_lock){....}, at: [<000000002bf2a216>] hctx_lock+0x1c/0xe8
[  162.414630]
[  162.414630] stack backtrace:
[  162.415492] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.0-rc2 #1
[  162.429624] Hardware name: linux,dummy-virt (DT)
[  162.430631] Call trace:
[  162.431196]  dump_backtrace+0x0/0x1c8
[  162.432005]  show_stack+0x14/0x20
[  162.432747]  dump_stack+0xac/0xe4
[  162.433507]  print_usage_bug.part.28+0x258/0x270
[  162.434510]  mark_lock+0x744/0x750
[  162.435208]  __lock_acquire+0xab8/0x18c0
[  162.435982]  lock_acquire+0x48/0x68
[  162.436678]  _raw_read_lock+0x3c/0x50
[  162.449520]  jbd2_trans_will_send_data_barrier+0x44/0xc8
[  162.450681]  ext4_sync_file+0x1e0/0x330
[  162.451535]  vfs_fsync_range+0x48/0xc0
[  162.452323]  dio_complete+0x1fc/0x220
[  162.453094]  dio_bio_end_aio+0xf0/0x130
[  162.453935]  bio_endio+0xe8/0xf8
[  162.454625]  blk_update_request+0x80/0x2e8
[  162.455450]  blk_mq_end_request+0x20/0x70
[  162.456240]  virtblk_request_done+0x24/0x30
[  162.457080]  __blk_mq_complete_request+0x100/0x1b0
[  162.458047]  blk_mq_complete_request+0x60/0x98
[  162.458918]  virtblk_done+0x70/0xf8
[  162.459608]  vring_interrupt+0x38/0x50
[  162.460367]  __handle_irq_event_percpu+0x5c/0x148
[  162.472443]  handle_irq_event_percpu+0x34/0x88
[  162.477991]  handle_irq_event+0x48/0x78
[  162.480661]  handle_fasteoi_irq+0xc0/0x198
[  162.485417]  generic_handle_irq+0x24/0x38
[  162.490334]  __handle_domain_irq+0x84/0xf0
[  162.493834]  gic_handle_irq+0x58/0xa8
[  162.498464]  el1_irq+0xb4/0x130
[  162.500621]  arch_cpu_idle+0x18/0x28
[  162.504729]  default_idle_call+0x1c/0x34
[  162.508005]  do_idle+0x17c/0x1f0
[  162.510184]  cpu_startup_entry+0x20/0x28
[  162.515050]  rest_init+0x250/0x260
[  162.518228]  start_kernel+0x3f0/0x41c
[  162.522987] BUG: sleeping function called from invalid context at mm/mempool.c:320
[  162.533762] in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/0
[  162.540375] INFO: lockdep is turned off.
[  162.542696] irq event stamp: 3670810
[  162.544963] hardirqs last  enabled at (3670805): [<00000000fa98454e>] arch_cpu_idle+0x14/0x28
[  162.551514] hardirqs last disabled at (3670806): [<00000000341112e2>] el1_irq+0x74/0x130
[  162.557422] softirqs last  enabled at (3670810): [<00000000d00bd211>] _local_bh_enable+0x20/0x40
[  162.562596] softirqs last disabled at (3670809): [<00000000e929005c>] irq_enter+0x54/0x70
[  162.568863] Preemption disabled at:
[  162.568877] [<ffff000008b59ca4>] schedule_preempt_disabled+0x1c/0x28
[  162.574673] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.0-rc2 #1
[  162.580013] Hardware name: linux,dummy-virt (DT)
[  162.583234] Call trace:
[  162.585279]  dump_backtrace+0x0/0x1c8
[  162.587583]  show_stack+0x14/0x20
[  162.589047]  dump_stack+0xac/0xe4
[  162.592035]  ___might_sleep+0x164/0x238
[  162.594830]  __might_sleep+0x50/0x88
[  162.597012]  mempool_alloc+0xc0/0x198
[  162.600633]  bio_alloc_bioset+0x144/0x250
[  162.602983]  blkdev_issue_flush+0x48/0xc8
[  162.606134]  ext4_sync_file+0x220/0x330
[  162.607870]  vfs_fsync_range+0x48/0xc0
[  162.611694]  dio_complete+0x1fc/0x220
[  162.613369]  dio_bio_end_aio+0xf0/0x130
[  162.617040]  bio_endio+0xe8/0xf8
[  162.618583]  blk_update_request+0x80/0x2e8
[  162.619841]  blk_mq_end_request+0x20/0x70
[  162.621082]  virtblk_request_done+0x24/0x30
[  162.627389]  __blk_mq_complete_request+0x100/0x1b0
[  162.630482]  blk_mq_complete_request+0x60/0x98
[  162.633003]  virtblk_done+0x70/0xf8
[  162.635456]  vring_interrupt+0x38/0x50
[  162.638588]  __handle_irq_event_percpu+0x5c/0x148
[  162.640582]  handle_irq_event_percpu+0x34/0x88
[  162.642829]  handle_irq_event+0x48/0x78
[  162.644062]  handle_fasteoi_irq+0xc0/0x198
[  162.646182]  generic_handle_irq+0x24/0x38
[  162.648024]  __handle_domain_irq+0x84/0xf0
[  162.650304]  gic_handle_irq+0x58/0xa8
[  162.652917]  el1_irq+0xb4/0x130
[  162.656972]  arch_cpu_idle+0x18/0x28
[  162.658445]  default_idle_call+0x1c/0x34
[  162.660723]  do_idle+0x17c/0x1f0
[  162.664958]  cpu_startup_entry+0x20/0x28
[  162.670009]  rest_init+0x250/0x260
[  162.672763]  start_kernel+0x3f0/0x41c
[  162.677064] WARNING: CPU: 0 PID: 0 at block/blk.h:297 generic_make_request_checks+0x670/0x750
[  162.682622] Kernel panic - not syncing: panic_on_warn set ...
[  162.682622]
[  162.695097] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W        4.16.0-rc2 #1
[  162.700355] Hardware name: linux,dummy-virt (DT)
[  162.713462] Call trace:
[  162.714859]  dump_backtrace+0x0/0x1c8
[  162.722935]  show_stack+0x14/0x20
[  162.725392]  dump_stack+0xac/0xe4
[  162.727719]  panic+0x13c/0x2b8
[  162.732313]  panic+0x0/0x2b8
[  162.735233]  report_bug+0xb4/0x158
[  162.738377]  bug_handler+0x30/0x88
[  162.740810]  brk_handler+0xd8/0x198
[  162.747677]  do_debug_exception+0x9c/0x190
[  162.752011]  el1_dbg+0x18/0x78
[  162.755197]  generic_make_request_checks+0x670/0x750
[  162.761260]  generic_make_request+0x2c/0x278
[  162.764366]  submit_bio+0x54/0x208
[  162.771402]  submit_bio_wait+0x68/0xa0
[  162.775053]  blkdev_issue_flush+0x8c/0xc8
[  162.780520]  ext4_sync_file+0x220/0x330
[  162.784815]  vfs_fsync_range+0x48/0xc0
[  162.788243]  dio_complete+0x1fc/0x220
[  162.792071]  dio_bio_end_aio+0xf0/0x130
[  162.794820]  bio_endio+0xe8/0xf8
[  162.796203]  blk_update_request+0x80/0x2e8
[  162.800883]  blk_mq_end_request+0x20/0x70
[  162.803072]  virtblk_request_done+0x24/0x30
[  162.806877]  __blk_mq_complete_request+0x100/0x1b0
[  162.809298]  blk_mq_complete_request+0x60/0x98
[  162.811408]  virtblk_done+0x70/0xf8
[  162.812461]  vring_interrupt+0x38/0x50
[  162.813467]  __handle_irq_event_percpu+0x5c/0x148
[  162.814657]  handle_irq_event_percpu+0x34/0x88
[  162.815730]  handle_irq_event+0x48/0x78
[  162.816856]  handle_fasteoi_irq+0xc0/0x198
[  162.817946]  generic_handle_irq+0x24/0x38
[  162.818948]  __handle_domain_irq+0x84/0xf0
[  162.820056]  gic_handle_irq+0x58/0xa8
[  162.821013]  el1_irq+0xb4/0x130
[  162.826572]  arch_cpu_idle+0x18/0x28
[  162.827304]  default_idle_call+0x1c/0x34
[  162.828315]  do_idle+0x17c/0x1f0
[  162.829174]  cpu_startup_entry+0x20/0x28
[  162.830072]  rest_init+0x250/0x260
[  162.830872]  start_kernel+0x3f0/0x41c
[  162.831627] SMP: stopping secondary CPUs
[  165.929235] SMP: failed to stop secondary CPUs 0,2
[  165.949001] Kernel Offset: disabled
[  165.963649] CPU features: 0x1802082
[  165.969405] Memory Limit: none
[  165.974480] Rebooting in 86400 seconds..

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Siwei Liu @ 2018-02-23 22:22 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Netdev, Alexander Duyck, virtualization,
	David Miller
In-Reply-To: <0d158bf6-79b3-442b-2c61-3e900ff40922@intel.com>

On Wed, Feb 21, 2018 at 6:35 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 2/21/2018 5:59 PM, Siwei Liu wrote:
>>
>> On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>>
>>> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <loseweigh@gmail.com> wrote:
>>>>
>>>> I haven't checked emails for days and did not realize the new revision
>>>> had already came out. And thank you for the effort, this revision
>>>> really looks to be a step forward towards our use case and is close to
>>>> what we wanted to do. A few questions in line.
>>>>
>>>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>>>> <alexander.duyck@gmail.com> wrote:
>>>>>
>>>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl> wrote:
>>>>>>
>>>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> Ppatch 2 is in response to the community request for a 3 netdev
>>>>>>> solution.  However, it creates some issues we'll get into in a
>>>>>>> moment.
>>>>>>> It extends virtio_net to use alternate datapath when available and
>>>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>>>> an additional 'bypass' netdev that acts as a master device and
>>>>>>> controls
>>>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active'
>>>>>>> netdev
>>>>>>> as default for transmits when it is available with link up and
>>>>>>> running.
>>>>>>
>>>>>> Thank you do doing this.
>>>>>>
>>>>>>> We noticed a couple of issues with this approach during testing.
>>>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>>>    virtio pci device, udev tries to rename both of them with the same
>>>>>>> name
>>>>>>>    and the 2nd rename will fail. This would be OK as long as the
>>>>>>> first netdev
>>>>>>>    to be renamed is the 'bypass' netdev, but the order in which udev
>>>>>>> gets
>>>>>>>    to rename the 2 netdevs is not reliable.
>>>>>>
>>>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>>>> struct device?
>>>>>
>>>>> The basic idea of all this is that we wanted this to work with an
>>>>> existing VM image that was using virtio. As such we were trying to
>>>>> make it so that the bypass interface takes the place of the original
>>>>> virtio and get udev to rename the bypass to what the original
>>>>> virtio_net was.
>>>>
>>>> Could it made it also possible to take over the config from VF instead
>>>> of virtio on an existing VM image? And get udev rename the bypass
>>>> netdev to what the original VF was. I don't say tightly binding the
>>>> bypass master to only virtio or VF, but I think we should provide both
>>>> options to support different upgrade paths. Possibly we could tweak
>>>> the device tree layout to reuse the same PCI slot for the master
>>>> bypass netdev, such that udev would not get confused when renaming the
>>>> device. The VF needs to use a different function slot afterwards.
>>>> Perhaps we might need to a special multiseat like QEMU device for that
>>>> purpose?
>>>>
>>>> Our case we'll upgrade the config from VF to virtio-bypass directly.
>>>
>>> So if I am understanding what you are saying you are wanting to flip
>>> the backup interface from the virtio to a VF. The problem is that
>>> becomes a bit of a vendor lock-in solution since it would rely on a
>>> specific VF driver. I would agree with Jiri that we don't want to go
>>> down that path. We don't want every VF out there firing up its own
>>> separate bond. Ideally you want the hypervisor to be able to manage
>>> all of this which is why it makes sense to have virtio manage this and
>>> why this is associated with the virtio_net interface.
>>
>> No, that's not what I was talking about of course. I thought you
>> mentioned the upgrade scenario this patch would like to address is to
>> use the bypass interface "to take the place of the original virtio,
>> and get udev to rename the bypass to what the original virtio_net
>> was". That is one of the possible upgrade paths for sure. However the
>> upgrade path I was seeking is to use the bypass interface to take the
>> place of original VF interface while retaining the name and network
>> configs, which generally can be done simply with kernel upgrade. It
>> would become limiting as this patch makes the bypass interface share
>> the same virtio pci device with virito backup. Can this bypass
>> interface be made general to take place of any pci device other than
>> virtio-net? This will be more helpful as the cloud users who has
>> existing setup on VF interface don't have to recreate it on virtio-net
>> and VF separately again.
>
>
> Yes. This sounds interesting. Looks like you want an existing VM image with
> VF only configuration to get transparent live migration support by adding
> virtio_net with BACKUP feature.  We may need another feature bit to switch
> between these 2 options.

Yes, that's what I was thinking about. I have been building something
like this before, and would like to get back after merging with your
patch.

>
>
>
>>
>>> The other bits get into more complexity then we are ready to handle
>>> for now. I think I might have talked about something similar that I
>>> was referring to as a "virtio-bond" where you would have a PCI/PCIe
>>> tree topology that makes this easier to sort out, and the "virtio-bond
>>> would be used to handle coordination/configuration of a much more
>>> complex interface.
>>
>> That was one way to solve this problem but I'd like to see simple ways
>> to sort it out.
>>
>>>>>> FWIW two solutions that immediately come to mind is to export "backup"
>>>>>> as phys_port_name of the backup virtio link and/or assign a name to
>>>>>> the
>>>>>> master like you are doing already.  I think team uses team%d and bond
>>>>>> uses bond%d, soft naming of master devices seems quite natural in this
>>>>>> case.
>>>>>
>>>>> I figured I had overlooked something like that.. Thanks for pointing
>>>>> this out. Okay so I think the phys_port_name approach might resolve
>>>>> the original issue. If I am reading things correctly what we end up
>>>>> with is the master showing up as "ens1" for example and the backup
>>>>> showing up as "ens1nbackup". Am I understanding that right?
>>>>>
>>>>> The problem with the team/bond%d approach is that it creates a new
>>>>> netdevice and so it would require guest configuration changes.
>>>>>
>>>>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>>>>> link is quite neat.
>>>>>
>>>>> I agree. For non-"backup" virio_net devices would it be okay for us to
>>>>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>>>>> behavior could be maintained although the function still exists.
>>>>>
>>>>>>> - When the 'active' netdev is unplugged OR not present on a
>>>>>>> destination
>>>>>>>    system after live migration, the user will see 2 virtio_net
>>>>>>> netdevs.
>>>>>>
>>>>>> That's necessary and expected, all configuration applies to the master
>>>>>> so master must exist.
>>>>>
>>>>> With the naming issue resolved this is the only item left outstanding.
>>>>> This becomes a matter of form vs function.
>>>>>
>>>>> The main complaint about the "3 netdev" solution is a bit confusing to
>>>>> have the 2 netdevs present if the VF isn't there. The idea is that
>>>>> having the extra "master" netdev there if there isn't really a bond is
>>>>> a bit ugly.
>>>>
>>>> Is it this uglier in terms of user experience rather than
>>>> functionality? I don't want it dynamically changed between 2-netdev
>>>> and 3-netdev depending on the presence of VF. That gets back to my
>>>> original question and suggestion earlier: why not just hide the lower
>>>> netdevs from udev renaming and such? Which important observability
>>>> benefits users may get if exposing the lower netdevs?
>>>>
>>>> Thanks,
>>>> -Siwei
>>>
>>> The only real advantage to a 2 netdev solution is that it looks like
>>> the netvsc solution, however it doesn't behave like it since there are
>>> some features like XDP that may not function correctly if they are
>>> left enabled in the virtio_net interface.
>>>
>>> As far as functionality the advantage of not hiding the lower devices
>>> is that they are free to be managed. The problem with pushing all of
>>> the configuration into the upper device is that you are limited to the
>>> intersection of the features of the lower devices. This can be
>>> limiting for some setups as some VFs support things like more queues,
>>> or better interrupt moderation options than others so trying to make
>>> everything work with one config would be ugly.
>>
>> It depends on how you build it and the way you expect it to work. IMHO
>> the lower devices don't need to be directly managed at all, otherwise
>> it ends up with loss of configuration across migration, and it really
>> does not bring much value than having a general team or bond device.
>> Users still have to reconfigure those queue settings and interrupt
>> moderation options after all. The new upper device could take the
>> assumption that the VF/PT lower device always has superior feature set
>> than virtio-net in order to apply advanced configuration. The upper
>> device should remember all configurations previously done and apply
>> supporting ones to active device automatically when switching the
>> datapath.
>>
> It should be possible to extend this patchset to support migration of
> additional
> settings  by enabling additional ndo_ops and ethtool_ops on the upper dev
> and propagating them down the lower devices and replaying the settings after
> the VF is replugged after migration.

Indeed. But your 3rd patch will collapse this merit of the 3-netdev
into the former 2-netdev model - I hope it's just for demostrating the
possibility of dynamically switching ndo_ops and ethtool_ops but not
actually making it complicated in implementing this further.

Thanks,
-Siwei
>
> Thanks
> Sridhar

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Jiri Pirko @ 2018-02-23 22:38 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, Alexander Duyck,
	virtualization, Netdev, David Miller
In-Reply-To: <CADGSJ22Z1ujZOc+OS5E=MwGCy1nTn9D6mHKBz+J=K9+--8R+jg@mail.gmail.com>

Fri, Feb 23, 2018 at 11:22:36PM CET, loseweigh@gmail.com wrote:

[...]

>>>
>>> No, that's not what I was talking about of course. I thought you
>>> mentioned the upgrade scenario this patch would like to address is to
>>> use the bypass interface "to take the place of the original virtio,
>>> and get udev to rename the bypass to what the original virtio_net
>>> was". That is one of the possible upgrade paths for sure. However the
>>> upgrade path I was seeking is to use the bypass interface to take the
>>> place of original VF interface while retaining the name and network
>>> configs, which generally can be done simply with kernel upgrade. It
>>> would become limiting as this patch makes the bypass interface share
>>> the same virtio pci device with virito backup. Can this bypass
>>> interface be made general to take place of any pci device other than
>>> virtio-net? This will be more helpful as the cloud users who has
>>> existing setup on VF interface don't have to recreate it on virtio-net
>>> and VF separately again.

How that could work? If you have the VF netdev with all configuration
including IPs and routes and whatever - now you want to do migration
so you add virtio_net and do some weird in-driver bonding with it. But
then, VF disappears and the VF netdev with that and also all
configuration it had.
I don't think this scenario is valid.


>>
>>
>> Yes. This sounds interesting. Looks like you want an existing VM image with
>> VF only configuration to get transparent live migration support by adding
>> virtio_net with BACKUP feature.  We may need another feature bit to switch
>> between these 2 options.
>
>Yes, that's what I was thinking about. I have been building something
>like this before, and would like to get back after merging with your
>patch.

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Stephen Hemminger @ 2018-02-23 23:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	Netdev, David Miller
In-Reply-To: <CAKgT0UeVhPOgRM0J6fyGDAJrTnqjc6AtC8+_UMs=hojZAajtdg@mail.gmail.com>

On Thu, 22 Feb 2018 13:30:12 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> > Again, I undertand your motivation. Yet I don't like your solution.
> > But if the decision is made to do this in-driver bonding. I would like
> > to see it baing done some generic way:
> > 1) share the same "in-driver bonding core" code with netvsc
> >    put to net/core.
> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >    like active-backup mode only, one vf, one backup, vf netdev type
> >    check (so noone could enslave a tap or anything else)
> > If user would need something more, he should employ team/bond.  

Sharing would be good, but netvsc world would really like to only have
one visible network device.

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Stephen Hemminger @ 2018-02-24  0:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Netdev, virtualization, Siwei Liu,
	Sridhar Samudrala, David Miller
In-Reply-To: <CAKgT0Uda7Bmzf06RMmD7hCVLr8hHQxaY8Fk7w9Qy=7Vi6NODwQ@mail.gmail.com>

(pruned to reduce thread)

On Wed, 21 Feb 2018 16:17:19 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> >>> FWIW two solutions that immediately come to mind is to export "backup"
> >>> as phys_port_name of the backup virtio link and/or assign a name to the
> >>> master like you are doing already.  I think team uses team%d and bond
> >>> uses bond%d, soft naming of master devices seems quite natural in this
> >>> case.  
> >>
> >> I figured I had overlooked something like that.. Thanks for pointing
> >> this out. Okay so I think the phys_port_name approach might resolve
> >> the original issue. If I am reading things correctly what we end up
> >> with is the master showing up as "ens1" for example and the backup
> >> showing up as "ens1nbackup". Am I understanding that right?
> >>
> >> The problem with the team/bond%d approach is that it creates a new
> >> netdevice and so it would require guest configuration changes.
> >>  
> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> >>> link is quite neat.  
> >>
> >> I agree. For non-"backup" virio_net devices would it be okay for us to
> >> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> >> behavior could be maintained although the function still exists.
> >>  
> >>>> - When the 'active' netdev is unplugged OR not present on a destination
> >>>>   system after live migration, the user will see 2 virtio_net netdevs.  
> >>>
> >>> That's necessary and expected, all configuration applies to the master
> >>> so master must exist.  
> >>
> >> With the naming issue resolved this is the only item left outstanding.
> >> This becomes a matter of form vs function.
> >>
> >> The main complaint about the "3 netdev" solution is a bit confusing to
> >> have the 2 netdevs present if the VF isn't there. The idea is that
> >> having the extra "master" netdev there if there isn't really a bond is
> >> a bit ugly.  
> >
> > Is it this uglier in terms of user experience rather than
> > functionality? I don't want it dynamically changed between 2-netdev
> > and 3-netdev depending on the presence of VF. That gets back to my
> > original question and suggestion earlier: why not just hide the lower
> > netdevs from udev renaming and such? Which important observability
> > benefits users may get if exposing the lower netdevs?
> >
> > Thanks,
> > -Siwei  
> 
> The only real advantage to a 2 netdev solution is that it looks like
> the netvsc solution, however it doesn't behave like it since there are
> some features like XDP that may not function correctly if they are
> left enabled in the virtio_net interface.
> 
> As far as functionality the advantage of not hiding the lower devices
> is that they are free to be managed. The problem with pushing all of
> the configuration into the upper device is that you are limited to the
> intersection of the features of the lower devices. This can be
> limiting for some setups as some VFs support things like more queues,
> or better interrupt moderation options than others so trying to make
> everything work with one config would be ugly.
> 


Let's not make XDP the blocker for doing the best solution
from the end user point of view. XDP is just yet another offload
thing which needs to be handled.  The current backup device solution
used in netvsc doesn't handle the full range of offload options
(things like flow direction, DCB, etc); no one but the HW vendors
seems to care.

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Siwei Liu @ 2018-02-24  0:17 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, Alexander Duyck,
	virtualization, Netdev, David Miller
In-Reply-To: <20180223223802.GA2010@nanopsycho>

On Fri, Feb 23, 2018 at 2:38 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Fri, Feb 23, 2018 at 11:22:36PM CET, loseweigh@gmail.com wrote:
>
> [...]
>
>>>>
>>>> No, that's not what I was talking about of course. I thought you
>>>> mentioned the upgrade scenario this patch would like to address is to
>>>> use the bypass interface "to take the place of the original virtio,
>>>> and get udev to rename the bypass to what the original virtio_net
>>>> was". That is one of the possible upgrade paths for sure. However the
>>>> upgrade path I was seeking is to use the bypass interface to take the
>>>> place of original VF interface while retaining the name and network
>>>> configs, which generally can be done simply with kernel upgrade. It
>>>> would become limiting as this patch makes the bypass interface share
>>>> the same virtio pci device with virito backup. Can this bypass
>>>> interface be made general to take place of any pci device other than
>>>> virtio-net? This will be more helpful as the cloud users who has
>>>> existing setup on VF interface don't have to recreate it on virtio-net
>>>> and VF separately again.
>
> How that could work? If you have the VF netdev with all configuration
> including IPs and routes and whatever - now you want to do migration
> so you add virtio_net and do some weird in-driver bonding with it. But
> then, VF disappears and the VF netdev with that and also all
> configuration it had.
> I don't think this scenario is valid.

We are talking about making udev aware of the new virtio-bypass to
rebind the name of the old VF interface with supposedly virtio-bypass
*post the kernel upgrade*. Of course, this needs virtio-net backend to
supply the [bdf] info where the VF/PT device was located.

-Siwei


>
>
>>>
>>>
>>> Yes. This sounds interesting. Looks like you want an existing VM image with
>>> VF only configuration to get transparent live migration support by adding
>>> virtio_net with BACKUP feature.  We may need another feature bit to switch
>>> between these 2 options.
>>
>>Yes, that's what I was thinking about. I have been building something
>>like this before, and would like to get back after merging with your
>>patch.

^ permalink raw reply

* hong_en@solenw.org
From: Solen win @ 2018-02-24 18:25 UTC (permalink / raw)
  To: virtualization


[-- Attachment #1.1: Type: text/plain, Size: 9 bytes --]

-- 
null

[-- Attachment #1.2: Type: text/html, Size: 90 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Alexander Duyck @ 2018-02-25 22:17 UTC (permalink / raw)
  To: Stephen Hemminger, Jakub Kicinski, Michael S. Tsirkin, Siwei Liu,
	Jiri Pirko, Jason Wang
  Cc: Duyck, Alexander H, virtio-dev, Sridhar Samudrala, virtualization,
	Netdev, David Miller
In-Reply-To: <20180223160305.71fb2db2@xeon-e3>

On Fri, Feb 23, 2018 at 4:03 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> (pruned to reduce thread)
>
> On Wed, 21 Feb 2018 16:17:19 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> >>> FWIW two solutions that immediately come to mind is to export "backup"
>> >>> as phys_port_name of the backup virtio link and/or assign a name to the
>> >>> master like you are doing already.  I think team uses team%d and bond
>> >>> uses bond%d, soft naming of master devices seems quite natural in this
>> >>> case.
>> >>
>> >> I figured I had overlooked something like that.. Thanks for pointing
>> >> this out. Okay so I think the phys_port_name approach might resolve
>> >> the original issue. If I am reading things correctly what we end up
>> >> with is the master showing up as "ens1" for example and the backup
>> >> showing up as "ens1nbackup". Am I understanding that right?
>> >>
>> >> The problem with the team/bond%d approach is that it creates a new
>> >> netdevice and so it would require guest configuration changes.
>> >>
>> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>> >>> link is quite neat.
>> >>
>> >> I agree. For non-"backup" virio_net devices would it be okay for us to
>> >> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>> >> behavior could be maintained although the function still exists.
>> >>
>> >>>> - When the 'active' netdev is unplugged OR not present on a destination
>> >>>>   system after live migration, the user will see 2 virtio_net netdevs.
>> >>>
>> >>> That's necessary and expected, all configuration applies to the master
>> >>> so master must exist.
>> >>
>> >> With the naming issue resolved this is the only item left outstanding.
>> >> This becomes a matter of form vs function.
>> >>
>> >> The main complaint about the "3 netdev" solution is a bit confusing to
>> >> have the 2 netdevs present if the VF isn't there. The idea is that
>> >> having the extra "master" netdev there if there isn't really a bond is
>> >> a bit ugly.
>> >
>> > Is it this uglier in terms of user experience rather than
>> > functionality? I don't want it dynamically changed between 2-netdev
>> > and 3-netdev depending on the presence of VF. That gets back to my
>> > original question and suggestion earlier: why not just hide the lower
>> > netdevs from udev renaming and such? Which important observability
>> > benefits users may get if exposing the lower netdevs?
>> >
>> > Thanks,
>> > -Siwei
>>
>> The only real advantage to a 2 netdev solution is that it looks like
>> the netvsc solution, however it doesn't behave like it since there are
>> some features like XDP that may not function correctly if they are
>> left enabled in the virtio_net interface.
>>
>> As far as functionality the advantage of not hiding the lower devices
>> is that they are free to be managed. The problem with pushing all of
>> the configuration into the upper device is that you are limited to the
>> intersection of the features of the lower devices. This can be
>> limiting for some setups as some VFs support things like more queues,
>> or better interrupt moderation options than others so trying to make
>> everything work with one config would be ugly.
>>
>
>
> Let's not make XDP the blocker for doing the best solution
> from the end user point of view. XDP is just yet another offload
> thing which needs to be handled.  The current backup device solution
> used in netvsc doesn't handle the full range of offload options
> (things like flow direction, DCB, etc); no one but the HW vendors
> seems to care.

XDP isn't the blocker here. As far as I am concerned we can go either
way, with a 2 netdev or a 3 netdev solution. We just need to make sure
we are aware of all the trade-offs, and make a decision one way or the
other. This is quickly turning into a bikeshed and I would prefer us
to all agree, or at least disagree and commit, on which way to go
before we burn more cycles on a patch set that seems to be getting
tied up in debate.

With the 2 netdev solution we have to limit the functionality so that
we don't break things when we bypass the guts of the driver to hand
traffic off to the VF. Then ends up meaning that we are stuck with an
extra qdisc and Tx queue lock in the transmit path of the VF, and we
cannot rely on any in-driver Rx functionality to work such as
in-driver XDP. However the advantage here is that this is how netvsc
is already doing things.

The issue with the 3 netdev solution is that you are stuck with 2
netdevs ("ens1", "ens1nbackup") when the VF is not present. It could
be argued this isn't a very elegant looking solution, especially when
the VF is not present. With virtio this makes more sense though as you
are still able to expose the full functionality of the lower device so
you don't have to strip or drop any of the existing net device ops if
the "backup" bit is present.

Ultimately I would have preferred to have the 3 netdev solution go
with virtio only as it would have allowed for healthy competition
between the two designs and we could have seen which one would have
ultimately won out, but if we have to decide this now we need to do so
before we put too much more effort into the patches as these end up
becoming two very different solutions, especially if we have to apply
the solution to both drivers. My preference would still be 3 netdevs
since we could apply this to netvsc without too many changes, but I
will agree with whatever conclusion we can come to in terms of how
this is supposed to work for both netvsc and virtio_bypass.

- Alex

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Alexander Duyck @ 2018-02-25 22:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, virtualization, Siwei Liu,
	Netdev, David Miller
In-Reply-To: <20180223155904.27b11865@xeon-e3>

On Fri, Feb 23, 2018 at 3:59 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Thu, 22 Feb 2018 13:30:12 -0800
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> > Again, I undertand your motivation. Yet I don't like your solution.
>> > But if the decision is made to do this in-driver bonding. I would like
>> > to see it baing done some generic way:
>> > 1) share the same "in-driver bonding core" code with netvsc
>> >    put to net/core.
>> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> >    like active-backup mode only, one vf, one backup, vf netdev type
>> >    check (so noone could enslave a tap or anything else)
>> > If user would need something more, he should employ team/bond.
>
> Sharing would be good, but netvsc world would really like to only have
> one visible network device.

Other than the netdev count are there any other issues we need to be
thinking about?

If I am not mistaken you netvsc doesn't put any broadcast/multicast
filters on the VF. If we ended up doing that in order to support the
virtio based solution would that cause any issues? I just realized we
had overlooked dealing with multicast in our current solution so we
will probably be looking at syncing the multicast list like what
occurs in netvsc, however we will need to do it for both the VF and
the virtio interfaces.

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCH v28 0/4] Virtio-balloon: support free page reporting
From: Wei Wang @ 2018-02-26  4:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: yang.zhang.wz@gmail.com, virtio-dev@lists.oasis-open.org,
	riel@redhat.com, quan.xu0@gmail.com, kvm@vger.kernel.org,
	nilal@redhat.com, liliang.opensource@gmail.com,
	linux-kernel@vger.kernel.org, mhocko@kernel.org,
	linux-mm@kvack.org, huangzhichao@huawei.com, pbonzini@redhat.com,
	akpm@linux-foundation.org,
	virtualization@lists.linux-foundation.org, dgilbert@redhat.com
In-Reply-To: <20180209051212-mutt-send-email-mst@kernel.org>

On 02/09/2018 11:14 AM, Michael S. Tsirkin wrote:
> On Fri, Feb 09, 2018 at 11:11:39AM +0800, Wei Wang wrote:
>> On 02/09/2018 03:55 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 08, 2018 at 05:50:16PM +0800, Wei Wang wrote:
>>>
>>>> Details:
>>>> Set up a Ping-Pong local live migration, where the guest ceaselessy
>>>> migrates between the source and destination. Linux compilation,
>>>> i.e. make bzImage -j4, is performed during the Ping-Pong migration. The
>>>> legacy case takes 5min14s to finish the compilation. With this
>>>> optimization patched, it takes 5min12s.
>>> How is migration time affected in this case?
>>
>> When the linux compilation workload runs, the migration time (both the
>> legacy and this optimization case) varies as the compilation goes on. It
>> seems not easy to give a static speedup number, some times the migration
>> time is reduced to 33%, sometimes to 50%, it varies, and depends on how much
>> free memory the system has at that moment. For example, at the later stage
>> of the compilation, I can observe 5GB memory being used as page cache. But
>> overall, I can observe obvious improvement of the migration time.
>>
>>
>> Best,
>> Wei
> You can run multiple tests and give a best, worst and median numbers.
>

Sorry for my late response (I was on leaves for some other 
responsibilities).

Here are some more numbers of the live migration time comparison while 
linux compilation is in progress (probably the average time is more 
relevant)

average: 52.4% reduction (optimization v.s. legacy = 1242ms v.s. 2611ms)
best: 69.1%
worst: 21.9%


Best,
Wei

^ permalink raw reply

* Re: [PATCH] mm/page_poison: move PAGE_POISON to page_poison.c
From: Wei Wang @ 2018-02-26  4:37 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, akpm, mst, linux-kernel, virtualization
In-Reply-To: <20180213101615.GO3443@dhcp22.suse.cz>

On 02/13/2018 06:16 PM, Michal Hocko wrote:
> On Fri 09-02-18 16:08:14, Wei Wang wrote:
>> The PAGE_POISON macro is used in page_poison.c only, so avoid exporting
>> it. Also remove the "mm/debug-pagealloc.c" related comment, which is
>> obsolete.
> Why is this an improvement? I thought the whole point of poison.h is to
> keep all the poison value at a single place to make them obviously
> unique.

There isn't a comment explaining why they are exposed. We did this 
because PAGE_POISON is used by page_poison.c only, it seems not 
necessary to expose the private values.
Why would it be not unique if moved to page_poison.c (on condition that 
it is only used there)?

Best,
Wei

^ permalink raw reply

* Re: [v1 1/1] xen, mm: Allow deferred page initialization for xen pv domains
From: Juergen Gross @ 2018-02-26  7:17 UTC (permalink / raw)
  To: Pavel Tatashin, steven.sistare, daniel.m.jordan, akataria, tglx,
	mingo, hpa, x86, boris.ostrovsky, akpm, mhocko, vbabka, luto,
	labbott, kirill.shutemov, bp, minipli, jinb.park7, dan.j.williams,
	bhe, zhang.jia, mgorman, hannes, virtualization, linux-kernel,
	xen-devel, linux-mm
In-Reply-To: <20180223232538.4314-2-pasha.tatashin@oracle.com>

On 24/02/18 00:25, Pavel Tatashin wrote:
> Juergen Gross noticed that commit
> f7f99100d8d ("mm: stop zeroing memory during allocation in vmemmap")
> broke XEN PV domains when deferred struct page initialization is enabled.
> 
> This is because the xen's PagePinned() flag is getting erased from struct
> pages when they are initialized later in boot.
> 
> Juergen fixed this problem by disabling deferred pages on xen pv domains.
> However, it is desirable to have this feature available, as it reduces boot
> time. This fix re-enables the feature for pv-dmains, and fixes the problem
> the following way:
> 
> The fix is to delay setting PagePinned flag until struct pages for all
> allocated memory are initialized (until free_all_bootmem()).
> 
> A new hypervisor op pv_init_ops.after_bootmem() is called to let xen know
> that boot allocator is done, and hence struct pages for all the allocated
> memory are now initialized. If deferred page initialization is enabled, the
> rest of struct pages are going to be initialized later in boot once
> page_alloc_init_late() is called.
> 
> xen_after_bootmem() is xen's implementation of pv_init_ops.after_bootmem(),
> we walk page table and mark every page as pinned.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/x86/include/asm/paravirt.h       |  9 +++++++++
>  arch/x86/include/asm/paravirt_types.h |  3 +++
>  arch/x86/kernel/paravirt.c            |  1 +
>  arch/x86/mm/init_32.c                 |  1 +
>  arch/x86/mm/init_64.c                 |  1 +
>  arch/x86/xen/mmu_pv.c                 | 38 ++++++++++++++++++++++++-----------
>  mm/page_alloc.c                       |  4 ----
>  7 files changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 9be2bf13825b..737e596a9836 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -820,6 +820,11 @@ static inline notrace unsigned long arch_local_irq_save(void)
>  
>  extern void default_banner(void);
>  
> +static inline void paravirt_after_bootmem(void)
> +{
> +	pv_init_ops.after_bootmem();
> +}
> +

Putting this in the paravirt framework is overkill IMO. There is no need
to patch the callsites for optimal performance.

I'd put it into struct x86_hyper_init and pre-init it with x86_init_noop

>  #else  /* __ASSEMBLY__ */
>  
>  #define _PVSITE(ptype, clobbers, ops, word, algn)	\
> @@ -964,6 +969,10 @@ static inline void paravirt_arch_dup_mmap(struct mm_struct *oldmm,
>  static inline void paravirt_arch_exit_mmap(struct mm_struct *mm)
>  {
>  }
> +
> +static inline void paravirt_after_bootmem(void)
> +{
> +}
>  #endif /* __ASSEMBLY__ */
>  #endif /* !CONFIG_PARAVIRT */
>  #endif /* _ASM_X86_PARAVIRT_H */
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 180bc0bff0fb..da78a3610168 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -86,6 +86,9 @@ struct pv_init_ops {
>  	 */
>  	unsigned (*patch)(u8 type, u16 clobber, void *insnbuf,
>  			  unsigned long addr, unsigned len);
> +
> +	/* called right after we finish boot allocator */
> +	void (*after_bootmem)(void);
>  } __no_randomize_layout;
>  
>  
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 99dc79e76bdc..7b5f931e2e3a 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -315,6 +315,7 @@ struct pv_info pv_info = {
>  
>  struct pv_init_ops pv_init_ops = {
>  	.patch = native_patch,
> +	.after_bootmem = paravirt_nop,
>  };
>  
>  struct pv_time_ops pv_time_ops = {
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 79cb066f40c0..6096d0d9ecbc 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -763,6 +763,7 @@ void __init mem_init(void)
>  	free_all_bootmem();
>  
>  	after_bootmem = 1;
> +	paravirt_after_bootmem();
>  
>  	mem_init_print_info(NULL);
>  	printk(KERN_INFO "virtual kernel memory layout:\n"
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 332f6e25977a..70b7b5093d07 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1189,6 +1189,7 @@ void __init mem_init(void)
>  	/* this will put all memory onto the freelists */
>  	free_all_bootmem();
>  	after_bootmem = 1;
> +	paravirt_after_bootmem();
>  
>  	/*
>  	 * Must be done after boot memory is put on freelist, because here we
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index d20763472920..603589809334 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -116,6 +116,8 @@ DEFINE_PER_CPU(unsigned long, xen_current_cr3);	 /* actual vcpu cr3 */
>  
>  static phys_addr_t xen_pt_base, xen_pt_size __initdata;
>  
> +static DEFINE_STATIC_KEY_FALSE(xen_struct_pages_ready);
> +
>  /*
>   * Just beyond the highest usermode address.  STACK_TOP_MAX has a
>   * redzone above it, so round it up to a PGD boundary.
> @@ -155,11 +157,18 @@ void make_lowmem_page_readwrite(void *vaddr)
>  }
>  
>  
> +/*
> + * During early boot all pages are pinned, but we do not have struct pages,
> + * so return true until struct pages are ready.
> + */

Uuh, this comment is just not true.

The "pinned" state for Xen means it is a pv pagetable known to Xen. Such
pages are read-only for the guest and can be modified via hypercalls
only.

So either the "pinned" state will be tested for page tables only, in
which case the comment needs adjustment, or the code is wrong.


Juergen

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Jiri Pirko @ 2018-02-26  7:19 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, Alexander Duyck,
	virtualization, Siwei Liu, Netdev, David Miller
In-Reply-To: <20180223155904.27b11865@xeon-e3>

Sat, Feb 24, 2018 at 12:59:04AM CET, stephen@networkplumber.org wrote:
>On Thu, 22 Feb 2018 13:30:12 -0800
>Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> > Again, I undertand your motivation. Yet I don't like your solution.
>> > But if the decision is made to do this in-driver bonding. I would like
>> > to see it baing done some generic way:
>> > 1) share the same "in-driver bonding core" code with netvsc
>> >    put to net/core.
>> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> >    like active-backup mode only, one vf, one backup, vf netdev type
>> >    check (so noone could enslave a tap or anything else)
>> > If user would need something more, he should employ team/bond.  
>
>Sharing would be good, but netvsc world would really like to only have
>one visible network device.

Why do you mind? All would be the same, there would be just another
netdevice unused by the vm user (same as the vf netdev).

^ permalink raw reply

* Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context
From: Jan Kara @ 2018-02-26 10:52 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Jens Axboe, Theodore Ts'o, Michael S. Tsirkin, linux-kernel,
	virtualization, linux-block, Jan Kara, linux-ext4
In-Reply-To: <20180223154735.x4ezmpjbjm6o6duw@lakrids.cambridge.arm.com>

[-- Attachment #1: Type: text/plain, Size: 12252 bytes --]

On Fri 23-02-18 15:47:36, Mark Rutland wrote:
> Hi all,
> 
> While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
> number of splats in the block layer:
> 
> * inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
>   jbd2_trans_will_send_data_barrier
> 
> * BUG: sleeping function called from invalid context at mm/mempool.c:320
> 
> * WARNING: CPU: 0 PID: 0 at block/blk.h:297 generic_make_request_checks+0x670/0x750
> 
> ... I've included the full splats at the end of the mail.
> 
> These all happen in the context of the virtio block IRQ handler, so I
> wonder if this calls something that doesn't expect to be called from IRQ
> context. Is it valid to call blk_mq_complete_request() or
> blk_mq_end_request() from an IRQ handler?

No, it's likely a bug in detection whether IO completion should be deferred
to a workqueue or not. Does attached patch fix the problem? I don't see
exactly this being triggered by the syzkaller but it's close enough :)

								Honza


> Syzkaller came up with a minimized reproducer, but it's a bit wacky (the
> fcntl and bpf calls should have no practical effect), and I haven't
> managed to come up with a C reproducer.
> 
> Any ideas?
> 
> Thanks,
> Mark.
> 
> 
> Syzkaller reproducer:
> # {Threaded:true Collide:true Repeat:false Procs:1 Sandbox:setuid Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:false Debug:false Repro:false}
> mmap(&(0x7f0000000000/0x24000)=nil, 0x24000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
> r0 = openat(0xffffffffffffff9c, &(0x7f0000019000-0x8)='./file0\x00', 0x42, 0x0)
> fcntl$setstatus(r0, 0x4, 0x10000)
> ftruncate(r0, 0x400)
> io_setup(0x1f, &(0x7f0000018000)=<r1=>0x0)
> io_submit(r1, 0x1, &(0x7f000001d000-0x28)=[&(0x7f000001b000)={0x0, 0x0, 0x0, 0x1, 0x0, r0, &(0x7f0000022000-0x1000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
>  00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x200, 0x0, 0x0, 0x0, 0x0}])
> bpf$BPF_PROG_ATTACH(0xffffffff, &(0x7f000001b000)={0x0, 0x0, 0x3, 0x2}, 0x40000000)
> 
> 
> Full splat:
> [  162.337073] ================================
> [  162.338055] WARNING: inconsistent lock state
> [  162.339017] 4.16.0-rc2 #1 Not tainted
> [  162.339797] --------------------------------
> [  162.340725] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage.
> [  162.342030] swapper/0/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [  162.343061]  (&journal->j_state_lock){+?++}, at: [<000000003b9c3e4b>] jbd2_trans_will_send_data_barrier+0x44/0xc8
> [  162.353187] {HARDIRQ-ON-W} state was registered at:
> [  162.354433]   lock_acquire+0x48/0x68
> [  162.358640]   _raw_write_lock+0x3c/0x50
> [  162.360716]   ext4_init_journal_params.isra.6+0x40/0xa0
> [  162.363445]   ext4_fill_super+0x25cc/0x2e88
> [  162.364481]   mount_bdev+0x19c/0x1d8
> [  162.365345]   ext4_mount+0x14/0x20
> [  162.366130]   mount_fs+0x34/0x160
> [  162.366790]   vfs_kern_mount.part.8+0x54/0x160
> [  162.367874]   do_mount+0x540/0xd40
> [  162.373776]   SyS_mount+0x68/0x100
> [  162.374467]   mount_block_root+0x11c/0x28c
> [  162.376558]   mount_root+0x130/0x164
> [  162.380753]   prepare_namespace+0x138/0x180
> [  162.381729]   kernel_init_freeable+0x25c/0x280
> [  162.382625]   kernel_init+0x10/0x100
> [  162.383337]   ret_from_fork+0x10/0x18
> [  162.384072] irq event stamp: 3670810
> [  162.384787] hardirqs last  enabled at (3670805): [<00000000fa98454e>] arch_cpu_idle+0x14/0x28
> [  162.386505] hardirqs last disabled at (3670806): [<00000000341112e2>] el1_irq+0x74/0x130
> [  162.388107] softirqs last  enabled at (3670810): [<00000000d00bd211>] _local_bh_enable+0x20/0x40
> [  162.389880] softirqs last disabled at (3670809): [<00000000e929005c>] irq_enter+0x54/0x70
> [  162.391443]
> [  162.391443] other info that might help us debug this:
> [  162.392680]  Possible unsafe locking scenario:
> [  162.392680]
> [  162.405967]        CPU0
> [  162.406513]        ----
> [  162.407055]   lock(&journal->j_state_lock);
> [  162.407880]   <Interrupt>
> [  162.408400]     lock(&journal->j_state_lock);
> [  162.409287]
> [  162.409287]  *** DEADLOCK ***
> [  162.409287]
> [  162.410447] 2 locks held by swapper/0/0:
> [  162.411248]  #0:  (&(&vblk->vqs[i].lock)->rlock){-.-.}, at: [<00000000b75553ae>] virtblk_done+0x50/0xf8
> [  162.413101]  #1:  (rcu_read_lock){....}, at: [<000000002bf2a216>] hctx_lock+0x1c/0xe8
> [  162.414630]
> [  162.414630] stack backtrace:
> [  162.415492] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.0-rc2 #1
> [  162.429624] Hardware name: linux,dummy-virt (DT)
> [  162.430631] Call trace:
> [  162.431196]  dump_backtrace+0x0/0x1c8
> [  162.432005]  show_stack+0x14/0x20
> [  162.432747]  dump_stack+0xac/0xe4
> [  162.433507]  print_usage_bug.part.28+0x258/0x270
> [  162.434510]  mark_lock+0x744/0x750
> [  162.435208]  __lock_acquire+0xab8/0x18c0
> [  162.435982]  lock_acquire+0x48/0x68
> [  162.436678]  _raw_read_lock+0x3c/0x50
> [  162.449520]  jbd2_trans_will_send_data_barrier+0x44/0xc8
> [  162.450681]  ext4_sync_file+0x1e0/0x330
> [  162.451535]  vfs_fsync_range+0x48/0xc0
> [  162.452323]  dio_complete+0x1fc/0x220
> [  162.453094]  dio_bio_end_aio+0xf0/0x130
> [  162.453935]  bio_endio+0xe8/0xf8
> [  162.454625]  blk_update_request+0x80/0x2e8
> [  162.455450]  blk_mq_end_request+0x20/0x70
> [  162.456240]  virtblk_request_done+0x24/0x30
> [  162.457080]  __blk_mq_complete_request+0x100/0x1b0
> [  162.458047]  blk_mq_complete_request+0x60/0x98
> [  162.458918]  virtblk_done+0x70/0xf8
> [  162.459608]  vring_interrupt+0x38/0x50
> [  162.460367]  __handle_irq_event_percpu+0x5c/0x148
> [  162.472443]  handle_irq_event_percpu+0x34/0x88
> [  162.477991]  handle_irq_event+0x48/0x78
> [  162.480661]  handle_fasteoi_irq+0xc0/0x198
> [  162.485417]  generic_handle_irq+0x24/0x38
> [  162.490334]  __handle_domain_irq+0x84/0xf0
> [  162.493834]  gic_handle_irq+0x58/0xa8
> [  162.498464]  el1_irq+0xb4/0x130
> [  162.500621]  arch_cpu_idle+0x18/0x28
> [  162.504729]  default_idle_call+0x1c/0x34
> [  162.508005]  do_idle+0x17c/0x1f0
> [  162.510184]  cpu_startup_entry+0x20/0x28
> [  162.515050]  rest_init+0x250/0x260
> [  162.518228]  start_kernel+0x3f0/0x41c
> [  162.522987] BUG: sleeping function called from invalid context at mm/mempool.c:320
> [  162.533762] in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/0
> [  162.540375] INFO: lockdep is turned off.
> [  162.542696] irq event stamp: 3670810
> [  162.544963] hardirqs last  enabled at (3670805): [<00000000fa98454e>] arch_cpu_idle+0x14/0x28
> [  162.551514] hardirqs last disabled at (3670806): [<00000000341112e2>] el1_irq+0x74/0x130
> [  162.557422] softirqs last  enabled at (3670810): [<00000000d00bd211>] _local_bh_enable+0x20/0x40
> [  162.562596] softirqs last disabled at (3670809): [<00000000e929005c>] irq_enter+0x54/0x70
> [  162.568863] Preemption disabled at:
> [  162.568877] [<ffff000008b59ca4>] schedule_preempt_disabled+0x1c/0x28
> [  162.574673] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.16.0-rc2 #1
> [  162.580013] Hardware name: linux,dummy-virt (DT)
> [  162.583234] Call trace:
> [  162.585279]  dump_backtrace+0x0/0x1c8
> [  162.587583]  show_stack+0x14/0x20
> [  162.589047]  dump_stack+0xac/0xe4
> [  162.592035]  ___might_sleep+0x164/0x238
> [  162.594830]  __might_sleep+0x50/0x88
> [  162.597012]  mempool_alloc+0xc0/0x198
> [  162.600633]  bio_alloc_bioset+0x144/0x250
> [  162.602983]  blkdev_issue_flush+0x48/0xc8
> [  162.606134]  ext4_sync_file+0x220/0x330
> [  162.607870]  vfs_fsync_range+0x48/0xc0
> [  162.611694]  dio_complete+0x1fc/0x220
> [  162.613369]  dio_bio_end_aio+0xf0/0x130
> [  162.617040]  bio_endio+0xe8/0xf8
> [  162.618583]  blk_update_request+0x80/0x2e8
> [  162.619841]  blk_mq_end_request+0x20/0x70
> [  162.621082]  virtblk_request_done+0x24/0x30
> [  162.627389]  __blk_mq_complete_request+0x100/0x1b0
> [  162.630482]  blk_mq_complete_request+0x60/0x98
> [  162.633003]  virtblk_done+0x70/0xf8
> [  162.635456]  vring_interrupt+0x38/0x50
> [  162.638588]  __handle_irq_event_percpu+0x5c/0x148
> [  162.640582]  handle_irq_event_percpu+0x34/0x88
> [  162.642829]  handle_irq_event+0x48/0x78
> [  162.644062]  handle_fasteoi_irq+0xc0/0x198
> [  162.646182]  generic_handle_irq+0x24/0x38
> [  162.648024]  __handle_domain_irq+0x84/0xf0
> [  162.650304]  gic_handle_irq+0x58/0xa8
> [  162.652917]  el1_irq+0xb4/0x130
> [  162.656972]  arch_cpu_idle+0x18/0x28
> [  162.658445]  default_idle_call+0x1c/0x34
> [  162.660723]  do_idle+0x17c/0x1f0
> [  162.664958]  cpu_startup_entry+0x20/0x28
> [  162.670009]  rest_init+0x250/0x260
> [  162.672763]  start_kernel+0x3f0/0x41c
> [  162.677064] WARNING: CPU: 0 PID: 0 at block/blk.h:297 generic_make_request_checks+0x670/0x750
> [  162.682622] Kernel panic - not syncing: panic_on_warn set ...
> [  162.682622]
> [  162.695097] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W        4.16.0-rc2 #1
> [  162.700355] Hardware name: linux,dummy-virt (DT)
> [  162.713462] Call trace:
> [  162.714859]  dump_backtrace+0x0/0x1c8
> [  162.722935]  show_stack+0x14/0x20
> [  162.725392]  dump_stack+0xac/0xe4
> [  162.727719]  panic+0x13c/0x2b8
> [  162.732313]  panic+0x0/0x2b8
> [  162.735233]  report_bug+0xb4/0x158
> [  162.738377]  bug_handler+0x30/0x88
> [  162.740810]  brk_handler+0xd8/0x198
> [  162.747677]  do_debug_exception+0x9c/0x190
> [  162.752011]  el1_dbg+0x18/0x78
> [  162.755197]  generic_make_request_checks+0x670/0x750
> [  162.761260]  generic_make_request+0x2c/0x278
> [  162.764366]  submit_bio+0x54/0x208
> [  162.771402]  submit_bio_wait+0x68/0xa0
> [  162.775053]  blkdev_issue_flush+0x8c/0xc8
> [  162.780520]  ext4_sync_file+0x220/0x330
> [  162.784815]  vfs_fsync_range+0x48/0xc0
> [  162.788243]  dio_complete+0x1fc/0x220
> [  162.792071]  dio_bio_end_aio+0xf0/0x130
> [  162.794820]  bio_endio+0xe8/0xf8
> [  162.796203]  blk_update_request+0x80/0x2e8
> [  162.800883]  blk_mq_end_request+0x20/0x70
> [  162.803072]  virtblk_request_done+0x24/0x30
> [  162.806877]  __blk_mq_complete_request+0x100/0x1b0
> [  162.809298]  blk_mq_complete_request+0x60/0x98
> [  162.811408]  virtblk_done+0x70/0xf8
> [  162.812461]  vring_interrupt+0x38/0x50
> [  162.813467]  __handle_irq_event_percpu+0x5c/0x148
> [  162.814657]  handle_irq_event_percpu+0x34/0x88
> [  162.815730]  handle_irq_event+0x48/0x78
> [  162.816856]  handle_fasteoi_irq+0xc0/0x198
> [  162.817946]  generic_handle_irq+0x24/0x38
> [  162.818948]  __handle_domain_irq+0x84/0xf0
> [  162.820056]  gic_handle_irq+0x58/0xa8
> [  162.821013]  el1_irq+0xb4/0x130
> [  162.826572]  arch_cpu_idle+0x18/0x28
> [  162.827304]  default_idle_call+0x1c/0x34
> [  162.828315]  do_idle+0x17c/0x1f0
> [  162.829174]  cpu_startup_entry+0x20/0x28
> [  162.830072]  rest_init+0x250/0x260
> [  162.830872]  start_kernel+0x3f0/0x41c
> [  162.831627] SMP: stopping secondary CPUs
> [  165.929235] SMP: failed to stop secondary CPUs 0,2
> [  165.949001] Kernel Offset: disabled
> [  165.963649] CPU features: 0x1802082
> [  165.969405] Memory Limit: none
> [  165.974480] Rebooting in 86400 seconds..
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[-- Attachment #2: 0001-direct-io-Fix-sleep-in-atomic-due-to-sync-AIO.patch --]
[-- Type: text/x-patch, Size: 1730 bytes --]

From 501d97ed88f5020a55a0de4d546df5ad11461cea Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 26 Feb 2018 11:36:52 +0100
Subject: [PATCH] direct-io: Fix sleep in atomic due to sync AIO

Commit e864f39569f4 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
way for direct IO to become synchronous and thus trigger fsync from the
IO completion handler. Then commit 9830f4be159b "fs: Use RWF_* flags for
AIO operations" allowed these flags to be set for AIO as well. However
that commit forgot to update the condition checking whether the IO
completion handling should be defered to a workqueue and thus AIO DIO
with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
sleep in atomic.

Fix the problem by checking directly iocb flags (the same way as it is
done in dio_complete()) instead of checking all conditions that could
lead to IO being synchronous.

CC: Christoph Hellwig <hch@lst.de>
CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
CC: stable@vger.kernel.org
Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 9830f4be159b29399d107bffb99e0132bc5aedd4
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/direct-io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index a0ca9e48e993..1357ef563893 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1274,8 +1274,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
 		retval = 0;
-		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
-		    IS_SYNC(iocb->ki_filp->f_mapping->host))
+		if (iocb->ki_flags & IOCB_DSYNC)
 			retval = dio_set_defer_completion(dio);
 		else if (!dio->inode->i_sb->s_dio_done_wq) {
 			/*
-- 
2.13.6


[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related

* Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context
From: Mark Rutland @ 2018-02-26 11:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Theodore Ts'o, Michael S. Tsirkin, linux-kernel,
	virtualization, linux-block, Jan Kara, linux-ext4
In-Reply-To: <20180226105256.jagidoki6vsrvzb4@quack2.suse.cz>

On Mon, Feb 26, 2018 at 11:52:56AM +0100, Jan Kara wrote:
> On Fri 23-02-18 15:47:36, Mark Rutland wrote:
> > Hi all,
> > 
> > While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
> > number of splats in the block layer:
> > 
> > * inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
> >   jbd2_trans_will_send_data_barrier
> > 
> > * BUG: sleeping function called from invalid context at mm/mempool.c:320
> > 
> > * WARNING: CPU: 0 PID: 0 at block/blk.h:297 generic_make_request_checks+0x670/0x750
> > 
> > ... I've included the full splats at the end of the mail.
> > 
> > These all happen in the context of the virtio block IRQ handler, so I
> > wonder if this calls something that doesn't expect to be called from IRQ
> > context. Is it valid to call blk_mq_complete_request() or
> > blk_mq_end_request() from an IRQ handler?
> 
> No, it's likely a bug in detection whether IO completion should be deferred
> to a workqueue or not. Does attached patch fix the problem? I don't see
> exactly this being triggered by the syzkaller but it's close enough :)
> 
> 								Honza

That seems to be it!

With the below patch applied, I can't trigger the bug after ~10 minutes,
whereas prior to the patch I can trigger it in ~10 seconds. I'll leave
that running for a while just in case there's another part to the
problem, but FWIW:

Tested-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.

> From 501d97ed88f5020a55a0de4d546df5ad11461cea Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Mon, 26 Feb 2018 11:36:52 +0100
> Subject: [PATCH] direct-io: Fix sleep in atomic due to sync AIO
> 
> Commit e864f39569f4 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
> way for direct IO to become synchronous and thus trigger fsync from the
> IO completion handler. Then commit 9830f4be159b "fs: Use RWF_* flags for
> AIO operations" allowed these flags to be set for AIO as well. However
> that commit forgot to update the condition checking whether the IO
> completion handling should be defered to a workqueue and thus AIO DIO
> with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
> sleep in atomic.
> 
> Fix the problem by checking directly iocb flags (the same way as it is
> done in dio_complete()) instead of checking all conditions that could
> lead to IO being synchronous.
> 
> CC: Christoph Hellwig <hch@lst.de>
> CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
> CC: stable@vger.kernel.org
> Reported-by: Mark Rutland <mark.rutland@arm.com>
> Fixes: 9830f4be159b29399d107bffb99e0132bc5aedd4
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/direct-io.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index a0ca9e48e993..1357ef563893 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1274,8 +1274,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 */
>  	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
>  		retval = 0;
> -		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +		if (iocb->ki_flags & IOCB_DSYNC)
>  			retval = dio_set_defer_completion(dio);
>  		else if (!dio->inode->i_sb->s_dio_done_wq) {
>  			/*
> -- 
> 2.13.6
> 

^ permalink raw reply

* Re: v4.16-rc2: virtio-block + ext4 lockdep splats / sleeping from invalid context
From: Jan Kara @ 2018-02-26 12:44 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Jens Axboe, Jan Kara, Michael S. Tsirkin, linux-kernel,
	virtualization, linux-block, Jan Kara, Theodore Ts'o,
	linux-ext4
In-Reply-To: <20180226113818.v544mmjsavc4x5q2@lakrids.cambridge.arm.com>

On Mon 26-02-18 11:38:19, Mark Rutland wrote:
> On Mon, Feb 26, 2018 at 11:52:56AM +0100, Jan Kara wrote:
> > On Fri 23-02-18 15:47:36, Mark Rutland wrote:
> > > Hi all,
> > > 
> > > While fuzzing arm64/v4.16-rc2 with syzkaller, I simultaneously hit a
> > > number of splats in the block layer:
> > > 
> > > * inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-R} usage in
> > >   jbd2_trans_will_send_data_barrier
> > > 
> > > * BUG: sleeping function called from invalid context at mm/mempool.c:320
> > > 
> > > * WARNING: CPU: 0 PID: 0 at block/blk.h:297 generic_make_request_checks+0x670/0x750
> > > 
> > > ... I've included the full splats at the end of the mail.
> > > 
> > > These all happen in the context of the virtio block IRQ handler, so I
> > > wonder if this calls something that doesn't expect to be called from IRQ
> > > context. Is it valid to call blk_mq_complete_request() or
> > > blk_mq_end_request() from an IRQ handler?
> > 
> > No, it's likely a bug in detection whether IO completion should be deferred
> > to a workqueue or not. Does attached patch fix the problem? I don't see
> > exactly this being triggered by the syzkaller but it's close enough :)
> > 
> > 								Honza
> 
> That seems to be it!
> 
> With the below patch applied, I can't trigger the bug after ~10 minutes,
> whereas prior to the patch I can trigger it in ~10 seconds. I'll leave
> that running for a while just in case there's another part to the
> problem, but FWIW:
> 
> Tested-by: Mark Rutland <mark.rutland@arm.com>

Thanks for testing! Sent the patch to Jens for inclusion.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Stephen Hemminger @ 2018-02-27  1:02 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
	Jakub Kicinski, Samudrala, Sridhar, Alexander Duyck,
	virtualization, Siwei Liu, Netdev, David Miller
In-Reply-To: <20180226071924.GA2063@nanopsycho>

On Mon, 26 Feb 2018 08:19:24 +0100
Jiri Pirko <jiri@resnulli.us> wrote:

> Sat, Feb 24, 2018 at 12:59:04AM CET, stephen@networkplumber.org wrote:
> >On Thu, 22 Feb 2018 13:30:12 -0800
> >Alexander Duyck <alexander.duyck@gmail.com> wrote:
> >  
> >> > Again, I undertand your motivation. Yet I don't like your solution.
> >> > But if the decision is made to do this in-driver bonding. I would like
> >> > to see it baing done some generic way:
> >> > 1) share the same "in-driver bonding core" code with netvsc
> >> >    put to net/core.
> >> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >> >    like active-backup mode only, one vf, one backup, vf netdev type
> >> >    check (so noone could enslave a tap or anything else)
> >> > If user would need something more, he should employ team/bond.    
> >
> >Sharing would be good, but netvsc world would really like to only have
> >one visible network device.  
> 
> Why do you mind? All would be the same, there would be just another
> netdevice unused by the vm user (same as the vf netdev).
> 

I mind because our requirement is no changes to userspace.
No special udev rules, no bonding script, no setup.

Things like cloudinit running on current distro's expect to see a single
eth0.  The VF device show up can also be an issue because distro's have
stupid rules like Network Manager trying to start DHCP on every interface.
We deal with that now by doing stuff like udev rules to get it to stop
but that is still causing user errors.

^ permalink raw reply

* Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
From: Michael S. Tsirkin @ 2018-02-27  1:18 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Duyck, Alexander H, virtio-dev, Jiri Pirko, Jakub Kicinski,
	Samudrala, Sridhar, Alexander Duyck, virtualization, Siwei Liu,
	Netdev, David Miller
In-Reply-To: <20180226170218.6b1dcf22@xeon-e3>

On Mon, Feb 26, 2018 at 05:02:18PM -0800, Stephen Hemminger wrote:
> On Mon, 26 Feb 2018 08:19:24 +0100
> Jiri Pirko <jiri@resnulli.us> wrote:
> 
> > Sat, Feb 24, 2018 at 12:59:04AM CET, stephen@networkplumber.org wrote:
> > >On Thu, 22 Feb 2018 13:30:12 -0800
> > >Alexander Duyck <alexander.duyck@gmail.com> wrote:
> > >  
> > >> > Again, I undertand your motivation. Yet I don't like your solution.
> > >> > But if the decision is made to do this in-driver bonding. I would like
> > >> > to see it baing done some generic way:
> > >> > 1) share the same "in-driver bonding core" code with netvsc
> > >> >    put to net/core.
> > >> > 2) the "in-driver bonding core" will strictly limit the functionality,
> > >> >    like active-backup mode only, one vf, one backup, vf netdev type
> > >> >    check (so noone could enslave a tap or anything else)
> > >> > If user would need something more, he should employ team/bond.    
> > >
> > >Sharing would be good, but netvsc world would really like to only have
> > >one visible network device.  
> > 
> > Why do you mind? All would be the same, there would be just another
> > netdevice unused by the vm user (same as the vf netdev).
> > 
> 
> I mind because our requirement is no changes to userspace.
> No special udev rules, no bonding script, no setup.

Agreed. It is mostly fine from this point of view, except that you need
to know to skip the slaves.  Maybe we could look at some kind of
trick e.g. pretending link is down for slaves?

> Things like cloudinit running on current distro's expect to see a single
> eth0.  The VF device show up can also be an issue because distro's have
> stupid rules like Network Manager trying to start DHCP on every interface.
> We deal with that now by doing stuff like udev rules to get it to stop
> but that is still causing user errors.

So the ideal of a single net device isn't achieved by netvsc.

Since you have scripts to skip the PT device, can't they
hind the PV slave too? How do they identify the device to skip?

I agree it would be nice to have a way to hide the extra netdev
from userspace.

The benefit of the separation is that each slave device can
be configured with e.g. its own native ethtool commands for
optimum performance.

-- 
MST

^ permalink raw reply

* Re: [PATCH 0/7] drm/virtio: Checkpatch cleanup for virtio
From: Gerd Hoffmann @ 2018-02-27  6:47 UTC (permalink / raw)
  To: Rodrigo Siqueira
  Cc: David Airlie, kernel-janitors, linux-kernel, dri-devel,
	virtualization
In-Reply-To: <cover.1519343668.git.rodrigosiqueiramelo@gmail.com>

On Thu, Feb 22, 2018 at 08:59:33PM -0300, Rodrigo Siqueira wrote:
> This patchset fixes warnings and errors found by checkpatch.pl in the
> drm/virtio:

Added to drm-qemu queue, will land in drm-misc soon.

thanks,
  Gerd

^ permalink raw reply

* Re: [v2 1/1] xen, mm: Allow deferred page initialization for xen pv domains
From: Ingo Molnar @ 2018-02-27  6:59 UTC (permalink / raw)
  To: Pavel Tatashin
  Cc: mhocko, virtualization, linux-mm, steven.sistare, hpa,
	dan.j.williams, bhe, x86, akataria, daniel.m.jordan, mingo,
	xen-devel, bp, minipli, jinb.park7, luto, boris.ostrovsky, vbabka,
	jgross, labbott, linux-kernel, tglx, zhang.jia, hannes, akpm,
	mgorman, kirill.shutemov
In-Reply-To: <20180226160112.24724-2-pasha.tatashin@oracle.com>


* Pavel Tatashin <pasha.tatashin@oracle.com> wrote:

> Juergen Gross noticed that commit
> f7f99100d8d ("mm: stop zeroing memory during allocation in vmemmap")
> broke XEN PV domains when deferred struct page initialization is enabled.
> 
> This is because the xen's PagePinned() flag is getting erased from struct
> pages when they are initialized later in boot.
> 
> Juergen fixed this problem by disabling deferred pages on xen pv domains.
> It is desirable, however, to have this feature available as it reduces boot
> time. This fix re-enables the feature for pv-dmains, and fixes the problem
> the following way:
> 
> The fix is to delay setting PagePinned flag until struct pages for all
> allocated memory are initialized, i.e. until after free_all_bootmem().
> 
> A new x86_init.hyper op init_after_bootmem() is called to let xen know
> that boot allocator is done, and hence struct pages for all the allocated
> memory are now initialized. If deferred page initialization is enabled, the
> rest of struct pages are going to be initialized later in boot once
> page_alloc_init_late() is called.
> 
> xen_after_bootmem() walks page table's pages and marks them pinned.
> 
> Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
> ---
>  arch/x86/include/asm/x86_init.h |  2 ++
>  arch/x86/kernel/x86_init.c      |  1 +
>  arch/x86/mm/init_32.c           |  1 +
>  arch/x86/mm/init_64.c           |  1 +
>  arch/x86/xen/mmu_pv.c           | 38 ++++++++++++++++++++++++++------------
>  mm/page_alloc.c                 |  4 ----
>  6 files changed, 31 insertions(+), 16 deletions(-)

Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 4/6] x86: Consolidate PCI_MMCONFIG configs
From: Jan Kiszka @ 2018-02-27  7:19 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: jailhouse-dev, linux-pci, x86, Linux Kernel Mailing List,
	virtualization, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas,
	Thomas Gleixner
In-Reply-To: <CAHp75Vd3AsdOayA0FSaug6L4K+Xx5pk=Eeh4F5mZQsKiV3RxwA@mail.gmail.com>

On 2018-01-28 18:26, Andy Shevchenko wrote:
> On Mon, Jan 22, 2018 at 8:12 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> Not sure if those two worked by design or just by chance so far. In any
>> case, it's at least cleaner and clearer to express this in a single
>> config statement.
> 
> Congrats! You found by the way a bug in
> 
> commit e279b6c1d329e50b766bce96aacc197eae8a053b
> Author: Sam Ravnborg <sam@ravnborg.org>
> Date:   Tue Nov 6 20:41:05 2007 +0100
> 
>    x86: start unification of arch/x86/Kconfig.*
> 
> ...and proper fix seems to split PCI stuff to common + X86_32 only + X86_64 only
> 

Hmm, is that a change request on this patch?

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply

* Re: [PATCH 2/6] pci: Scan all functions when probing while running over Jailhouse
From: Jan Kiszka @ 2018-02-27  7:22 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: jailhouse-dev, Benedikt Spranger, linux-pci, x86,
	Linux Kernel Mailing List, virtualization, Ingo Molnar,
	H . Peter Anvin, Bjorn Helgaas, Thomas Gleixner
In-Reply-To: <CAHp75VcssoPE8ZCiHOt8-BKcponpr3xpzn42KZGx=Y+nMmZjYw@mail.gmail.com>

On 2018-02-23 14:23, Andy Shevchenko wrote:
> On Mon, Jan 22, 2018 at 8:12 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> 
>>  #include <linux/export.h>
>>  #include <linux/pci.h>
>>  #include <asm/pci_x86.h>
>> +#include <asm/jailhouse_para.h>
> 
> Keep it in order?
> 

Done.

> 
>>  #include <linux/acpi.h>
>>  #include <linux/irqdomain.h>
>>  #include <linux/pm_runtime.h>
>> +#include <linux/hypervisor.h>
> 
> Ditto.
> 

Despite the context suggesting it, this file has no ordering.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply

* Re: [PATCH 2/6] pci: Scan all functions when probing while running over Jailhouse
From: Jan Kiszka @ 2018-02-27  7:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: jailhouse-dev, Benedikt Spranger, linux-pci, x86,
	Linux Kernel Mailing List, virtualization, Ingo Molnar,
	H . Peter Anvin, Bjorn Helgaas, Thomas Gleixner
In-Reply-To: <20180222205753.GE16519@bhelgaas-glaptop.roam.corp.google.com>

On 2018-02-22 21:57, Bjorn Helgaas wrote:
> On Mon, Jan 22, 2018 at 07:12:46AM +0100, Jan Kiszka wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> PCI and PCIBIOS probing only scans devices at function number 0/8/16/...
>> Subdevices (e.g. multiqueue) have function numbers which are not a
>> multiple of 8.
> 
> Suggested text:
> 
>   Per PCIe r4.0, sec 7.5.1.1.9, multi-function devices are required to
>   have a function 0.  Therefore, Linux scans for devices at function 0
>   (devfn 0/8/16/...) and only scans for other functions if function 0
>   has its Multi-Function Device bit set or ARI or SR-IOV indicate
>   there are more functions.
>   
>   The Jailhouse hypervisor may pass individual functions of a
>   multi-function device to a guest without passing function 0, which
>   means a Linux guest won't find them.
> 
>   Change Linux PCI probing so it scans all function numbers when
>   running as a guest over Jailhouse.
>   
>   This is technically prohibited by the spec, so it is possible that
>   PCI devices without the Multi-Function Device bit set may have
>   unexpected behavior in response to this probe.
> 
>> The simple hypervisor Jailhouse passes subdevices directly w/o providing
>> a virtual PCI topology like KVM. As a consequence a PCI passthrough from
>> Jailhouse to a guest will not be detected by Linux.
>>
>> Based on patch by Benedikt Spranger, adding Jailhouse probing to avoid
>> changing the behavior in the absence of the hypervisor.
>>
>> CC: Benedikt Spranger <b.spranger@linutronix.de>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> 
> With subject change to:
> 
>   PCI: Scan all functions when running over Jailhouse
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> 

Thanks, all suggestions picked up for next round.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox