Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH 0/3] virtio: balloon: cleanups and a fix
From: Rusty Russell @ 2012-05-07  3:53 UTC (permalink / raw)
  To: Amit Shah, Michael S. Tsirkin; +Cc: Virtualization List
In-Reply-To: <20120426201554.GA5804@amit.redhat.com>

On Fri, 27 Apr 2012 01:45:54 +0530, Amit Shah <amit.shah@redhat.com> wrote:
> On (Thu) 26 Apr 2012 [23:07:47], Michael S. Tsirkin wrote:
> > On Fri, Apr 27, 2012 at 12:45:54AM +0530, Amit Shah wrote:
> > > Hello,
> > > 
> > > The main fix is to update the host with the current balloon value on
> > > module removal after deflating the balloon.  Without the fix, the host
> > > has the wrong idea of the ballooned memory in the guest.  This is
> > > patch 2.
> > > 
> > > Patches 1 and 3 are cleanups with no effective code change.
> > 
> > 
> > better to just do fixes for 3.4. can you reorder pls?
> 
> Well if you just pick patch 2, it'll work fine.  There's no context
> change there.

1 and 3 applied.

Thanks!
Rusty.

^ permalink raw reply

* Re: virtio message framing
From: Rusty Russell @ 2012-05-07  3:12 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael S. Tsirkin, Linux Virtualization
In-Reply-To: <4F884AC2.9030202@us.ibm.com>

On Fri, 13 Apr 2012 10:48:18 -0500, Anthony Liguori <aliguori@us.ibm.com> wrote:
> On 04/13/2012 09:50 AM, Stefan Hajnoczi wrote:
> > The virtio specification says:
> >
> > "The descriptors used for a buff^[er should not eff^[ect the semantics
> > of the message,
> > except for the total length of the bu^[ffer"
> >
> > and
> >
> > "In particular, no implementation should use the descriptor boundaries
> > to determine the size of any header in a request"
> 
> This was the noble intention but all of the implementations actually rely on 
> boundary sizes.
> 
> Both QEMU and lguest rely on boundary sizes.  We've removed some of it in 
> virtio-net in QEMU but it still looks like it's there for virtio-blk.

I will fix lguest.  It's a poor excuse to say that lguest doens't have
an ABI, so it can take these shortcuts; it's also the exemplar.

> kvm tool also makes this assumption.
> 
> > Why should descriptor layout not be specified?
> >
> > It seems that implementing arbitrary descriptor layout support (e.g.
> > 1-byte descriptors) requires more code and makes input validation
> > harder.
> >
> > Why bother with the flexibility of unspecified descriptor layouts?  As
> > long as the layout is specified clearly it makes everyone's lives
> > easier to use a strict descriptor layout.
> 
> I hate to just change the spec here but I don't see a better option.

For example, in the net code, we could often pack the virtio_net header
into the skb and save a descriptor.

We didn't do this because it would break qemu.

Added to TODO list: a new DEBUG config option which breaks up virtio
descriptors.

Thanks,
Rusty.

^ permalink raw reply

* Submission Deadline Extension
From: VHPC 12 @ 2012-05-06 17:32 UTC (permalink / raw)
  To: virtualization

we apologize if you receive multiple copies of this CFP

===================================================================

CALL FOR PAPERS

7th Workshop on

Virtualization in High-Performance Cloud Computing

VHPC '12

as part of Euro-Par 2012, Rhodes Island, Greece

===================================================================

Date: August 28, 2012

Workshop URL: http://vhpc.org

SUBMISSION DEADLINE:

June 11, 2012 - Full paper submission (extended)


SCOPE:

Virtualization has become a common abstraction layer in modern
data centers, enabling resource owners to manage complex
infrastructure independently of their applications. Conjointly,
virtualization is becoming a driving technology for a manifold of
industry grade IT services. The cloud concept includes the notion
of a separation between resource owners and users, adding  services
such as hosted application frameworks and queueing. Utilizing the
same infrastructure, clouds carry significant potential for use in
high-performance scientific computing. The ability of clouds to provide
for requests and releases of vast computing resources dynamically and
close to the marginal cost of providing the services is unprecedented in
the history of scientific and commercial computing.

Distributed computing concepts that leverage federated resource
access are popular within the grid community, but have not seen
previously desired deployed levels so far. Also, many of the scientific
data centers have not adopted virtualization or cloud concepts yet.

This workshop aims to bring together industrial providers with the
scientific community in order to foster discussion, collaboration
and mutual exchange of knowledge and experience.

The workshop will be one day in length, composed of 20 min
paper presentations, each followed by 10 min discussion sections.
Presentations may be accompanied by interactive demonstrations.


TOPICS

Topics of interest include, but are not limited to:

Higher-level cloud architectures, focusing on issues such as:
- Languages for describing highly-distributed compute jobs
- Workload characterization for VM-based environments
- Optimized communication libraries/protocols in the cloud
- Cross-layer optimization of numeric algorithms on VM infrastructure
- System and process/bytecode VM convergence
- Cloud frameworks and API sets
- Checkpointing/migration of large compute jobs
- Instrumentation interfaces and languages
- VMM performance (auto-)tuning on various load types
- Cloud reliability, fault-tolerance, and security
- Software as a Service (SaaS) architectures
- Research and education use cases
- Virtualization in cloud, cluster and grid environments
- Cross-layer VM optimizations
- Cloud use cases including optimizations
- VM-based cloud performance modelling
- Performance and cost modelling

Lower-level design challenges for Hypervisors, VM-aware I/O devices,
hardware accelerators or filesystems in VM environments, especially:
- Cloud, grid and distributed filesystems
- Hardware for I/O virtualization (storage/network/accelerators)
- Storage and network I/O subsystems in virtualized environments
- Novel software approaches to I/O virtualization
- Paravirtualized I/O subsystems for modified/unmodified guests
- Virtualization-aware cluster interconnects
- Direct device assignment
- NUMA-aware subsystems in virtualized environments
- Hardware Accelerators in virtualization (GPUs/FPGAs)
- Hardware extensions for virtualization
- VMMs/Hypervisors for embedded systems

Data Center management methods, including:
- QoS and and service levels
- VM cloud and cluster distribution algorithms
- VM load-balancing in Clouds
- Hypervisor extensions and tools for cluster and grid computing
- Fault tolerant VM environments
- Virtual machine monitor platforms
- Management, deployment and monitoring of VM-based environments
- Cluster provisioning in the Cloud


PAPER SUBMISSION

Papers submitted to the workshop will be reviewed by at least two
members of the program committee and external reviewers. Submissions
should include abstract, key words, the e-mail address of the
corresponding author, and must not exceed 10 pages, including tables
and figures at a main font size no smaller than 11 point. Submission
of a paper should be regarded as a commitment that, should the paper
be accepted, at least one of the authors will register and attend the
conference to present the work.

Accepted papers will be published in the Springer LNCS series - the
format must be according to the Springer LNCS Style. Initial
submissions are in PDF; authors of accepted papers will be requested
to provide source files.

Format Guidelines: http://www.springer.de/comp/lncs/authors.html
Style template:
ftp://ftp.springer.de/pub/tex/latex/llncs/latex2e/llncs2e.zip
Abstract Submission Link: http://edas.info/newPaper.php?c=11943


IMPORTANT DATES

Rolling abstract submission
June 11, 2012 - Full paper submission (extended)
June 29, 2012 - Acceptance notification
July 20, 2012 - Camera-ready version due
August 28, 2012 - Workshop Date


CHAIR

Michael Alexander (chair), TU Wien, Austria
Gianluigi Zanetti (co-chair), CRS4, Italy
Anastassios Nanos (co-chair), NTUA, Greece


PROGRAM COMMITTEE

Paolo Anedda, CRS4, Italy
Giovanni Busonera, CRS4, Italy
Brad Calder, Microsoft, USA
Roberto Canonico, University of Napoli Federico II, Italy
Tommaso Cucinotta, Alcatel-Lucent Bell Labs, Ireland
Werner Fischer, Thomas-Krenn AG, Germany
William Gardner, University of Guelph, USA
Marcus Hardt, Forschungszentrum Karlsruhe, Germany
Sverre Jarp, CERN, Switzerland
Shantenu Jha, Louisiana State University, USA
Xuxian Jiang, NC State, USA
Nectarios Koziris, National Technical University of Athens, Greece
Simone Leo, CRS4, Italy
Ignacio Llorente, Universidad Complutense de Madrid, Spain
Naoya Maruyama, Tokyo Institute of Technology, Japan
Jean-Marc Menaud, Ecole des Mines de Nantes, France
Dimitrios Nikolopoulos, Foundation for Research&Technology Hellas, Greece
Jose Renato Santos, HP Labs, USA
Walter Schwaiger, TU Wien, Austria
Yoshio Turner, HP Labs, USA
Kurt Tutschku, University of Vienna, Austria
Lizhe Wang, Indiana University, USA
Chao-Tung Yang, Tunghai University, Taiwan


DURATION: Workshop Duration is one day.


GENERAL INFORMATION

The workshop will be held as part of Euro-Par 2012.

Euro-Par 2012: http://europar2012.cti.gr/

^ permalink raw reply

* Re: [PATCH v2] virtio-blk: Fix hot-unplug race in remove method
From: Michael S. Tsirkin @ 2012-05-04 13:24 UTC (permalink / raw)
  To: Asias He; +Cc: Jens Axboe, kvm, virtualization, Chris Mason
In-Reply-To: <4FA39559.2000705@redhat.com>

On Fri, May 04, 2012 at 04:37:45PM +0800, Asias He wrote:
> On 05/03/2012 03:56 PM, Michael S. Tsirkin wrote:
> >On Thu, May 03, 2012 at 03:30:47PM +0800, Asias He wrote:
> >>If we reset the virtio-blk device before the requests already dispatched
> >>to the virtio-blk driver from the block layer are finised, we will stuck
> >>in blk_cleanup_queue() and the remove will fail.
> >>
> >>blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
> >>before DEAD marking. However it will never success if the device is
> >>already stopped. We'll have q->in_flight[]>  0, so the drain will not
> >>finish.
> >>
> >>How to reproduce the race:
> >>1. hot-plug a virtio-blk device
> >>2. keep reading/writing the device in guest
> >>3. hot-unplug while the device is busy serving I/O
> >>
> >>Changes in v2:
> >>- Drop req_in_flight
> >>- Use virtqueue_detach_unused_buf to get request dispatched to driver
> >>
> >>Signed-off-by: Asias He<asias@redhat.com>
> >>---
> >>  drivers/block/virtio_blk.c |   16 +++++++++++++---
> >>  1 file changed, 13 insertions(+), 3 deletions(-)
> >>
> >>diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> >>index 72fe55d..670c28f 100644
> >>--- a/drivers/block/virtio_blk.c
> >>+++ b/drivers/block/virtio_blk.c
> >>@@ -443,7 +443,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> >>  	if (err)
> >>  		goto out_free_vblk;
> >>
> >>-	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
> >>+	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
> >>  	if (!vblk->pool) {
> >>  		err = -ENOMEM;
> >>  		goto out_free_vq;
> >
> >Would be a bit easier to review if whitespace changes
> >are avoided, and done in a separate patch targeting 3.5.
> 
> Well, I will cook another patch for this.
> 
> >>@@ -576,20 +576,30 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
> >>  {
> >>  	struct virtio_blk *vblk = vdev->priv;
> >>  	int index = vblk->index;
> >>+	struct virtblk_req *vbr;
> >>
> >>  	/* Prevent config work handler from accessing the device. */
> >>  	mutex_lock(&vblk->config_lock);
> >>  	vblk->config_enable = false;
> >>  	mutex_unlock(&vblk->config_lock);
> >>
> >>+	/* Abort all request on the queue. */
> >
> >All requests
> >
> >Also, the comment isn't
> >really helpful. Want to explain why we abort
> >them all here? Won't they be drained
> >by later detach code?
> 
> blk_cleanup_queue is trying to drain the queue by calling
> request_fn, which is do_virtblk_request in this case. As we already
> stopped the device when calling blk_cleanup_queue, the drain might
> fail if do_req fails.
> 
> blk_cleanup_queue
>    blk_drain_queue
>        while (true)
>           __blk_run_queue
>             q->request_fn(q);
> 
> >>+	blk_abort_queue(vblk->disk->queue);
> 
> And now, I realized that using blk_abort_queue here to abort the
> queue is not right. it is used for timeout handling
> (block/blk-timeout.c).
> 
> [ CC'ing Jens and Chris ]
> 
> I suspect the btrfs code is using this in the wrong way too:
> 
> btrfs_abort_devices() {
>         list_for_each_entry_rcu(dev, head, dev_list) {
>                 blk_abort_queue(dev->bdev->bd_disk->queue);
>         }
> }
> 
> >>+	del_gendisk(vblk->disk);
> >>+
> >>  	/* Stop all the virtqueues. */
> >>  	vdev->config->reset(vdev);
> >>-
> >>  	flush_work(&vblk->config_work);
> >>
> >>-	del_gendisk(vblk->disk);
> >
> >Is there a reason you move del_gendisk to before reset?
> >Is it safe to del_gendisk while we might
> >still be getting callbacks from the device?
> 
> The original idea was to make the block layer stop sending request
> to driver asap. This is wrong since virtblk_config_changed_work
> might access vblk->disk.

Maybe it's ok - vlk->disk isn't removed until put_disk.
Needs some thought.
For example we must make sure config requests
do not race with reset.
But it should be a separate patch anyway.


> >>+	/* Abort request dispatched to driver. */
> >>+	while ((vbr = virtqueue_detach_unused_buf(vblk->vq))) {
> >>+		blk_abort_request(vbr->req);
> >>+		mempool_free(vbr, vblk->pool);
> >>+	}
> >>+
> >>  	blk_cleanup_queue(vblk->disk->queue);
> >>  	put_disk(vblk->disk);
> >>+
> >>  	mempool_destroy(vblk->pool);
> >>  	vdev->config->del_vqs(vdev);
> >>  	kfree(vblk);
> >>--
> >>1.7.10
> >--
> >To unsubscribe from this list: send the line "unsubscribe kvm" in
> >the body of a message to majordomo@vger.kernel.org
> >More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> -- 
> Asias

^ permalink raw reply

* [PATCH v3] virtio-blk: Fix hot-unplug race in remove method
From: Asias He @ 2012-05-04 12:22 UTC (permalink / raw)
  To: virtualization, Rusty Russell, Michael S. Tsirkin; +Cc: kvm

If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finised, we will stuck
in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before DEAD marking. However it will never success if the device is
already stopped. We'll have q->in_flight[] > 0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Test:
~1000 rounds of hot-plug/hot-unplug test passed with this patch.

Changes in v3:
- Drop blk_abort_queue and blk_abort_request
- Use __blk_end_request_all to complete request dispatched to driver

Changes in v2:
- Drop req_in_flight
- Use virtqueue_detach_unused_buf to get request dispatched to driver

Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/block/virtio_blk.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 72fe55d..693187d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -576,6 +576,8 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk = vdev->priv;
 	int index = vblk->index;
+	struct virtblk_req *vbr;
+	unsigned long flags;
 
 	/* Prevent config work handler from accessing the device. */
 	mutex_lock(&vblk->config_lock);
@@ -588,6 +590,15 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 	flush_work(&vblk->config_work);
 
 	del_gendisk(vblk->disk);
+
+	/* Abort requests dispatched to driver. */
+	spin_lock_irqsave(&vblk->lock, flags);
+	while ((vbr = virtqueue_detach_unused_buf(vblk->vq))) {
+		__blk_end_request_all(vbr->req, -EIO);
+		mempool_free(vbr, vblk->pool);
+	}
+	spin_unlock_irqrestore(&vblk->lock, flags);
+
 	blk_cleanup_queue(vblk->disk->queue);
 	put_disk(vblk->disk);
 	mempool_destroy(vblk->pool);
-- 
1.7.10.1

^ permalink raw reply related

* [PATCH v3] virtio-blk: Fix hot-unplug race in remove method
From: Asias He @ 2012-05-04  8:39 UTC (permalink / raw)
  To: virtualization, Rusty Russell, Michael S. Tsirkin; +Cc: kvm

If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finised, we will stuck
in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before DEAD marking. However it will never success if the device is
already stopped. We'll have q->in_flight[] > 0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Test:
~1000 rounds of hot-plug/hot-unplug test passed with this patch.

Changes in v3:
- Drop blk_abort_queue and blk_abort_request
- Use __blk_end_request_all to complete request dispatched to driver

Changes in v2:
- Drop req_in_flight
- Use virtqueue_detach_unused_buf to get request dispatched to driver

Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/block/virtio_blk.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 72fe55d..693187d 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -576,6 +576,8 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk = vdev->priv;
 	int index = vblk->index;
+	struct virtblk_req *vbr;
+	unsigned long flags;
 
 	/* Prevent config work handler from accessing the device. */
 	mutex_lock(&vblk->config_lock);
@@ -588,6 +590,15 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 	flush_work(&vblk->config_work);
 
 	del_gendisk(vblk->disk);
+
+	/* Abort requests dispatched to driver. */
+	spin_lock_irqsave(&vblk->lock, flags);
+	while ((vbr = virtqueue_detach_unused_buf(vblk->vq))) {
+		__blk_end_request_all(vbr->req, -EIO);
+		mempool_free(vbr, vblk->pool);
+	}
+	spin_unlock_irqrestore(&vblk->lock, flags);
+
 	blk_cleanup_queue(vblk->disk->queue);
 	put_disk(vblk->disk);
 	mempool_destroy(vblk->pool);
-- 
1.7.10.1

^ permalink raw reply related

* Re: [PATCH v2] virtio-blk: Fix hot-unplug race in remove method
From: Asias He @ 2012-05-04  8:37 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jens Axboe, kvm, virtualization, Chris Mason
In-Reply-To: <20120503075634.GK8266@redhat.com>

On 05/03/2012 03:56 PM, Michael S. Tsirkin wrote:
> On Thu, May 03, 2012 at 03:30:47PM +0800, Asias He wrote:
>> If we reset the virtio-blk device before the requests already dispatched
>> to the virtio-blk driver from the block layer are finised, we will stuck
>> in blk_cleanup_queue() and the remove will fail.
>>
>> blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
>> before DEAD marking. However it will never success if the device is
>> already stopped. We'll have q->in_flight[]>  0, so the drain will not
>> finish.
>>
>> How to reproduce the race:
>> 1. hot-plug a virtio-blk device
>> 2. keep reading/writing the device in guest
>> 3. hot-unplug while the device is busy serving I/O
>>
>> Changes in v2:
>> - Drop req_in_flight
>> - Use virtqueue_detach_unused_buf to get request dispatched to driver
>>
>> Signed-off-by: Asias He<asias@redhat.com>
>> ---
>>   drivers/block/virtio_blk.c |   16 +++++++++++++---
>>   1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 72fe55d..670c28f 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -443,7 +443,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>>   	if (err)
>>   		goto out_free_vblk;
>>
>> -	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
>> +	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
>>   	if (!vblk->pool) {
>>   		err = -ENOMEM;
>>   		goto out_free_vq;
>
> Would be a bit easier to review if whitespace changes
> are avoided, and done in a separate patch targeting 3.5.

Well, I will cook another patch for this.

>> @@ -576,20 +576,30 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
>>   {
>>   	struct virtio_blk *vblk = vdev->priv;
>>   	int index = vblk->index;
>> +	struct virtblk_req *vbr;
>>
>>   	/* Prevent config work handler from accessing the device. */
>>   	mutex_lock(&vblk->config_lock);
>>   	vblk->config_enable = false;
>>   	mutex_unlock(&vblk->config_lock);
>>
>> +	/* Abort all request on the queue. */
>
> All requests
>
> Also, the comment isn't
> really helpful. Want to explain why we abort
> them all here? Won't they be drained
> by later detach code?

blk_cleanup_queue is trying to drain the queue by calling request_fn, 
which is do_virtblk_request in this case. As we already stopped the 
device when calling blk_cleanup_queue, the drain might fail if do_req fails.

blk_cleanup_queue
    blk_drain_queue
        while (true)
           __blk_run_queue
             q->request_fn(q);

>> +	blk_abort_queue(vblk->disk->queue);

And now, I realized that using blk_abort_queue here to abort the queue 
is not right. it is used for timeout handling (block/blk-timeout.c).

[ CC'ing Jens and Chris ]

I suspect the btrfs code is using this in the wrong way too:

btrfs_abort_devices() {
         list_for_each_entry_rcu(dev, head, dev_list) {
                 blk_abort_queue(dev->bdev->bd_disk->queue);
         }
}

>> +	del_gendisk(vblk->disk);
>> +
>>   	/* Stop all the virtqueues. */
>>   	vdev->config->reset(vdev);
>> -
>>   	flush_work(&vblk->config_work);
>>
>> -	del_gendisk(vblk->disk);
>
> Is there a reason you move del_gendisk to before reset?
> Is it safe to del_gendisk while we might
> still be getting callbacks from the device?

The original idea was to make the block layer stop sending request to 
driver asap. This is wrong since virtblk_config_changed_work might 
access vblk->disk.

>> +	/* Abort request dispatched to driver. */
>> +	while ((vbr = virtqueue_detach_unused_buf(vblk->vq))) {
>> +		blk_abort_request(vbr->req);
>> +		mempool_free(vbr, vblk->pool);
>> +	}
>> +
>>   	blk_cleanup_queue(vblk->disk->queue);
>>   	put_disk(vblk->disk);
>> +
>>   	mempool_destroy(vblk->pool);
>>   	vdev->config->del_vqs(vdev);
>>   	kfree(vblk);
>> --
>> 1.7.10
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Asias

^ permalink raw reply

* Re: [PATCH RFC] virtio-net: remove useless disable on freeze
From: Michael S. Tsirkin @ 2012-05-03 11:08 UTC (permalink / raw)
  To: Amit Shah; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20120503105959.GM13421@amit.redhat.com>

On Thu, May 03, 2012 at 04:29:59PM +0530, Amit Shah wrote:
> On (Wed) 04 Apr 2012 [12:19:55], Michael S. Tsirkin wrote:
> > disable_cb is just an optimization: it
> > can not guarantee that there are no callbacks.
> 
> Even then, what's the harm in keeping it?  If indeed there's an
> attempt to raise an interrupt after the host has been notified, it
> will be suppressed.

It won't. It's not a guarantee, e.g. with event index on
it does nothing at all.

> Also, disable_cb seems to be used elsewhere in the virtio_net.c file,
> to suit similar purposes.
> 
> 		Amit

Where?

^ permalink raw reply

* Re: [PATCH RFC] virtio-net: remove useless disable on freeze
From: Amit Shah @ 2012-05-03 10:59 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20120404091954.GA3776@redhat.com>

On (Wed) 04 Apr 2012 [12:19:55], Michael S. Tsirkin wrote:
> disable_cb is just an optimization: it
> can not guarantee that there are no callbacks.

Even then, what's the harm in keeping it?  If indeed there's an
attempt to raise an interrupt after the host has been notified, it
will be suppressed.

Also, disable_cb seems to be used elsewhere in the virtio_net.c file,
to suit similar purposes.

		Amit

^ permalink raw reply

* Re: using cache for virtio allocations?
From: Michael S. Tsirkin @ 2012-05-03  9:02 UTC (permalink / raw)
  To: Sasha Levin; +Cc: kvm, virtualization
In-Reply-To: <CA+1xoqdRjPCpfhYhe0=m-6SktZ4q8Z01vxE7gG5+PpGQw29crA@mail.gmail.com>

On Thu, May 03, 2012 at 10:48:53AM +0200, Sasha Levin wrote:
> On Thu, May 3, 2012 at 10:44 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Thu, May 03, 2012 at 10:38:56AM +0200, Sasha Levin wrote:
> >> On Thu, May 3, 2012 at 9:32 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
> >> >> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> > Sasha, didn't you have a patch to allocate
> >> >> > things using cache in virtio core?
> >> >> > What happened to it?
> >> >> >
> >> >> > Thanks,
> >> >> > MST
> >> >>
> >> >> It got stuck due to several things, and I got sidetracked, sorry. Here
> >> >> are the outstanding issues:
> >> >>
> >> >> 1. Since now we can allocate a descriptor either using kmalloc or from
> >> >> the cache, we need a new flag in vring_desc to know how to free it, it
> >> >> seems a bit too intrusive,
> >> >> and I couldn't thing of a better
> >> >> alternative.
> >> >
> >> > Since that is guest visible it does not sound great, I agree.
> >> >
> >> > Three ideas:
> >> > 1. The logic looks at descriptor size so can we just read
> >> >   desc.len before free and rerun the same math?
> >>
> >> It'll break every time the value is changed (either by the user or by
> >> some dynamic algorithm thingie).
> >
> > Yes but did you intend to implement such complex logic?
> > If not let's not over-engineer.
> 
> I did intend to allow him to change the value while the device is
> running, if we don't want to allow that then it's easy.
> 
> >> > 2. For -net the requests are up to max_skb_frags + 2 in size, right?
> >> >   Does it make sense to just use cache for net, always?
> >> >   That would mean a per device flag.
> >>
> >> Yup, it could work.
> >>
> >> > 3. Allocate a bit more and stick extra data before the 1st descriptor.
> >>
> >> I guess it'll work, but it just seems a bit ugly :)
> >
> > An understatement.
> >
> >> >> 2. Rusty has pointed out that no one is going to modify the default
> >> >> value we set, and we don't really have a good default value to put
> >> >> there (at least, we haven't agreed on a specific value). Also, you
> >> >> have noted that it should be a per-device value, which complicates
> >> >> this question further since we probably want a different value for
> >> >> each device type.
> >> >>
> >> >> While the first one can be solved easily with a blessing from the
> >> >> maintainers, the second one will require testing on various platforms,
> >> >> configurations and devices to select either the best "magic" value, or
> >> >> the best algorithm to play with threshold.
> >> >
> >> > Not sure about platforms but for devices that's right.
> >> > But this really only means we only change what we tested.
> >> > eg see what is good for net and change net in a way
> >> > that others will keep using old code.
> >>
> >> It'll work only if there will be someone following up and actually
> >> testing it, since regular users won't be testing it at all (with it
> >> being defaulted to off and everything).
> >
> > Not sure I understand. Whatever patch gets applied will be
> > tested beforehand.
> 
> I thought you meant that we apply the patch with threshold set at
> 0/disabled, and based on future tests we will enable it for specific
> devices and set best values for threshold, no?

Exactly the opposite. I meant each driver sets the value
and we test it to find a good value. Drivers that don't
do anything use existing kmalloc code.

^ permalink raw reply

* Re: using cache for virtio allocations?
From: Sasha Levin @ 2012-05-03  8:48 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization
In-Reply-To: <20120503084431.GN8266@redhat.com>

On Thu, May 3, 2012 at 10:44 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, May 03, 2012 at 10:38:56AM +0200, Sasha Levin wrote:
>> On Thu, May 3, 2012 at 9:32 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
>> >> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> >> > Sasha, didn't you have a patch to allocate
>> >> > things using cache in virtio core?
>> >> > What happened to it?
>> >> >
>> >> > Thanks,
>> >> > MST
>> >>
>> >> It got stuck due to several things, and I got sidetracked, sorry. Here
>> >> are the outstanding issues:
>> >>
>> >> 1. Since now we can allocate a descriptor either using kmalloc or from
>> >> the cache, we need a new flag in vring_desc to know how to free it, it
>> >> seems a bit too intrusive,
>> >> and I couldn't thing of a better
>> >> alternative.
>> >
>> > Since that is guest visible it does not sound great, I agree.
>> >
>> > Three ideas:
>> > 1. The logic looks at descriptor size so can we just read
>> >   desc.len before free and rerun the same math?
>>
>> It'll break every time the value is changed (either by the user or by
>> some dynamic algorithm thingie).
>
> Yes but did you intend to implement such complex logic?
> If not let's not over-engineer.

I did intend to allow him to change the value while the device is
running, if we don't want to allow that then it's easy.

>> > 2. For -net the requests are up to max_skb_frags + 2 in size, right?
>> >   Does it make sense to just use cache for net, always?
>> >   That would mean a per device flag.
>>
>> Yup, it could work.
>>
>> > 3. Allocate a bit more and stick extra data before the 1st descriptor.
>>
>> I guess it'll work, but it just seems a bit ugly :)
>
> An understatement.
>
>> >> 2. Rusty has pointed out that no one is going to modify the default
>> >> value we set, and we don't really have a good default value to put
>> >> there (at least, we haven't agreed on a specific value). Also, you
>> >> have noted that it should be a per-device value, which complicates
>> >> this question further since we probably want a different value for
>> >> each device type.
>> >>
>> >> While the first one can be solved easily with a blessing from the
>> >> maintainers, the second one will require testing on various platforms,
>> >> configurations and devices to select either the best "magic" value, or
>> >> the best algorithm to play with threshold.
>> >
>> > Not sure about platforms but for devices that's right.
>> > But this really only means we only change what we tested.
>> > eg see what is good for net and change net in a way
>> > that others will keep using old code.
>>
>> It'll work only if there will be someone following up and actually
>> testing it, since regular users won't be testing it at all (with it
>> being defaulted to off and everything).
>
> Not sure I understand. Whatever patch gets applied will be
> tested beforehand.

I thought you meant that we apply the patch with threshold set at
0/disabled, and based on future tests we will enable it for specific
devices and set best values for threshold, no?

^ permalink raw reply

* Re: using cache for virtio allocations?
From: Michael S. Tsirkin @ 2012-05-03  8:44 UTC (permalink / raw)
  To: Sasha Levin; +Cc: kvm, virtualization
In-Reply-To: <CA+1xoqd5dnh6XKdwXd--7wvn_5+RUVBgUaRdwJXbpwXpd1ccaw@mail.gmail.com>

On Thu, May 03, 2012 at 10:38:56AM +0200, Sasha Levin wrote:
> On Thu, May 3, 2012 at 9:32 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
> >> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > Sasha, didn't you have a patch to allocate
> >> > things using cache in virtio core?
> >> > What happened to it?
> >> >
> >> > Thanks,
> >> > MST
> >>
> >> It got stuck due to several things, and I got sidetracked, sorry. Here
> >> are the outstanding issues:
> >>
> >> 1. Since now we can allocate a descriptor either using kmalloc or from
> >> the cache, we need a new flag in vring_desc to know how to free it, it
> >> seems a bit too intrusive,
> >> and I couldn't thing of a better
> >> alternative.
> >
> > Since that is guest visible it does not sound great, I agree.
> >
> > Three ideas:
> > 1. The logic looks at descriptor size so can we just read
> >   desc.len before free and rerun the same math?
> 
> It'll break every time the value is changed (either by the user or by
> some dynamic algorithm thingie).

Yes but did you intend to implement such complex logic?
If not let's not over-engineer.

> > 2. For -net the requests are up to max_skb_frags + 2 in size, right?
> >   Does it make sense to just use cache for net, always?
> >   That would mean a per device flag.
> 
> Yup, it could work.
> 
> > 3. Allocate a bit more and stick extra data before the 1st descriptor.
> 
> I guess it'll work, but it just seems a bit ugly :)

An understatement.

> >> 2. Rusty has pointed out that no one is going to modify the default
> >> value we set, and we don't really have a good default value to put
> >> there (at least, we haven't agreed on a specific value). Also, you
> >> have noted that it should be a per-device value, which complicates
> >> this question further since we probably want a different value for
> >> each device type.
> >>
> >> While the first one can be solved easily with a blessing from the
> >> maintainers, the second one will require testing on various platforms,
> >> configurations and devices to select either the best "magic" value, or
> >> the best algorithm to play with threshold.
> >
> > Not sure about platforms but for devices that's right.
> > But this really only means we only change what we tested.
> > eg see what is good for net and change net in a way
> > that others will keep using old code.
> 
> It'll work only if there will be someone following up and actually
> testing it, since regular users won't be testing it at all (with it
> being defaulted to off and everything).

Not sure I understand. Whatever patch gets applied will be
tested beforehand.

-- 
MST

^ permalink raw reply

* Re: [PATCH v2] virtio-blk: Fix hot-unplug race in remove method
From: Michael S. Tsirkin @ 2012-05-03  7:56 UTC (permalink / raw)
  To: Asias He; +Cc: kvm, virtualization
In-Reply-To: <1336030247-16323-1-git-send-email-asias@redhat.com>

On Thu, May 03, 2012 at 03:30:47PM +0800, Asias He wrote:
> If we reset the virtio-blk device before the requests already dispatched
> to the virtio-blk driver from the block layer are finised, we will stuck
> in blk_cleanup_queue() and the remove will fail.
> 
> blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
> before DEAD marking. However it will never success if the device is
> already stopped. We'll have q->in_flight[] > 0, so the drain will not
> finish.
> 
> How to reproduce the race:
> 1. hot-plug a virtio-blk device
> 2. keep reading/writing the device in guest
> 3. hot-unplug while the device is busy serving I/O
> 
> Changes in v2:
> - Drop req_in_flight
> - Use virtqueue_detach_unused_buf to get request dispatched to driver
> 
> Signed-off-by: Asias He <asias@redhat.com>
> ---
>  drivers/block/virtio_blk.c |   16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 72fe55d..670c28f 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -443,7 +443,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vblk;
>  
> -	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
> +	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
>  	if (!vblk->pool) {
>  		err = -ENOMEM;
>  		goto out_free_vq;

Would be a bit easier to review if whitespace changes
are avoided, and done in a separate patch targeting 3.5.

> @@ -576,20 +576,30 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
>  {
>  	struct virtio_blk *vblk = vdev->priv;
>  	int index = vblk->index;
> +	struct virtblk_req *vbr;
>  
>  	/* Prevent config work handler from accessing the device. */
>  	mutex_lock(&vblk->config_lock);
>  	vblk->config_enable = false;
>  	mutex_unlock(&vblk->config_lock);
>  
> +	/* Abort all request on the queue. */

All requests

Also, the comment isn't
really helpful. Want to explain why we abort
them all here? Won't they be drained
by later detach code?


> +	blk_abort_queue(vblk->disk->queue);
> +	del_gendisk(vblk->disk);
> +
>  	/* Stop all the virtqueues. */
>  	vdev->config->reset(vdev);
> -
>  	flush_work(&vblk->config_work);
>  
> -	del_gendisk(vblk->disk);

Is there a reason you move del_gendisk to before reset?
Is it safe to del_gendisk while we might
still be getting callbacks from the device?

> +	/* Abort request dispatched to driver. */
> +	while ((vbr = virtqueue_detach_unused_buf(vblk->vq))) {
> +		blk_abort_request(vbr->req);
> +		mempool_free(vbr, vblk->pool);
> +	}
> +
>  	blk_cleanup_queue(vblk->disk->queue);
>  	put_disk(vblk->disk);
> +
>  	mempool_destroy(vblk->pool);
>  	vdev->config->del_vqs(vdev);
>  	kfree(vblk);
> -- 
> 1.7.10

^ permalink raw reply

* Re: [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method
From: Asias He @ 2012-05-03  7:38 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization
In-Reply-To: <20120503052751.GE8266@redhat.com>

On 05/03/2012 01:27 PM, Michael S. Tsirkin wrote:
> On Thu, May 03, 2012 at 10:19:52AM +0800, Asias He wrote:
>> If we reset the virtio-blk device before the requests already dispatched
>> to the virtio-blk driver from the block layer are finised, we will stuck
>> in blk_cleanup_queue() and the remove will fail.
>>
>> blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
>> before DEAD marking. However it will never success if the device is
>> already stopped. We'll have q->in_flight[]>  0, so the drain will not
>> finish.
>>
>> How to reproduce the race:
>> 1. hot-plug a virtio-blk device
>> 2. keep reading/writing the device in guest
>> 3. hot-unplug while the device is busy serving I/O
>>
>> Signed-off-by: Asias He<asias@redhat.com>
>
> We used to do similar tracking in -net but dropped it all by using the
> tracking that virtio core does.  Can't blk do the same?  Isn't there
> some way to use virtqueue_detach_unused_buf for this instead?

It is much simpler to use virtqueue_detach_unused_buf. Thanks, Michael.

>> ---
>>   drivers/block/virtio_blk.c |   26 ++++++++++++++++++++++----
>>   1 file changed, 22 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
>> index 72fe55d..72b818b 100644
>> --- a/drivers/block/virtio_blk.c
>> +++ b/drivers/block/virtio_blk.c
>> @@ -46,6 +46,9 @@ struct virtio_blk
>>   	/* Ida index - used to track minor number allocations. */
>>   	int index;
>>
>> +	/* Number of pending requests dispatched to driver. */
>> +	int req_in_flight;
>> +
>>   	/* Scatterlist: can be too big for stack. */
>>   	struct scatterlist sg[/*sg_elems*/];
>>   };
>> @@ -95,6 +98,7 @@ static void blk_done(struct virtqueue *vq)
>>   		}
>>
>>   		__blk_end_request_all(vbr->req, error);
>> +		vblk->req_in_flight--;
>>   		mempool_free(vbr, vblk->pool);
>>   	}
>>   	/* In case queue is stopped waiting for more buffers. */
>> @@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)
>>
>>   	while ((req = blk_peek_request(q)) != NULL) {
>>   		BUG_ON(req->nr_phys_segments + 2>  vblk->sg_elems);
>> +		vblk->req_in_flight++;
>>
>>   		/* If this request fails, stop queue and wait for something to
>>   		   finish to restart it. */
>> @@ -443,7 +448,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>>   	if (err)
>>   		goto out_free_vblk;
>>
>> -	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
>> +	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
>>   	if (!vblk->pool) {
>>   		err = -ENOMEM;
>>   		goto out_free_vq;
>> @@ -466,6 +471,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>>
>>   	virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
>>
>> +	vblk->req_in_flight = 0;
>>   	vblk->disk->major = major;
>>   	vblk->disk->first_minor = index_to_minor(index);
>>   	vblk->disk->private_data = vblk;
>> @@ -576,22 +582,34 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
>>   {
>>   	struct virtio_blk *vblk = vdev->priv;
>>   	int index = vblk->index;
>> +	unsigned long flags;
>> +	int req_in_flight;
>>
>>   	/* Prevent config work handler from accessing the device. */
>>   	mutex_lock(&vblk->config_lock);
>>   	vblk->config_enable = false;
>>   	mutex_unlock(&vblk->config_lock);
>>
>> +	/* Abort all request on the queue. */
>> +	blk_abort_queue(vblk->disk->queue);
>> +	del_gendisk(vblk->disk);
>> +
>>   	/* Stop all the virtqueues. */
>>   	vdev->config->reset(vdev);
>> -
>> +	vdev->config->del_vqs(vdev);
>>   	flush_work(&vblk->config_work);
>>
>> -	del_gendisk(vblk->disk);
>> +	/* Wait requests dispatched to device driver to finish. */
>> +	do {
>> +		spin_lock_irqsave(&vblk->lock, flags);
>> +		req_in_flight = vblk->req_in_flight;
>> +		spin_unlock_irqrestore(&vblk->lock, flags);
>> +	} while (req_in_flight != 0);
>> +
>>   	blk_cleanup_queue(vblk->disk->queue);
>>   	put_disk(vblk->disk);
>> +
>>   	mempool_destroy(vblk->pool);
>> -	vdev->config->del_vqs(vdev);
>>   	kfree(vblk);
>>   	ida_simple_remove(&vd_index_ida, index);
>>   }
>> --
>> 1.7.10
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Asias

^ permalink raw reply

* Re: [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method
From: Asias He @ 2012-05-03  7:36 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Michael S. Tsirkin, kvm, virtualization
In-Reply-To: <CA+1xoqcKQmSQxvUR6syvseUow4AfcnKmDODW=j6t6gejmSi4NA@mail.gmail.com>

On 05/03/2012 01:02 PM, Sasha Levin wrote:
> On Thu, May 3, 2012 at 4:19 AM, Asias He<asias@redhat.com>  wrote:
>> @@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)
>>
>>         while ((req = blk_peek_request(q)) != NULL) {
>>                 BUG_ON(req->nr_phys_segments + 2>  vblk->sg_elems);
>> +               vblk->req_in_flight++;
>>
>>                 /* If this request fails, stop queue and wait for something to
>>                    finish to restart it. */
>
> This is being increased before we know if the request will actually be
> sent, so if do_req() fails afterwards, req_in_flight would be
> increased but the request will never be sent.
>
> Which means we won't be able to unplug the device ever.

Yes, you are right. This introduces another race. I could do 
vblk->req_in_flight++ right after blk_start_request(req) to avoid this 
race.

> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Asias

^ permalink raw reply

* Re: using cache for virtio allocations?
From: Michael S. Tsirkin @ 2012-05-03  7:32 UTC (permalink / raw)
  To: Sasha Levin; +Cc: kvm, virtualization
In-Reply-To: <CA+1xoqeWJ_nhoZQ+wRdZnB=b=mDFauWJgcjQ3-+ZorJt=bdoAQ@mail.gmail.com>

On Thu, May 03, 2012 at 07:51:18AM +0200, Sasha Levin wrote:
> On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > Sasha, didn't you have a patch to allocate
> > things using cache in virtio core?
> > What happened to it?
> >
> > Thanks,
> > MST
> 
> It got stuck due to several things, and I got sidetracked, sorry. Here
> are the outstanding issues:
> 
> 1. Since now we can allocate a descriptor either using kmalloc or from
> the cache, we need a new flag in vring_desc to know how to free it, it
> seems a bit too intrusive,
> and I couldn't thing of a better
> alternative.

Since that is guest visible it does not sound great, I agree.

Three ideas:
1. The logic looks at descriptor size so can we just read
   desc.len before free and rerun the same math?
2. For -net the requests are up to max_skb_frags + 2 in size, right?
   Does it make sense to just use cache for net, always?
   That would mean a per device flag.
3. Allocate a bit more and stick extra data before the 1st descriptor.

> 2. Rusty has pointed out that no one is going to modify the default
> value we set, and we don't really have a good default value to put
> there (at least, we haven't agreed on a specific value). Also, you
> have noted that it should be a per-device value, which complicates
> this question further since we probably want a different value for
> each device type.
> 
> While the first one can be solved easily with a blessing from the
> maintainers, the second one will require testing on various platforms,
> configurations and devices to select either the best "magic" value, or
> the best algorithm to play with threshold.

Not sure about platforms but for devices that's right.
But this really only means we only change what we tested.
eg see what is good for net and change net in a way
that others will keep using old code.

-- 
MST

^ permalink raw reply

* [PATCH v2] virtio-blk: Fix hot-unplug race in remove method
From: Asias He @ 2012-05-03  7:30 UTC (permalink / raw)
  To: virtualization, Rusty Russell, Michael S. Tsirkin; +Cc: kvm

If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finised, we will stuck
in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before DEAD marking. However it will never success if the device is
already stopped. We'll have q->in_flight[] > 0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Changes in v2:
- Drop req_in_flight
- Use virtqueue_detach_unused_buf to get request dispatched to driver

Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/block/virtio_blk.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 72fe55d..670c28f 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -443,7 +443,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vblk;
 
-	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
 	if (!vblk->pool) {
 		err = -ENOMEM;
 		goto out_free_vq;
@@ -576,20 +576,30 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk = vdev->priv;
 	int index = vblk->index;
+	struct virtblk_req *vbr;
 
 	/* Prevent config work handler from accessing the device. */
 	mutex_lock(&vblk->config_lock);
 	vblk->config_enable = false;
 	mutex_unlock(&vblk->config_lock);
 
+	/* Abort all request on the queue. */
+	blk_abort_queue(vblk->disk->queue);
+	del_gendisk(vblk->disk);
+
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
-
 	flush_work(&vblk->config_work);
 
-	del_gendisk(vblk->disk);
+	/* Abort request dispatched to driver. */
+	while ((vbr = virtqueue_detach_unused_buf(vblk->vq))) {
+		blk_abort_request(vbr->req);
+		mempool_free(vbr, vblk->pool);
+	}
+
 	blk_cleanup_queue(vblk->disk->queue);
 	put_disk(vblk->disk);
+
 	mempool_destroy(vblk->pool);
 	vdev->config->del_vqs(vdev);
 	kfree(vblk);
-- 
1.7.10

^ permalink raw reply related

* Re: using cache for virtio allocations?
From: Sasha Levin @ 2012-05-03  5:51 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, virtualization
In-Reply-To: <20120503052943.GF8266@redhat.com>

On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> Sasha, didn't you have a patch to allocate
> things using cache in virtio core?
> What happened to it?
>
> Thanks,
> MST

It got stuck due to several things, and I got sidetracked, sorry. Here
are the outstanding issues:

1. Since now we can allocate a descriptor either using kmalloc or from
the cache, we need a new flag in vring_desc to know how to free it, it
seems a bit too intrusive, and I couldn't thing of a better
alternative.

2. Rusty has pointed out that no one is going to modify the default
value we set, and we don't really have a good default value to put
there (at least, we haven't agreed on a specific value). Also, you
have noted that it should be a per-device value, which complicates
this question further since we probably want a different value for
each device type.

While the first one can be solved easily with a blessing from the
maintainers, the second one will require testing on various platforms,
configurations and devices to select either the best "magic" value, or
the best algorithm to play with threshold.

^ permalink raw reply

* using cache for virtio allocations?
From: Michael S. Tsirkin @ 2012-05-03  5:29 UTC (permalink / raw)
  To: Sasha Levin; +Cc: kvm, virtualization

Sasha, didn't you have a patch to allocate
things using cache in virtio core?
What happened to it?

Thanks,
MST

^ permalink raw reply

* Re: [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method
From: Michael S. Tsirkin @ 2012-05-03  5:27 UTC (permalink / raw)
  To: Asias He; +Cc: kvm, virtualization
In-Reply-To: <1336011592-24889-1-git-send-email-asias@redhat.com>

On Thu, May 03, 2012 at 10:19:52AM +0800, Asias He wrote:
> If we reset the virtio-blk device before the requests already dispatched
> to the virtio-blk driver from the block layer are finised, we will stuck
> in blk_cleanup_queue() and the remove will fail.
> 
> blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
> before DEAD marking. However it will never success if the device is
> already stopped. We'll have q->in_flight[] > 0, so the drain will not
> finish.
> 
> How to reproduce the race:
> 1. hot-plug a virtio-blk device
> 2. keep reading/writing the device in guest
> 3. hot-unplug while the device is busy serving I/O
> 
> Signed-off-by: Asias He <asias@redhat.com>

We used to do similar tracking in -net but dropped it all by using the
tracking that virtio core does.  Can't blk do the same?  Isn't there
some way to use virtqueue_detach_unused_buf for this instead?

> ---
>  drivers/block/virtio_blk.c |   26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 72fe55d..72b818b 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -46,6 +46,9 @@ struct virtio_blk
>  	/* Ida index - used to track minor number allocations. */
>  	int index;
>  
> +	/* Number of pending requests dispatched to driver. */
> +	int req_in_flight;
> +
>  	/* Scatterlist: can be too big for stack. */
>  	struct scatterlist sg[/*sg_elems*/];
>  };
> @@ -95,6 +98,7 @@ static void blk_done(struct virtqueue *vq)
>  		}
>  
>  		__blk_end_request_all(vbr->req, error);
> +		vblk->req_in_flight--;
>  		mempool_free(vbr, vblk->pool);
>  	}
>  	/* In case queue is stopped waiting for more buffers. */
> @@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)
>  
>  	while ((req = blk_peek_request(q)) != NULL) {
>  		BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> +		vblk->req_in_flight++;
>  
>  		/* If this request fails, stop queue and wait for something to
>  		   finish to restart it. */
> @@ -443,7 +448,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>  	if (err)
>  		goto out_free_vblk;
>  
> -	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
> +	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
>  	if (!vblk->pool) {
>  		err = -ENOMEM;
>  		goto out_free_vq;
> @@ -466,6 +471,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>  
>  	virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
>  
> +	vblk->req_in_flight = 0;
>  	vblk->disk->major = major;
>  	vblk->disk->first_minor = index_to_minor(index);
>  	vblk->disk->private_data = vblk;
> @@ -576,22 +582,34 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
>  {
>  	struct virtio_blk *vblk = vdev->priv;
>  	int index = vblk->index;
> +	unsigned long flags;
> +	int req_in_flight;
>  
>  	/* Prevent config work handler from accessing the device. */
>  	mutex_lock(&vblk->config_lock);
>  	vblk->config_enable = false;
>  	mutex_unlock(&vblk->config_lock);
>  
> +	/* Abort all request on the queue. */
> +	blk_abort_queue(vblk->disk->queue);
> +	del_gendisk(vblk->disk);
> +
>  	/* Stop all the virtqueues. */
>  	vdev->config->reset(vdev);
> -
> +	vdev->config->del_vqs(vdev);
>  	flush_work(&vblk->config_work);
>  
> -	del_gendisk(vblk->disk);
> +	/* Wait requests dispatched to device driver to finish. */
> +	do {
> +		spin_lock_irqsave(&vblk->lock, flags);
> +		req_in_flight = vblk->req_in_flight;
> +		spin_unlock_irqrestore(&vblk->lock, flags);
> +	} while (req_in_flight != 0);
> +
>  	blk_cleanup_queue(vblk->disk->queue);
>  	put_disk(vblk->disk);
> +
>  	mempool_destroy(vblk->pool);
> -	vdev->config->del_vqs(vdev);
>  	kfree(vblk);
>  	ida_simple_remove(&vd_index_ida, index);
>  }
> -- 
> 1.7.10

^ permalink raw reply

* Re: [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method
From: Sasha Levin @ 2012-05-03  5:02 UTC (permalink / raw)
  To: Asias He; +Cc: Michael S. Tsirkin, kvm, virtualization
In-Reply-To: <1336011592-24889-1-git-send-email-asias@redhat.com>

On Thu, May 3, 2012 at 4:19 AM, Asias He <asias@redhat.com> wrote:
> @@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)
>
>        while ((req = blk_peek_request(q)) != NULL) {
>                BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> +               vblk->req_in_flight++;
>
>                /* If this request fails, stop queue and wait for something to
>                   finish to restart it. */

This is being increased before we know if the request will actually be
sent, so if do_req() fails afterwards, req_in_flight would be
increased but the request will never be sent.

Which means we won't be able to unplug the device ever.

^ permalink raw reply

* [PATCH 2/2] virtio: Use ida to allocate virtio index
From: Asias He @ 2012-05-03  2:20 UTC (permalink / raw)
  To: virtualization, Rusty Russell, Michael S. Tsirkin; +Cc: kvm

Current index allocation in virtio is based on a monotonically
increasing variable "index". This means we'll run out of numbers
after a while. E.g. someone crazy doing this in host side.

while(1) {
	hot-plug a virtio device
	hot-unplug the virito devcie
}

Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/virtio/virtio.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 984c501..f355807 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -2,9 +2,10 @@
 #include <linux/spinlock.h>
 #include <linux/virtio_config.h>
 #include <linux/module.h>
+#include <linux/idr.h>
 
 /* Unique numbering for virtio devices. */
-static unsigned int dev_index;
+static DEFINE_IDA(virtio_index_ida);
 
 static ssize_t device_show(struct device *_d,
 			   struct device_attribute *attr, char *buf)
@@ -193,7 +194,11 @@ int register_virtio_device(struct virtio_device *dev)
 	dev->dev.bus = &virtio_bus;
 
 	/* Assign a unique device index and hence name. */
-	dev->index = dev_index++;
+	err = ida_simple_get(&virtio_index_ida, 0, 0, GFP_KERNEL);
+	if (err < 0)
+		goto out;
+
+	dev->index = err;
 	dev_set_name(&dev->dev, "virtio%u", dev->index);
 
 	/* We always start by resetting the device, in case a previous
@@ -208,6 +213,7 @@ int register_virtio_device(struct virtio_device *dev)
 	/* device_register() causes the bus infrastructure to look for a
 	 * matching driver. */
 	err = device_register(&dev->dev);
+out:
 	if (err)
 		add_status(dev, VIRTIO_CONFIG_S_FAILED);
 	return err;
@@ -217,6 +223,7 @@ EXPORT_SYMBOL_GPL(register_virtio_device);
 void unregister_virtio_device(struct virtio_device *dev)
 {
 	device_unregister(&dev->dev);
+	ida_simple_remove(&virtio_index_ida, dev->index);
 }
 EXPORT_SYMBOL_GPL(unregister_virtio_device);
 
-- 
1.7.10

^ permalink raw reply related

* [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method
From: Asias He @ 2012-05-03  2:19 UTC (permalink / raw)
  To: virtualization, Rusty Russell, Michael S. Tsirkin; +Cc: kvm

If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finised, we will stuck
in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before DEAD marking. However it will never success if the device is
already stopped. We'll have q->in_flight[] > 0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Signed-off-by: Asias He <asias@redhat.com>
---
 drivers/block/virtio_blk.c |   26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 72fe55d..72b818b 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -46,6 +46,9 @@ struct virtio_blk
 	/* Ida index - used to track minor number allocations. */
 	int index;
 
+	/* Number of pending requests dispatched to driver. */
+	int req_in_flight;
+
 	/* Scatterlist: can be too big for stack. */
 	struct scatterlist sg[/*sg_elems*/];
 };
@@ -95,6 +98,7 @@ static void blk_done(struct virtqueue *vq)
 		}
 
 		__blk_end_request_all(vbr->req, error);
+		vblk->req_in_flight--;
 		mempool_free(vbr, vblk->pool);
 	}
 	/* In case queue is stopped waiting for more buffers. */
@@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)
 
 	while ((req = blk_peek_request(q)) != NULL) {
 		BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
+		vblk->req_in_flight++;
 
 		/* If this request fails, stop queue and wait for something to
 		   finish to restart it. */
@@ -443,7 +448,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 	if (err)
 		goto out_free_vblk;
 
-	vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+	vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
 	if (!vblk->pool) {
 		err = -ENOMEM;
 		goto out_free_vq;
@@ -466,6 +471,7 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
 
 	virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
 
+	vblk->req_in_flight = 0;
 	vblk->disk->major = major;
 	vblk->disk->first_minor = index_to_minor(index);
 	vblk->disk->private_data = vblk;
@@ -576,22 +582,34 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
 {
 	struct virtio_blk *vblk = vdev->priv;
 	int index = vblk->index;
+	unsigned long flags;
+	int req_in_flight;
 
 	/* Prevent config work handler from accessing the device. */
 	mutex_lock(&vblk->config_lock);
 	vblk->config_enable = false;
 	mutex_unlock(&vblk->config_lock);
 
+	/* Abort all request on the queue. */
+	blk_abort_queue(vblk->disk->queue);
+	del_gendisk(vblk->disk);
+
 	/* Stop all the virtqueues. */
 	vdev->config->reset(vdev);
-
+	vdev->config->del_vqs(vdev);
 	flush_work(&vblk->config_work);
 
-	del_gendisk(vblk->disk);
+	/* Wait requests dispatched to device driver to finish. */
+	do {
+		spin_lock_irqsave(&vblk->lock, flags);
+		req_in_flight = vblk->req_in_flight;
+		spin_unlock_irqrestore(&vblk->lock, flags);
+	} while (req_in_flight != 0);
+
 	blk_cleanup_queue(vblk->disk->queue);
 	put_disk(vblk->disk);
+
 	mempool_destroy(vblk->pool);
-	vdev->config->del_vqs(vdev);
 	kfree(vblk);
 	ida_simple_remove(&vd_index_ida, index);
 }
-- 
1.7.10

^ permalink raw reply related

* [PATCH RFC V8 17/17] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock
From: Raghavendra K T @ 2012-05-02 10:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Greg Kroah-Hartman, Konrad Rzeszutek Wilk,
	H. Peter Anvin, Avi Kivity, X86, Gleb Natapov, Ingo Molnar,
	Marcelo Tosatti
  Cc: Xen Devel, KVM, linux-doc, LKML, Raghavendra K T,
	Srivatsa Vaddagiri, Virtualization, Andi Kleen,
	Stephan Diestelhorst, Attilio Rao, Linus Torvalds,
	Stefano Stabellini
In-Reply-To: <20120502100610.13206.40.sendpatchset@codeblue.in.ibm.com>

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> 

KVM_HC_KICK_CPU  hypercall added to wakeup halted vcpu in paravirtual spinlock
enabled guest.

KVM_FEATURE_PV_UNHALT enables guest to check whether pv spinlock can be enabled
in guest.

Thanks Alex for KVM_HC_FEATURES inputs and Vatsa for rewriting KVM_HC_KICK_CPU

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 Documentation/virtual/kvm/cpuid.txt      |    4 ++
 Documentation/virtual/kvm/hypercalls.txt |   60 ++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 0 deletions(-)
diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt
index 8820685..062dff9 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -39,6 +39,10 @@ KVM_FEATURE_CLOCKSOURCE2           ||     3 || kvmclock available at msrs
 KVM_FEATURE_ASYNC_PF               ||     4 || async pf can be enabled by
                                    ||       || writing to msr 0x4b564d02
 ------------------------------------------------------------------------------
+KVM_FEATURE_PV_UNHALT              ||     6 || guest checks this feature bit
+                                   ||       || before enabling paravirtualized
+                                   ||       || spinlock support.
+------------------------------------------------------------------------------
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||    24 || host will warn if no guest-side
                                    ||       || per-cpu warps are expected in
                                    ||       || kvmclock.
diff --git a/Documentation/virtual/kvm/hypercalls.txt b/Documentation/virtual/kvm/hypercalls.txt
new file mode 100644
index 0000000..bc3f14a
--- /dev/null
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -0,0 +1,60 @@
+KVM Hypercalls Documentation
+===========================
+The template for each hypercall is:
+1. Hypercall name, value.
+2. Architecture(s)
+3. Status (deprecated, obsolete, active)
+4. Purpose
+
+1. KVM_HC_VAPIC_POLL_IRQ
+------------------------
+Value: 1
+Architecture: x86
+Purpose: None
+
+2. KVM_HC_MMU_OP
+------------------------
+Value: 2
+Architecture: x86
+Status: deprecated.
+Purpose: Support MMU operations such as writing to PTE,
+flushing TLB, release PT.
+
+3. KVM_HC_FEATURES
+------------------------
+Value: 3
+Architecture: PPC
+Status: active
+Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid
+used to enumerate which hypercalls are available. On PPC, either device tree
+based lookup ( which is also what EPAPR dictates) OR KVM specific enumeration
+mechanism (which is this hypercall) can be used.
+
+4. KVM_HC_PPC_MAP_MAGIC_PAGE
+------------------------
+Value: 4
+Architecture: PPC
+Status: active
+Purpose: To enable communication between the hypervisor and guest there is a
+shared page that contains parts of supervisor visible register state.
+The guest can map this shared page to access its supervisor register through
+memory using this hypercall.
+
+5. KVM_HC_KICK_CPU
+------------------------
+Value: 5
+Architecture: x86
+Status: active
+Purpose: Hypercall used to wakeup a vcpu from HLT state
+
+Usage example : A vcpu of a paravirtualized guest that is busywaiting in guest
+kernel mode for an event to occur (ex: a spinlock to become available) can
+execute HLT instruction once it has busy-waited for more than a threshold
+time-interval. Execution of HLT instruction would cause the hypervisor to put
+the vcpu to sleep until occurence of an appropriate event. Another vcpu of the
+same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
+specifying APIC ID of the vcpu to be wokenup.
+
+TODO:
+1. more information on input and output needed?
+2. Add more detail to purpose of hypercalls.

^ permalink raw reply related

* [PATCH RFC V8 16/17] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
From: Raghavendra K T @ 2012-05-02 10:09 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Greg Kroah-Hartman, Konrad Rzeszutek Wilk,
	H. Peter Anvin, Marcelo Tosatti, X86, Gleb Natapov, Ingo Molnar,
	Avi Kivity
  Cc: Xen Devel, KVM, linux-doc, LKML, Raghavendra K T,
	Srivatsa Vaddagiri, Virtualization, Andi Kleen,
	Stephan Diestelhorst, Attilio Rao, Linus Torvalds,
	Stefano Stabellini
In-Reply-To: <20120502100610.13206.40.sendpatchset@codeblue.in.ibm.com>

From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>

During smp_boot_cpus  paravirtualied KVM guest detects if the hypervisor has
required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If so,
 support for pv-ticketlocks is registered via pv_lock_ops.

Use KVM_HC_KICK_CPU hypercall to wakeup waiting/halted vcpu.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 arch/x86/include/asm/kvm_para.h |   14 ++-
 arch/x86/kernel/kvm.c           |  256 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 268 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 5b647ea..77266d3 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -195,10 +195,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
 	return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ba6e4..7c46567 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -33,6 +33,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/kprobes.h>
+#include <linux/debugfs.h>
 #include <asm/timer.h>
 #include <asm/cpu.h>
 #include <asm/traps.h>
@@ -368,6 +369,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
 #endif
 	kvm_guest_cpu_init();
 	native_smp_prepare_boot_cpu();
+	kvm_spinlock_init();
 }
 
 static void __cpuinit kvm_guest_cpu_online(void *dummy)
@@ -450,3 +452,257 @@ static __init int activate_jump_labels(void)
 	return 0;
 }
 arch_initcall(activate_jump_labels);
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+	int apicid;
+
+	apicid = per_cpu(x86_cpu_to_apicid, cpu);
+	kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+enum kvm_contention_stat {
+	TAKEN_SLOW,
+	TAKEN_SLOW_PICKUP,
+	RELEASED_SLOW,
+	RELEASED_SLOW_KICKED,
+	NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS	30
+
+static struct kvm_spinlock_stats
+{
+	u32 contention_stats[NR_CONTENTION_STATS];
+	u32 histo_spin_blocked[HISTO_BUCKETS+1];
+	u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+	u8 ret;
+	u8 old;
+
+	old = ACCESS_ONCE(zero_stats);
+	if (unlikely(old)) {
+		ret = cmpxchg(&zero_stats, old, 0);
+		/* This ensures only one fellow resets the stat */
+		if (ret == old)
+			memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+	}
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+	check_zero();
+	spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+	return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+	unsigned index;
+
+	index = ilog2(delta);
+	check_zero();
+
+	if (index < HISTO_BUCKETS)
+		array[index]++;
+	else
+		array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+	u32 delta;
+
+	delta = sched_clock() - start;
+	__spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+	spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+	d_kvm_debug = debugfs_create_dir("kvm", NULL);
+	if (!d_kvm_debug)
+		printk(KERN_WARNING "Could not create 'kvm' debugfs directory\n");
+
+	return d_kvm_debug;
+}
+
+static int __init kvm_spinlock_debugfs(void)
+{
+	struct dentry *d_kvm;
+
+	d_kvm = kvm_init_debugfs();
+	if (d_kvm == NULL)
+		return -ENOMEM;
+
+	d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
+
+	debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
+
+	debugfs_create_u32("taken_slow", 0444, d_spin_debug,
+		   &spinlock_stats.contention_stats[TAKEN_SLOW]);
+	debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
+		   &spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
+
+	debugfs_create_u32("released_slow", 0444, d_spin_debug,
+		   &spinlock_stats.contention_stats[RELEASED_SLOW]);
+	debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
+		   &spinlock_stats.contention_stats[RELEASED_SLOW_KICKED]);
+
+	debugfs_create_u64("time_blocked", 0444, d_spin_debug,
+			   &spinlock_stats.time_blocked);
+
+	debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
+		     spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
+
+	return 0;
+}
+fs_initcall(kvm_spinlock_debugfs);
+#else  /* !CONFIG_KVM_DEBUG_FS */
+#define TIMEOUT			(1 << 10)
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+}
+
+static inline u64 spin_time_start(void)
+{
+	return 0;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+}
+#endif  /* CONFIG_KVM_DEBUG_FS */
+
+struct kvm_lock_waiting {
+	struct arch_spinlock *lock;
+	__ticket_t want;
+};
+
+/* cpus 'waiting' on a spinlock to become available */
+static cpumask_t waiting_cpus;
+
+/* Track spinlock on which a cpu is waiting */
+static DEFINE_PER_CPU(struct kvm_lock_waiting, lock_waiting);
+
+static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
+{
+	struct kvm_lock_waiting *w;
+	int cpu;
+	u64 start;
+	unsigned long flags;
+
+	w = &__get_cpu_var(lock_waiting);
+	cpu = smp_processor_id();
+	start = spin_time_start();
+
+	/*
+	 * Make sure an interrupt handler can't upset things in a
+	 * partially setup state.
+	 */
+	local_irq_save(flags);
+
+	/*
+	 * The ordering protocol on this is that the "lock" pointer
+	 * may only be set non-NULL if the "want" ticket is correct.
+	 * If we're updating "want", we must first clear "lock".
+	 */
+	w->lock = NULL;
+	smp_wmb();
+	w->want = want;
+	smp_wmb();
+	w->lock = lock;
+
+	add_stats(TAKEN_SLOW, 1);
+
+	/*
+	 * This uses set_bit, which is atomic but we should not rely on its
+	 * reordering gurantees. So barrier is needed after this call.
+	 */
+	cpumask_set_cpu(cpu, &waiting_cpus);
+
+	barrier();
+
+	/*
+	 * Mark entry to slowpath before doing the pickup test to make
+	 * sure we don't deadlock with an unlocker.
+	 */
+	__ticket_enter_slowpath(lock);
+
+	/*
+	 * check again make sure it didn't become free while
+	 * we weren't looking.
+	 */
+	if (ACCESS_ONCE(lock->tickets.head) == want) {
+		add_stats(TAKEN_SLOW_PICKUP, 1);
+		goto out;
+	}
+
+	/* Allow interrupts while blocked */
+	local_irq_restore(flags);
+
+	/* halt until it's our turn and kicked. */
+	halt();
+
+	local_irq_save(flags);
+out:
+	cpumask_clear_cpu(cpu, &waiting_cpus);
+	w->lock = NULL;
+	local_irq_restore(flags);
+	spin_time_accum_blocked(start);
+}
+PV_CALLEE_SAVE_REGS_THUNK(kvm_lock_spinning);
+
+/* Kick vcpu waiting on @lock->head to reach value @ticket */
+static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
+{
+	int cpu;
+
+	add_stats(RELEASED_SLOW, 1);
+	for_each_cpu(cpu, &waiting_cpus) {
+		const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu);
+		if (ACCESS_ONCE(w->lock) == lock &&
+		    ACCESS_ONCE(w->want) == ticket) {
+			add_stats(RELEASED_SLOW_KICKED, 1);
+			kvm_kick_cpu(cpu);
+			break;
+		}
+	}
+}
+
+/*
+ * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
+ */
+void __init kvm_spinlock_init(void)
+{
+	if (!kvm_para_available())
+		return;
+	/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
+	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
+		return;
+
+	printk(KERN_INFO"KVM setup paravirtual spinlock\n");
+
+	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+
+	pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
+	pv_lock_ops.unlock_kick = kvm_unlock_kick;
+}
+#endif	/* CONFIG_PARAVIRT_SPINLOCKS */

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox