Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Raghavendra K T @ 2012-04-01 13:56 UTC (permalink / raw)
  To: Avi Kivity
  Cc: KVM, Alan Meadows, Peter Zijlstra, Stefano Stabellini,
	the arch/x86 maintainers, LKML, Konrad Rzeszutek Wilk, Andi Kleen,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Ingo Molnar, Virtualization, Linus Torvalds,
	Xen Devel, Stephan Diestelhorst
In-Reply-To: <4F785DCF.7020809@redhat.com>

On 04/01/2012 07:23 PM, Avi Kivity wrote:
> On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>>> I have patch something like below in mind to try:
>
> I'm interested in how PLE does vs. your patches, both with PLE enabled
> and disabled.
>

Sure. will update with the experimental results.

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Avi Kivity @ 2012-04-01 13:53 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: KVM, Alan Meadows, Peter Zijlstra, Stefano Stabellini,
	the arch/x86 maintainers, LKML, Konrad Rzeszutek Wilk, Andi Kleen,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Ingo Molnar, Virtualization, Linus Torvalds,
	Xen Devel, Stephan Diestelhorst
In-Reply-To: <4F785CC9.7070204@linux.vnet.ibm.com>

On 04/01/2012 04:48 PM, Raghavendra K T wrote:
>>> I have patch something like below in mind to try:
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index d3b98b1..5127668 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>        * else and called schedule in __vcpu_run.  Hopefully that
>>>        * VCPU is holding the lock that we need and will release it.
>>>        * We approximate round-robin by starting at the last boosted
>>> VCPU.
>>> +     * Priority is given to vcpu that are unhalted.
>>>        */
>>> -    for (pass = 0; pass<  2&&  !yielded; pass++) {
>>> +    for (pass = 0; pass<  3&&  !yielded; pass++) {
>>>           kvm_for_each_vcpu(i, vcpu, kvm) {
>>>               struct task_struct *task = NULL;
>>>               struct pid *pid;
>>> -            if (!pass&&  i<  last_boosted_vcpu) {
>>> +            if (!pass&&  !vcpu->pv_unhalted)
>>> +                continue;
>>> +            else if (pass == 1&&  i<  last_boosted_vcpu) {
>>>                   i = last_boosted_vcpu;
>>>                   continue;
>>> -            } else if (pass&&  i>  last_boosted_vcpu)
>>> +            } else if (pass == 2&&  i>  last_boosted_vcpu)
>>>                   break;
>>>               if (vcpu == me)
>>>                   continue;
>>>
>>
>> Actually I think this is unneeded.  The loops tries to find vcpus that
>> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
>> don't match this condition.
>>
>
>
> I almost agree. But at corner of my thought,
>
> Suppose there are 8 vcpus runnable out of which 4 of them are kicked
> but not running, making yield_to those 4 vcpus would result in better
> lock progress. no?

That's what the code does.

>   I still have little problem getting PLE setup, here (instead
> rebasing patches).
> Once I get PLE to get that running, and numbers prove no improvement,
> I will drop this idea.
>

I'm interested in how PLE does vs. your patches, both with PLE enabled
and disabled.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Raghavendra K T @ 2012-04-01 13:48 UTC (permalink / raw)
  To: Avi Kivity
  Cc: KVM, Alan Meadows, Peter Zijlstra, Stefano Stabellini,
	the arch/x86 maintainers, LKML, Konrad Rzeszutek Wilk, Andi Kleen,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Ingo Molnar, Virtualization, Linus Torvalds,
	Xen Devel, Stephan Diestelhorst
In-Reply-To: <4F7855A1.80107@redhat.com>

On 04/01/2012 06:48 PM, Avi Kivity wrote:
> On 03/30/2012 01:07 PM, Raghavendra K T wrote:
>> On 03/29/2012 11:33 PM, Raghavendra K T wrote:
>>> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>>>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>>
>>> I really like below ideas. Thanks for that!.
>>>
>>>> - from the PLE handler, don't wake up a vcpu that is sleeping
>>>> because it
>>>> is waiting for a kick
>>>
>>> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
>>> to check if any vcpu is already kicked. This would almost result in
>>> yield_to(kicked_vcpu). IMO this is also worth trying.
>>>
>>> will try above ideas soon.
>>>
>>
>> I have patch something like below in mind to try:
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index d3b98b1..5127668 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>        * else and called schedule in __vcpu_run.  Hopefully that
>>        * VCPU is holding the lock that we need and will release it.
>>        * We approximate round-robin by starting at the last boosted VCPU.
>> +     * Priority is given to vcpu that are unhalted.
>>        */
>> -    for (pass = 0; pass<  2&&  !yielded; pass++) {
>> +    for (pass = 0; pass<  3&&  !yielded; pass++) {
>>           kvm_for_each_vcpu(i, vcpu, kvm) {
>>               struct task_struct *task = NULL;
>>               struct pid *pid;
>> -            if (!pass&&  i<  last_boosted_vcpu) {
>> +            if (!pass&&  !vcpu->pv_unhalted)
>> +                continue;
>> +            else if (pass == 1&&  i<  last_boosted_vcpu) {
>>                   i = last_boosted_vcpu;
>>                   continue;
>> -            } else if (pass&&  i>  last_boosted_vcpu)
>> +            } else if (pass == 2&&  i>  last_boosted_vcpu)
>>                   break;
>>               if (vcpu == me)
>>                   continue;
>>
>
> Actually I think this is unneeded.  The loops tries to find vcpus that
> are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
> don't match this condition.
>

I almost agree. But at corner of my thought,

Suppose there are 8 vcpus runnable out of which 4 of them are kicked
but not running, making yield_to those 4 vcpus would result in better
lock progress. no?

  I still have little problem getting PLE setup, here (instead rebasing 
patches).
Once I get PLE to get that running, and numbers prove no improvement, I 
will drop this idea.

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Avi Kivity @ 2012-04-01 13:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: the arch/x86 maintainers, KVM, Konrad Rzeszutek Wilk,
	Peter Zijlstra, Stefano Stabellini, Raghavendra K T, LKML,
	Andi Kleen, Srivatsa Vaddagiri, Jeremy Fitzhardinge,
	H. Peter Anvin, Attilio Rao, Ingo Molnar, Virtualization,
	Linus Torvalds, Xen Devel, Stephan Diestelhorst
In-Reply-To: <alpine.LFD.2.02.1203302333560.2542@ionos>

On 03/31/2012 01:07 AM, Thomas Gleixner wrote:
> On Fri, 30 Mar 2012, H. Peter Anvin wrote:
>
> > What is the current status of this patchset?  I haven't looked at it too
> > closely because I have been focused on 3.4 up until now...
>
> The real question is whether these heuristics are the correct approach
> or not.
>
> If I look at it from the non virtualized kernel side then this is ass
> backwards. We know already that we are holding a spinlock which might
> cause other (v)cpus going into eternal spin. The non virtualized
> kernel solves this by disabling preemption and therefor getting out of
> the critical section as fast as possible,
>
> The virtualization problem reminds me a lot of the problem which RT
> kernels are observing where non raw spinlocks are turned into
> "sleeping spinlocks" and therefor can cause throughput issues for non
> RT workloads.
>
> Though the virtualized situation is even worse. Any preempted guest
> section which holds a spinlock is prone to cause unbound delays.
>
> The paravirt ticketlock solution can only mitigate the problem, but
> not solve it. With massive overcommit there is always a way to trigger
> worst case scenarious unless you are educating the scheduler to cope
> with that.
>
> So if we need to fiddle with the scheduler and frankly that's the only
> way to get a real gain (the numbers, which are achieved by this
> patches, are not that impressive) then the question arises whether we
> should turn the whole thing around.
>
> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
>
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.

Interesting idea (I think it has been raised before btw, don't recall by
who).

One thing about it is that it can give many false positives.  Consider a
fine-grained spinlock that is being accessed by many threads.  That is,
the lock is taken and released with high frequency, but there is no
contention, because each vcpu is accessing a different instance.  So the
host scheduler will needlessly delay preemption of vcpus that happen to
be holding a lock, even though this gains nothing.

A second issue may happen with a lock that is taken and released with
high frequency, with a high hold percentage.  The host scheduler may
always sample the guest in a held state, leading it to conclude that
it's exceeding its timeout when in fact the lock is held for a short
time only.

> Of course this needs to be time bound, so a rogue guest cannot
> monopolize the cpu forever, but that's the least to worry about
> problem simply because a guest which does not get out of a spinlocked
> region within a certain amount of time is borked and elegible to
> killing anyway.

Hopefully not killing!  Just because a guest doesn't scale well, or even
if it's deadlocked, doesn't mean it should be killed.  Just preempt it.

> Thoughts ?

It's certainly interesting.  Maybe a combination is worthwhile - prevent
lockholder preemption for a short period of time AND put waiters to
sleep in case that period is insufficient to release the lock.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Avi Kivity @ 2012-04-01 13:18 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: KVM, Alan Meadows, Peter Zijlstra, Stefano Stabellini,
	the arch/x86 maintainers, LKML, Konrad Rzeszutek Wilk, Andi Kleen,
	Srivatsa Vaddagiri, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Ingo Molnar, Virtualization, Linus Torvalds,
	Xen Devel, Stephan Diestelhorst
In-Reply-To: <4F7585EE.7060203@linux.vnet.ibm.com>

On 03/30/2012 01:07 PM, Raghavendra K T wrote:
> On 03/29/2012 11:33 PM, Raghavendra K T wrote:
>> On 03/29/2012 03:28 PM, Avi Kivity wrote:
>>> On 03/28/2012 08:21 PM, Raghavendra K T wrote:
>
>> I really like below ideas. Thanks for that!.
>>
>>> - from the PLE handler, don't wake up a vcpu that is sleeping
>>> because it
>>> is waiting for a kick
>>
>> How about, adding another pass in the beginning of kvm_vcpu_on_spin()
>> to check if any vcpu is already kicked. This would almost result in
>> yield_to(kicked_vcpu). IMO this is also worth trying.
>>
>> will try above ideas soon.
>>
>
> I have patch something like below in mind to try:
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d3b98b1..5127668 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1608,15 +1608,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>       * else and called schedule in __vcpu_run.  Hopefully that
>       * VCPU is holding the lock that we need and will release it.
>       * We approximate round-robin by starting at the last boosted VCPU.
> +     * Priority is given to vcpu that are unhalted.
>       */
> -    for (pass = 0; pass < 2 && !yielded; pass++) {
> +    for (pass = 0; pass < 3 && !yielded; pass++) {
>          kvm_for_each_vcpu(i, vcpu, kvm) {
>              struct task_struct *task = NULL;
>              struct pid *pid;
> -            if (!pass && i < last_boosted_vcpu) {
> +            if (!pass && !vcpu->pv_unhalted)
> +                continue;
> +            else if (pass == 1 && i < last_boosted_vcpu) {
>                  i = last_boosted_vcpu;
>                  continue;
> -            } else if (pass && i > last_boosted_vcpu)
> +            } else if (pass == 2 && i > last_boosted_vcpu)
>                  break;
>              if (vcpu == me)
>                  continue;
>

Actually I think this is unneeded.  The loops tries to find vcpus that
are runnable but not running (vcpu_active(vcpu->wq)), and halted vcpus
don't match this condition.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH 0/4] block: move sd_format_disk_name() into block core as disk_name_format()
From: Michael S. Tsirkin @ 2012-04-01 11:20 UTC (permalink / raw)
  To: Ren Mingxin; +Cc: Jens Axboe, KVM, SCSI, LKML, VIRTUAL, Tejun Heo
In-Reply-To: <4F7581D4.4040301@cn.fujitsu.com>

On Fri, Mar 30, 2012 at 05:50:12PM +0800, Ren Mingxin wrote:
> 
> 
> This patch series renames "sd_format_disk_name()" to
> "disk_name_format()" and moves it into block core. So
> that who needs formatting disk name can use it, instead
> of duplicating these similar help functions.

I see there are some responses about naming and comments,
so that would need to be fixed.
Besides this, Jens, would you like to take the patchset
through your tree or prefer for me to merge it through virtio tree? 
Block tree seems more appropriate, right?
Also this is a bugfix technically since it fixes setups
with a ton of disks, so 3.4 material?

> Ren Mingxin (4):
>   block: add function disk_name_format() into block core
>   scsi: replace sd_format_disk_name() to disk_name_format()
>   block: replace rssd_disk_name_format() to disk_name_format()
>   virtio_blk: use disk_name_format() to support mass of disks naming
> 
>  block/genhd.c                     |   49 ++++++++++++++++++++++++++++++++++++++
>  drivers/block/mtip32xx/mtip32xx.c |   33 -------------------------
>  drivers/block/virtio_blk.c        |   13 ----------
>  drivers/scsi/sd.c                 |   48 -------------------------------------
>  include/linux/genhd.h             |    2 +
>  5 files changed, 54 insertions(+), 91 deletions(-)
> 
> Any comment will be appreciated.
> 
> Thanks,
> Ren
> 

^ permalink raw reply

* Re: [PATCH] virtio_blk: Drop unused request tracking list
From: Michael S. Tsirkin @ 2012-04-01 10:07 UTC (permalink / raw)
  To: Asias He; +Cc: kvm, virtualization
In-Reply-To: <1333077850-21717-1-git-send-email-asias@redhat.com>

On Fri, Mar 30, 2012 at 11:24:10AM +0800, Asias He wrote:
> Benchmark shows small performance improvement on fusion io device.
> 
> Before:
>   seq-read : io=1,024MB, bw=19,982KB/s, iops=39,964, runt= 52475msec
>   seq-write: io=1,024MB, bw=20,321KB/s, iops=40,641, runt= 51601msec
>   rnd-read : io=1,024MB, bw=15,404KB/s, iops=30,808, runt= 68070msec
>   rnd-write: io=1,024MB, bw=14,776KB/s, iops=29,552, runt= 70963msec
> 
> After:
>   seq-read : io=1,024MB, bw=20,343KB/s, iops=40,685, runt= 51546msec
>   seq-write: io=1,024MB, bw=20,803KB/s, iops=41,606, runt= 50404msec
>   rnd-read : io=1,024MB, bw=16,221KB/s, iops=32,442, runt= 64642msec
>   rnd-write: io=1,024MB, bw=15,199KB/s, iops=30,397, runt= 68991msec
> 
> Signed-off-by: Asias He <asias@redhat.com>

Thanks, the patch makes sense to me.
Acked-by: Michael S. Tsirkin <mst@redhat.com>

This is 3.5 material, correct?

> ---
>  drivers/block/virtio_blk.c |   10 ----------
>  1 files changed, 0 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index c4a60ba..338da9a 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -29,9 +29,6 @@ struct virtio_blk
>  	/* The disk structure for the kernel. */
>  	struct gendisk *disk;
>  
> -	/* Request tracking. */
> -	struct list_head reqs;
> -
>  	mempool_t *pool;
>  
>  	/* Process context for config space updates */
> @@ -55,7 +52,6 @@ struct virtio_blk
>  
>  struct virtblk_req
>  {
> -	struct list_head list;
>  	struct request *req;
>  	struct virtio_blk_outhdr out_hdr;
>  	struct virtio_scsi_inhdr in_hdr;
> @@ -99,7 +95,6 @@ static void blk_done(struct virtqueue *vq)
>  		}
>  
>  		__blk_end_request_all(vbr->req, error);
> -		list_del(&vbr->list);
>  		mempool_free(vbr, vblk->pool);
>  	}
>  	/* In case queue is stopped waiting for more buffers. */
> @@ -184,7 +179,6 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
>  		return false;
>  	}
>  
> -	list_add_tail(&vbr->list, &vblk->reqs);
>  	return true;
>  }
>  
> @@ -408,7 +402,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
>  		goto out_free_index;
>  	}
>  
> -	INIT_LIST_HEAD(&vblk->reqs);
>  	spin_lock_init(&vblk->lock);
>  	vblk->vdev = vdev;
>  	vblk->sg_elems = sg_elems;
> @@ -571,9 +564,6 @@ static void __devexit virtblk_remove(struct virtio_device *vdev)
>  	vblk->config_enable = false;
>  	mutex_unlock(&vblk->config_lock);
>  
> -	/* Nothing should be pending. */
> -	BUG_ON(!list_empty(&vblk->reqs));
> -
>  	/* Stop all the virtqueues. */
>  	vdev->config->reset(vdev);
>  
> -- 
> 1.7.7.6

^ permalink raw reply

* Re: [PATCH] virtio_blk: Drop unused request tracking list
From: Asias He @ 2012-03-31 10:22 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <CAJSP0QWN2byeZt9RCwgwVp9HuD5G0yha3xNYT+m+OJPS16W8SQ@mail.gmail.com>

Hi, Stefan

On Fri, Mar 30, 2012 at 6:14 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Mar 30, 2012 at 4:24 AM, Asias He <asias@redhat.com> wrote:
>> Benchmark shows small performance improvement on fusion io device.
>>
>> Before:
>>  seq-read : io=1,024MB, bw=19,982KB/s, iops=39,964, runt= 52475msec
>>  seq-write: io=1,024MB, bw=20,321KB/s, iops=40,641, runt= 51601msec
>>  rnd-read : io=1,024MB, bw=15,404KB/s, iops=30,808, runt= 68070msec
>>  rnd-write: io=1,024MB, bw=14,776KB/s, iops=29,552, runt= 70963msec
>>
>> After:
>>  seq-read : io=1,024MB, bw=20,343KB/s, iops=40,685, runt= 51546msec
>>  seq-write: io=1,024MB, bw=20,803KB/s, iops=41,606, runt= 50404msec
>>  rnd-read : io=1,024MB, bw=16,221KB/s, iops=32,442, runt= 64642msec
>>  rnd-write: io=1,024MB, bw=15,199KB/s, iops=30,397, runt= 68991msec
>>
>> Signed-off-by: Asias He <asias@redhat.com>
>> ---
>>  drivers/block/virtio_blk.c |   10 ----------
>>  1 files changed, 0 insertions(+), 10 deletions(-)
>
> Thanks for providing performance results.  It's a bit scary that this
> unused list has an impact...I'm sure we have worse things elsewhere in
> the KVM storage code path.

Do you find any worse things? I saw your trace work here:

   http://www.linux-kvm.org/page/Virtio/Block/Latency

> Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

Thanks for reviewing!

--
Asias He
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Ingo Molnar @ 2012-03-31  8:11 UTC (permalink / raw)
  To: Andi Kleen
  Cc: the arch/x86 maintainers, KVM, Konrad Rzeszutek Wilk,
	Peter Zijlstra, Stefano Stabellini, Raghavendra K T, LKML,
	Xen Devel, Srivatsa Vaddagiri, Avi Kivity, Jeremy Fitzhardinge,
	H. Peter Anvin, Attilio Rao, Thomas Gleixner, Virtualization,
	Linus Torvalds, Ingo Molnar, Stephan Diestelhorst
In-Reply-To: <20120331000843.GO17822@one.firstfloor.org>


* Andi Kleen <andi@firstfloor.org> wrote:

> > Care to back that up with numbers and proper trace evidence 
> > instead of handwaving?
> 
> E.g. my plumbers presentations on lock and mm scalability from 
> last year has some graphs that show this very clearly, plus 
> some additional data on the mutexes. This compares to the 
> glibc futex locks, which perform much better than the kernel 
> mutex locks on larger systems under higher contention

If you mean these draft slides:

  http://www.halobates.de/plumbers-fork-locks_v2.pdf

it has very little verifiable information in it. It just 
cryptically says lock hold time "microbenchmark", which might or 
might not be a valid measurement.

You could have been honest and straightforward in your first 
mail:

 "I ran workload X on machine Y, and got results Z."

Instead you are *hindering* the discussion:

> Given your tone I will not supply an URL. [...]

If you meant the above URL then it's not the proper numbers 
Thomas asked for, just some vague slides. If you meant something 
else then put up or shut up.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Srivatsa Vaddagiri @ 2012-03-31  4:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: the arch/x86 maintainers, KVM, Konrad Rzeszutek Wilk,
	Peter Zijlstra, Stefano Stabellini, Raghavendra K T, LKML,
	Andi Kleen, Avi Kivity, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Ingo Molnar, Virtualization, Linus Torvalds,
	Xen Devel, Stephan Diestelhorst
In-Reply-To: <20120331040745.GC14030@linux.vnet.ibm.com>

* Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> [2012-03-31 09:37:45]:

> The issue is with ticketlocks though. VCPUs could go into a spin w/o
> a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
> that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
> and releases the lock. VCPU1 is next eligible to take the lock. If 

Sorry I meant to say "VCPU2 is next eligible ..."

> that is not scheduled early enough by host, then remaining vcpus would keep 
> spinning (even though lock is technically not held by anybody) w/o making 
> forward progress.
> 
> In that situation, what we really need is for the guest to hint to host
> scheduler to schedule VCPU1 early (via yield_to or something similar). 

s/VCPU1/VCPU2 ..

- vatsa

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Srivatsa Vaddagiri @ 2012-03-31  4:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: the arch/x86 maintainers, KVM, Konrad Rzeszutek Wilk,
	Peter Zijlstra, Stefano Stabellini, Raghavendra K T, LKML,
	Andi Kleen, Avi Kivity, Jeremy Fitzhardinge, H. Peter Anvin,
	Attilio Rao, Ingo Molnar, Virtualization, Linus Torvalds,
	Xen Devel, Stephan Diestelhorst
In-Reply-To: <alpine.LFD.2.02.1203302333560.2542@ionos>

* Thomas Gleixner <tglx@linutronix.de> [2012-03-31 00:07:58]:

> I know that Peter is going to go berserk on me, but if we are running
> a paravirt guest then it's simple to provide a mechanism which allows
> the host (aka hypervisor) to check that in the guest just by looking
> at some global state.
> 
> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be
> modified to indicate the guest to invoke an exit when the last nested
> lock has been released.

I had attempted something like that long back:

http://lkml.org/lkml/2010/6/3/4

The issue is with ticketlocks though. VCPUs could go into a spin w/o
a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in
that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it)
and releases the lock. VCPU1 is next eligible to take the lock. If 
that is not scheduled early enough by host, then remaining vcpus would keep 
spinning (even though lock is technically not held by anybody) w/o making 
forward progress.

In that situation, what we really need is for the guest to hint to host
scheduler to schedule VCPU1 early (via yield_to or something similar). 

The current pv-spinlock patches however does not track which vcpu is
spinning at what head of the ticketlock. I suppose we can consider 
that optimization in future and see how much benefit it provides (over
plain yield/sleep the way its done now).

Do you see any issues if we take in what we have today and address the
finer-grained optimization as next step?

- vatsa 

^ permalink raw reply

* Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming
From: Asias He @ 2012-03-31  3:03 UTC (permalink / raw)
  To: Ren Mingxin
  Cc: Jens Axboe, KVM, SCSI, Michael S. Tsirkin, LKML, VIRTUAL,
	Tejun Heo
In-Reply-To: <4F765A72.6090006@cn.fujitsu.com>

On Sat, Mar 31, 2012 at 9:14 AM, Ren Mingxin <renmx@cn.fujitsu.com> wrote:
>  Hi, He:
>
>
> On 03/30/2012 07:22 PM, Asias He wrote:
>>
>> On Fri, Mar 30, 2012 at 5:53 PM, Ren Mingxin<renmx@cn.fujitsu.com>  wrote:
>>>
>>>  The current virtblk's naming algorithm only supports 263  disks.
>>> If there are mass of virtblks(exceeding 263), there will be disks
>>> with the same name.
>>
>> This fix is pretty nice. However, current virtblk's naming still
>> supports up to 18278 disks, no?
>> ( index 0  ->  vda, index 18277 ->  vdzzz ).
>
>
> Sorry, I intended to type 26^3+ (26^2+26)...
> It may still be a restriction which should be improved though.

OK.

-- 
Asias He
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PULL] virtio: S3 support, use PM API macro for init
From: Amit Shah @ 2012-03-31  2:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Michael S. Tsirkin, linux-kernel, Virtualization List

Linus,

Please pull for virtio-s3/s4 fixes.  I'm taking the liberty to send
this to you directly since Rusty's on vacation.

The patches were posted earlier at

http://www.spinics.net/lists/linux-virtualization/msg15781.html

and have been tested to work fine, here's the message from the
original submission:

Turns out S3 is not different from S4 for virtio devices: the device
is assumed to be reset, so the host and guest state are to be assumed
to be out of sync upon resume.  We handle the S4 case with exactly the
same scenario, so just point the suspend/resume routines to the
freeze/restore ones.

Once that is done, we also use the PM API's macro to initialise the
sleep functions.

A couple of cleanups are included: there's no need for special thaw
processing in the balloon driver, so that's addressed in patches 1 and
2.

Testing: both S3 and S4 support have been tested using these patches
using a similar method used earlier during S4 patch development: a
guest is started with virtio-blk as the only disk, a virtio network
card, a virtio-serial port and a virtio balloon device.  Ping from
guest to host, dd /dev/zero to a file on the disk, and IO from the
host on the virtio-serial port, all at once, while exercising S4 and
S3 (separately) were tested.  They all continue to work fine after
resume.  virtio balloon values too were tested by inflating and
deflating the balloon.

The following changes since commit fa2a4519cb6ad94224eb56a1341fff570fd44ea1:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc (2012-03-30 18:40:33 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/amit/virtio-console.git s3-for-3.4

Amit Shah (5):
      virtio: balloon: Allow stats update after restore from S4
      virtio: drop thaw PM operation
      virtio-pci: drop restore_common()
      virtio-pci: S3 support
      virtio-pci: switch to PM ops macro to initialise PM functions

 drivers/virtio/virtio_balloon.c |   14 -------
 drivers/virtio/virtio_pci.c     |   74 ++++----------------------------------
 include/linux/virtio.h          |    1 -
 3 files changed, 8 insertions(+), 81 deletions(-)

		Amit

^ permalink raw reply

* Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming
From: Ren Mingxin @ 2012-03-31  1:14 UTC (permalink / raw)
  To: Asias He
  Cc: Jens Axboe, KVM, SCSI, Michael S. Tsirkin, LKML, VIRTUAL,
	Tejun Heo
In-Reply-To: <CAFO3S427vWXHBDaRmfwqZrGZ2Y8RPUn7CQswQou9EGMh43jU8Q@mail.gmail.com>

  Hi, He:

On 03/30/2012 07:22 PM, Asias He wrote:
> On Fri, Mar 30, 2012 at 5:53 PM, Ren Mingxin<renmx@cn.fujitsu.com>  wrote:
>>   The current virtblk's naming algorithm only supports 263  disks.
>> If there are mass of virtblks(exceeding 263), there will be disks
>> with the same name.
> This fix is pretty nice. However, current virtblk's naming still
> supports up to 18278 disks, no?
> ( index 0  ->  vda, index 18277 ->  vdzzz ).

Sorry, I intended to type 26^3+ (26^2+26)...
It may still be a restriction which should be improved though.

Thanks,
Ren

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Raghavendra K T @ 2012-03-31  0:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: KVM, Konrad Rzeszutek Wilk, Peter Zijlstra, Stefano Stabellini,
	the arch/x86 maintainers, LKML, Virtualization, Andi Kleen,
	Avi Kivity, Jeremy Fitzhardinge, Srivatsa Vaddagiri, Attilio Rao,
	Ingo Molnar, Linus Torvalds, Xen Devel, Stephan Diestelhorst
In-Reply-To: <4F7616F5.4070000@zytor.com>

On 03/31/2012 01:56 AM, H. Peter Anvin wrote:
> What is the current status of this patchset?  I haven't looked at it too
> closely because I have been focused on 3.4 up until now...
>

Thanks Peter,

Currently targeting the patchset for next merge window. IMO These
patches are in good shape now. I' ll rebase these patches and send it
ASAP. It needs "jumplabel split patch" (from Andrew Jones) as
dependency, I would fold that patch (got ok from him for that) also
into series, and send them ASAP.

  - Raghu

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Andi Kleen @ 2012-03-31  0:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: the arch/x86 maintainers, KVM, Konrad Rzeszutek Wilk,
	Peter Zijlstra, Stefano Stabellini, Raghavendra K T, LKML,
	Andi Kleen, Srivatsa Vaddagiri, Avi Kivity, Jeremy Fitzhardinge,
	H. Peter Anvin, Attilio Rao, Ingo Molnar, Virtualization,
	Linus Torvalds, Xen Devel, Stephan Diestelhorst
In-Reply-To: <alpine.LFD.2.02.1203310044580.2542@ionos>

On Sat, Mar 31, 2012 at 01:04:41AM +0200, Thomas "Kubys" Gleixner wrote:
> On Sat, 31 Mar 2012, Andi Kleen wrote:
> 
> > > So if a guest exits due to an external event it's easy to inspect the
> > > state of that guest and avoid to schedule away when it was interrupted
> > > in a spinlock held section. That guest/host shared state needs to be
> > 
> > On a large system under high contention sleeping can perform surprisingly
> > well. pthread mutexes have a tendency to beat kernel spinlocks there.
> > So avoiding sleeping locks completely (that is what pv locks are
> > essentially) is not necessarily that good.
> 
> Care to back that up with numbers and proper trace evidence instead of
> handwaving?

E.g. my plumbers presentations on lock and mm scalability from last year has some 
graphs that show this very clearly, plus some additional data on the mutexes. 
This compares to the glibc futex locks, which perform much better than the kernel 
mutex locks on larger systems under higher contention

Given your tone I will not supply an URL. I'm sure you can find it if you
need it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH 3/4] block: replace rssd_disk_name_format() to disk_name_format()
From: Asai Thambi Samymuthu Pattrayasamy (asamymuthupa) [CONT - Type 2] @ 2012-03-30 23:54 UTC (permalink / raw)
  To: Ren Mingxin
  Cc: Jens Axboe, KVM, SCSI, Michael S. Tsirkin, LKML, VIRTUAL,
	Tejun Heo
In-Reply-To: <4F758297.5020600@cn.fujitsu.com>

On 3/30/2012 2:53 AM, Ren Mingxin wrote:

>  Currently, block core has been supplied "disk_name_format()", so
> we should remove duplicate function "rssd_disk_name_format()"
> and use the new function to format rssd disk names.
> 
> Signed-off-by: Ren Mingxin <renmx@cn.fujitsu.com>
> ---
>  mtip32xx.c |   33 +--------------------------------
>  1 file changed, 1 insertion(+), 32 deletions(-)


Looks fine.

Should the subject be "mtip32xx: ..." instead of "block: ..."? I
understand "block:" as relating to block core. I am fairly new here. If
"block:" can be used for block drivers too, that is fine too.

--
Regards,
Asai Thambi

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Thomas Gleixner @ 2012-03-30 23:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: the arch/x86 maintainers, KVM, Konrad Rzeszutek Wilk,
	Peter Zijlstra, Stefano Stabellini, Raghavendra K T, LKML,
	Xen Devel, Srivatsa Vaddagiri, Avi Kivity, Jeremy Fitzhardinge,
	H. Peter Anvin, Attilio Rao, Ingo Molnar, Virtualization,
	Linus Torvalds, Stephan Diestelhorst
In-Reply-To: <20120330221836.GN17822@one.firstfloor.org>

On Sat, 31 Mar 2012, Andi Kleen wrote:

> > So if a guest exits due to an external event it's easy to inspect the
> > state of that guest and avoid to schedule away when it was interrupted
> > in a spinlock held section. That guest/host shared state needs to be
> 
> On a large system under high contention sleeping can perform surprisingly
> well. pthread mutexes have a tendency to beat kernel spinlocks there.
> So avoiding sleeping locks completely (that is what pv locks are
> essentially) is not necessarily that good.

Care to back that up with numbers and proper trace evidence instead of
handwaving?

I've stared at RT traces and throughput problems on _LARGE_ machines
long enough to know what I'm talking about and I can provide evidence
in a split second.

> Your proposal is probably only a good idea on low contention
> and relatively small systems.

Sigh, you have really no fcking clue what you are talking about.

On RT we observed scalabilty problems way before hardware was
available to expose them. So what's your point?

Put up or shut up, really!

    tglx

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Andi Kleen @ 2012-03-30 22:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: the arch/x86 maintainers, KVM, Konrad Rzeszutek Wilk,
	Peter Zijlstra, Stefano Stabellini, Raghavendra K T, LKML,
	Andi Kleen, Srivatsa Vaddagiri, Avi Kivity, Jeremy Fitzhardinge,
	H. Peter Anvin, Attilio Rao, Ingo Molnar, Virtualization,
	Linus Torvalds, Xen Devel, Stephan Diestelhorst
In-Reply-To: <alpine.LFD.2.02.1203302333560.2542@ionos>

> So if a guest exits due to an external event it's easy to inspect the
> state of that guest and avoid to schedule away when it was interrupted
> in a spinlock held section. That guest/host shared state needs to be

On a large system under high contention sleeping can perform surprisingly
well. pthread mutexes have a tendency to beat kernel spinlocks there.
So avoiding sleeping locks completely (that is what pv locks are
essentially) is not necessarily that good.

Your proposal is probably only a good idea on low contention
and relatively small systems.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: Thomas Gleixner @ 2012-03-30 22:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: the arch/x86 maintainers, KVM, Konrad Rzeszutek Wilk,
	Peter Zijlstra, Stefano Stabellini, Raghavendra K T, LKML,
	Andi Kleen, Avi Kivity, Jeremy Fitzhardinge, Srivatsa Vaddagiri,
	Attilio Rao, Ingo Molnar, Virtualization, Linus Torvalds,
	Xen Devel, Stephan Diestelhorst
In-Reply-To: <4F7616F5.4070000@zytor.com>

On Fri, 30 Mar 2012, H. Peter Anvin wrote:

> What is the current status of this patchset?  I haven't looked at it too
> closely because I have been focused on 3.4 up until now...

The real question is whether these heuristics are the correct approach
or not.

If I look at it from the non virtualized kernel side then this is ass
backwards. We know already that we are holding a spinlock which might
cause other (v)cpus going into eternal spin. The non virtualized
kernel solves this by disabling preemption and therefor getting out of
the critical section as fast as possible,

The virtualization problem reminds me a lot of the problem which RT
kernels are observing where non raw spinlocks are turned into
"sleeping spinlocks" and therefor can cause throughput issues for non
RT workloads.

Though the virtualized situation is even worse. Any preempted guest
section which holds a spinlock is prone to cause unbound delays.

The paravirt ticketlock solution can only mitigate the problem, but
not solve it. With massive overcommit there is always a way to trigger
worst case scenarious unless you are educating the scheduler to cope
with that.

So if we need to fiddle with the scheduler and frankly that's the only
way to get a real gain (the numbers, which are achieved by this
patches, are not that impressive) then the question arises whether we
should turn the whole thing around.

I know that Peter is going to go berserk on me, but if we are running
a paravirt guest then it's simple to provide a mechanism which allows
the host (aka hypervisor) to check that in the guest just by looking
at some global state.

So if a guest exits due to an external event it's easy to inspect the
state of that guest and avoid to schedule away when it was interrupted
in a spinlock held section. That guest/host shared state needs to be
modified to indicate the guest to invoke an exit when the last nested
lock has been released.

Of course this needs to be time bound, so a rogue guest cannot
monopolize the cpu forever, but that's the least to worry about
problem simply because a guest which does not get out of a spinlocked
region within a certain amount of time is borked and elegible to
killing anyway.

Thoughts ?

Thanks,

	tglx

^ permalink raw reply

* Re: [Xen-devel] [PATCH RFC V5 1/6] debugfs: Add support to print u32 array in debugfs
From: Greg KH @ 2012-03-30 21:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Xen, Jeremy Fitzhardinge, Raghavendra K T, KVM, linux-doc, X86,
	LKML, Ingo Molnar, Srivatsa Vaddagiri, Avi Kivity, H. Peter Anvin,
	Virtualization, Greg Kroah-Hartman, Stefano Stabellini,
	Sasha Levin
In-Reply-To: <20120330204912.GA29329@phenom.dumpdata.com>

On Fri, Mar 30, 2012 at 04:49:12PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Mar 23, 2012 at 01:36:28PM +0530, Raghavendra K T wrote:
> > From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> > 
> > Move the code from Xen to debugfs to make the code common
> > for other users as well.
> > 
> > Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> > Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
> > Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> Greg,
> 
> I was thinking to stick this patch in my queue, but I need your
> OK since it touches fs/debugfs/file.c.

That's fine with me:

Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

^ permalink raw reply

* Re: [Xen-devel] [PATCH RFC V5 1/6] debugfs: Add support to print u32 array in debugfs
From: Konrad Rzeszutek Wilk @ 2012-03-30 20:49 UTC (permalink / raw)
  To: Raghavendra K T, gregkh
  Cc: Jeremy Fitzhardinge, Greg Kroah-Hartman, KVM, linux-doc, X86,
	LKML, Ingo Molnar, Srivatsa Vaddagiri, Avi Kivity, H. Peter Anvin,
	Virtualization, Xen, Stefano Stabellini, Sasha Levin
In-Reply-To: <20120323080606.14568.31335.sendpatchset@codeblue>

On Fri, Mar 23, 2012 at 01:36:28PM +0530, Raghavendra K T wrote:
> From: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> 
> Move the code from Xen to debugfs to make the code common
> for other users as well.
> 
> Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
> Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Greg,

I was thinking to stick this patch in my queue, but I need your
OK since it touches fs/debugfs/file.c.

> ---
> diff --git a/arch/x86/xen/debugfs.c b/arch/x86/xen/debugfs.c
> index ef1db19..c8377fb 100644
> --- a/arch/x86/xen/debugfs.c
> +++ b/arch/x86/xen/debugfs.c
> @@ -19,107 +19,3 @@ struct dentry * __init xen_init_debugfs(void)
>  	return d_xen_debug;
>  }
>  
> -struct array_data
> -{
> -	void *array;
> -	unsigned elements;
> -};
> -
> -static int u32_array_open(struct inode *inode, struct file *file)
> -{
> -	file->private_data = NULL;
> -	return nonseekable_open(inode, file);
> -}
> -
> -static size_t format_array(char *buf, size_t bufsize, const char *fmt,
> -			   u32 *array, unsigned array_size)
> -{
> -	size_t ret = 0;
> -	unsigned i;
> -
> -	for(i = 0; i < array_size; i++) {
> -		size_t len;
> -
> -		len = snprintf(buf, bufsize, fmt, array[i]);
> -		len++;	/* ' ' or '\n' */
> -		ret += len;
> -
> -		if (buf) {
> -			buf += len;
> -			bufsize -= len;
> -			buf[-1] = (i == array_size-1) ? '\n' : ' ';
> -		}
> -	}
> -
> -	ret++;		/* \0 */
> -	if (buf)
> -		*buf = '\0';
> -
> -	return ret;
> -}
> -
> -static char *format_array_alloc(const char *fmt, u32 *array, unsigned array_size)
> -{
> -	size_t len = format_array(NULL, 0, fmt, array, array_size);
> -	char *ret;
> -
> -	ret = kmalloc(len, GFP_KERNEL);
> -	if (ret == NULL)
> -		return NULL;
> -
> -	format_array(ret, len, fmt, array, array_size);
> -	return ret;
> -}
> -
> -static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
> -			      loff_t *ppos)
> -{
> -	struct inode *inode = file->f_path.dentry->d_inode;
> -	struct array_data *data = inode->i_private;
> -	size_t size;
> -
> -	if (*ppos == 0) {
> -		if (file->private_data) {
> -			kfree(file->private_data);
> -			file->private_data = NULL;
> -		}
> -
> -		file->private_data = format_array_alloc("%u", data->array, data->elements);
> -	}
> -
> -	size = 0;
> -	if (file->private_data)
> -		size = strlen(file->private_data);
> -
> -	return simple_read_from_buffer(buf, len, ppos, file->private_data, size);
> -}
> -
> -static int xen_array_release(struct inode *inode, struct file *file)
> -{
> -	kfree(file->private_data);
> -
> -	return 0;
> -}
> -
> -static const struct file_operations u32_array_fops = {
> -	.owner	= THIS_MODULE,
> -	.open	= u32_array_open,
> -	.release= xen_array_release,
> -	.read	= u32_array_read,
> -	.llseek = no_llseek,
> -};
> -
> -struct dentry *xen_debugfs_create_u32_array(const char *name, umode_t mode,
> -					    struct dentry *parent,
> -					    u32 *array, unsigned elements)
> -{
> -	struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
> -
> -	if (data == NULL)
> -		return NULL;
> -
> -	data->array = array;
> -	data->elements = elements;
> -
> -	return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
> -}
> diff --git a/arch/x86/xen/debugfs.h b/arch/x86/xen/debugfs.h
> index 78d2549..12ebf33 100644
> --- a/arch/x86/xen/debugfs.h
> +++ b/arch/x86/xen/debugfs.h
> @@ -3,8 +3,4 @@
>  
>  struct dentry * __init xen_init_debugfs(void);
>  
> -struct dentry *xen_debugfs_create_u32_array(const char *name, umode_t mode,
> -					    struct dentry *parent,
> -					    u32 *array, unsigned elements);
> -
>  #endif /* _XEN_DEBUGFS_H */
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index 4926974..b74cebb 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -314,7 +314,7 @@ static int __init xen_spinlock_debugfs(void)
>  	debugfs_create_u64("time_blocked", 0444, d_spin_debug,
>  			   &spinlock_stats.time_blocked);
>  
> -	xen_debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
> +	debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
>  				     spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
>  
>  	return 0;
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index ef023ee..cb6cff3 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -20,6 +20,7 @@
>  #include <linux/namei.h>
>  #include <linux/debugfs.h>
>  #include <linux/io.h>
> +#include <linux/slab.h>
>  
>  static ssize_t default_read_file(struct file *file, char __user *buf,
>  				 size_t count, loff_t *ppos)
> @@ -528,6 +529,133 @@ struct dentry *debugfs_create_blob(const char *name, umode_t mode,
>  }
>  EXPORT_SYMBOL_GPL(debugfs_create_blob);
>  
> +struct array_data {
> +	void *array;
> +	u32 elements;
> +};
> +
> +static int u32_array_open(struct inode *inode, struct file *file)
> +{
> +	file->private_data = NULL;
> +	return nonseekable_open(inode, file);
> +}
> +
> +static size_t format_array(char *buf, size_t bufsize, const char *fmt,
> +			   u32 *array, u32 array_size)
> +{
> +	size_t ret = 0;
> +	u32 i;
> +
> +	for (i = 0; i < array_size; i++) {
> +		size_t len;
> +
> +		len = snprintf(buf, bufsize, fmt, array[i]);
> +		len++;	/* ' ' or '\n' */
> +		ret += len;
> +
> +		if (buf) {
> +			buf += len;
> +			bufsize -= len;
> +			buf[-1] = (i == array_size-1) ? '\n' : ' ';
> +		}
> +	}
> +
> +	ret++;		/* \0 */
> +	if (buf)
> +		*buf = '\0';
> +
> +	return ret;
> +}
> +
> +static char *format_array_alloc(const char *fmt, u32 *array,
> +						u32 array_size)
> +{
> +	size_t len = format_array(NULL, 0, fmt, array, array_size);
> +	char *ret;
> +
> +	ret = kmalloc(len, GFP_KERNEL);
> +	if (ret == NULL)
> +		return NULL;
> +
> +	format_array(ret, len, fmt, array, array_size);
> +	return ret;
> +}
> +
> +static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
> +			      loff_t *ppos)
> +{
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	struct array_data *data = inode->i_private;
> +	size_t size;
> +
> +	if (*ppos == 0) {
> +		if (file->private_data) {
> +			kfree(file->private_data);
> +			file->private_data = NULL;
> +		}
> +
> +		file->private_data = format_array_alloc("%u", data->array,
> +							      data->elements);
> +	}
> +
> +	size = 0;
> +	if (file->private_data)
> +		size = strlen(file->private_data);
> +
> +	return simple_read_from_buffer(buf, len, ppos,
> +					file->private_data, size);
> +}
> +
> +static int u32_array_release(struct inode *inode, struct file *file)
> +{
> +	kfree(file->private_data);
> +
> +	return 0;
> +}
> +
> +static const struct file_operations u32_array_fops = {
> +	.owner	 = THIS_MODULE,
> +	.open	 = u32_array_open,
> +	.release = u32_array_release,
> +	.read	 = u32_array_read,
> +	.llseek  = no_llseek,
> +};
> +
> +/**
> + * debugfs_create_u32_array - create a debugfs file that is used to read u32
> + * array.
> + * @name: a pointer to a string containing the name of the file to create.
> + * @mode: the permission that the file should have.
> + * @parent: a pointer to the parent dentry for this file.  This should be a
> + *          directory dentry if set.  If this parameter is %NULL, then the
> + *          file will be created in the root of the debugfs filesystem.
> + * @array: u32 array that provides data.
> + * @elements: total number of elements in the array.
> + *
> + * This function creates a file in debugfs with the given name that exports
> + * @array as data. If the @mode variable is so set it can be read from.
> + * Writing is not supported. Seek within the file is also not supported.
> + * Once array is created its size can not be changed.
> + *
> + * The function returns a pointer to dentry on success. If debugfs is not
> + * enabled in the kernel, the value -%ENODEV will be returned.
> + */
> +struct dentry *debugfs_create_u32_array(const char *name, umode_t mode,
> +					    struct dentry *parent,
> +					    u32 *array, u32 elements)
> +{
> +	struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
> +
> +	if (data == NULL)
> +		return NULL;
> +
> +	data->array = array;
> +	data->elements = elements;
> +
> +	return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
> +}
> +EXPORT_SYMBOL_GPL(debugfs_create_u32_array);
> +
>  #ifdef CONFIG_HAS_IOMEM
>  
>  /*
> diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
> index 6169c26..5cb4435 100644
> --- a/include/linux/debugfs.h
> +++ b/include/linux/debugfs.h
> @@ -93,6 +93,10 @@ struct dentry *debugfs_create_regset32(const char *name, mode_t mode,
>  int debugfs_print_regs32(struct seq_file *s, const struct debugfs_reg32 *regs,
>  			 int nregs, void __iomem *base, char *prefix);
>  
> +struct dentry *debugfs_create_u32_array(const char *name, umode_t mode,
> +					struct dentry *parent,
> +					u32 *array, u32 elements);
> +
>  bool debugfs_initialized(void);
>  
>  #else
> @@ -219,6 +223,13 @@ static inline bool debugfs_initialized(void)
>  	return false;
>  }
>  
> +struct dentry *debugfs_create_u32_array(const char *name, umode_t mode,
> +					struct dentry *parent,
> +					u32 *array, u32 elements)
> +{
> +	return ERR_PTR(-ENODEV);
> +}
> +
>  #endif
>  
>  #endif
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply

* Re: [PATCH RFC V6 0/11] Paravirtualized ticketlocks
From: H. Peter Anvin @ 2012-03-30 20:26 UTC (permalink / raw)
  To: Raghavendra K T
  Cc: KVM, Konrad Rzeszutek Wilk, Peter Zijlstra, Stefano Stabellini,
	the arch/x86 maintainers, LKML, Virtualization, Andi Kleen,
	Avi Kivity, Jeremy Fitzhardinge, Srivatsa Vaddagiri, Attilio Rao,
	Ingo Molnar, Linus Torvalds, Xen Devel, Stephan Diestelhorst
In-Reply-To: <20120321102041.473.61069.sendpatchset@codeblue.in.ibm.com>

What is the current status of this patchset?  I haven't looked at it too
closely because I have been focused on 3.4 up until now...

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply

* Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming
From: Tejun Heo @ 2012-03-30 15:38 UTC (permalink / raw)
  To: Ren Mingxin; +Cc: Jens Axboe, SCSI, KVM, Michael S. Tsirkin, LKML, VIRTUAL
In-Reply-To: <20120330152606.GB28934@google.com>

On Fri, Mar 30, 2012 at 08:26:06AM -0700, Tejun Heo wrote:
> On Fri, Mar 30, 2012 at 05:53:52PM +0800, Ren Mingxin wrote:
> >  The current virtblk's naming algorithm only supports 263  disks.
> > If there are mass of virtblks(exceeding 263), there will be disks
> > with the same name.
> > 
> > By renaming "sd_format_disk_name()" to "disk_name_format()"
> > and moving it into block core, virtio_blk can use this function to
> > support mass of disks.
> > 
> > Signed-off-by: Ren Mingxin <renmx@cn.fujitsu.com>
> 
> I guess it's already way too late but why couldn't they have been
> named vdD-P where both D and P are integers denoting disk number and

So, if a device name ends in digit the partition code adds delimiter
'p' automatically and this is already in use by md.  So, partitioned
md devices are named mdDpP where D and P are base 10 number indicating
the sequence like md12p4.  So, let's please add comment that new
drivers should name their devices PREFIX%d where the sequence number
can be allocated by ida.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH 4/4] virtio_blk: use disk_name_format() to support mass of disks naming
From: Tejun Heo @ 2012-03-30 15:28 UTC (permalink / raw)
  To: Ren Mingxin; +Cc: Jens Axboe, SCSI, KVM, Michael S. Tsirkin, LKML, VIRTUAL
In-Reply-To: <20120330152606.GB28934@google.com>

On Fri, Mar 30, 2012 at 08:26:06AM -0700, Tejun Heo wrote:
> On Fri, Mar 30, 2012 at 05:53:52PM +0800, Ren Mingxin wrote:
> >  The current virtblk's naming algorithm only supports 263  disks.
> > If there are mass of virtblks(exceeding 263), there will be disks
> > with the same name.
> > 
> > By renaming "sd_format_disk_name()" to "disk_name_format()"
> > and moving it into block core, virtio_blk can use this function to
> > support mass of disks.
> > 
> > Signed-off-by: Ren Mingxin <renmx@cn.fujitsu.com>
> 
> I guess it's already way too late but why couldn't they have been
> named vdD-P where both D and P are integers denoting disk number and
> partition number?  [sh]dX's were created when there weren't supposed
> to be too many disks, so we had to come up with the horrible alphabet
> based numbering scheme but vd is new enough.  I mean, naming is one
> thing but who wants to figure out which sequence is or guess what
> comes next vdzz9?  :(
> 
> If we're gonna move it to block layer, let's add big blinking red
> comment saying "don't ever use it for any new driver".

And also let's make that clear in the function name - say,
format_legacy_disk_name() or something.

Thanks.

-- 
tejun

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox