qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] qemu_cond_signal() taking a long time to complete.
@ 2010-10-04 10:40 Arun R Bharadwaj
  2010-10-04 13:58 ` Stefan Hajnoczi
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Arun R Bharadwaj @ 2010-10-04 10:40 UTC (permalink / raw)
  To: qemu-devel; +Cc: sripathik, arun, mohan, jvrao, aneesh.kumar

Hi,

I am working on introducing threading model into Qemu. This introduces
the Threadlets infrastructure which allows subsystems to offload possibly 
blocking work to a queue to be processed by a pool of threads asynchrnonously.
Threadlets are useful when there are operations that can be performed
outside the context of the VCPU/IO threads inorder to free these latter
to service any other guest requests. As of now we have converted a few
v9fs calls like v9fs_read, v9fs_write etc to this model to test the
working nature of this model.

I observed that performance is degrading in the threading model for the
following reason:

Taking the example of v9fs_read call: We submit the blocking work in
v9fs_read to a queue and do a qemu_cond_signal(). the work will be picked 
up by a worker thread which is waiting on the condition variable to go
true. I measured the time taken to execute the v9fs_read call; in the
case without the threading model, it takes around 15microseconds to
complete, while in the threading model, it takes around 30microsends
to complete. Most of this extra time (around 22microsends) is spent in
completing the qemu_cond_signal() call. I suspect this is the reason why
I am seeing performance hit with the threading model, because this
time is much more than the time needed to complete the entire
v9fs_read call in the non threading model case.

I need advice on how to proceed from this situation. Pasting relevant
code snippets below.

thanks
arun.
---

/* Code to sumbit work to the queue */
void submit_threadlet_to_queue(ThreadletQueue *queue, ThreadletWork
*work)
{
    qemu_mutex_lock(&(queue->lock));
    if (queue->idle_threads == 0 && queue->cur_threads <
queue->max_threads) {
        spawn_threadlet(queue);
    }
    QTAILQ_INSERT_TAIL(&(queue->request_list), work, node);
    qemu_cond_signal(&(queue->cond));
    qemu_mutex_unlock(&(queue->lock));
}

/* Worker thread code */
static void *threadlet_worker(void *data)
{
    ThreadletQueue *queue = data;

    while (1) {
        ThreadletWork *work;
        int ret = 0;
        qemu_mutex_lock(&(queue->lock));

        while (QTAILQ_EMPTY(&(queue->request_list)) &&
               (ret != ETIMEDOUT)) {
            ret = qemu_cond_timedwait(&(queue->cond),
                                         &(queue->lock), 10*100000);
        }

        assert(queue->idle_threads != 0);
        if (QTAILQ_EMPTY(&(queue->request_list))) {
            if (queue->cur_threads > queue->min_threads) {
                /* We retain the minimum number of threads */
                break;
            }
        } else {
            work = QTAILQ_FIRST(&(queue->request_list));
            QTAILQ_REMOVE(&(queue->request_list), work, node);

            queue->idle_threads--;
            qemu_mutex_unlock(&(queue->lock));

            /* execute the work function */
            work->func(work);

            qemu_mutex_lock(&(queue->lock));
            queue->idle_threads++;
        }
        qemu_mutex_unlock(&(queue->lock));
    }

    queue->idle_threads--;
    queue->cur_threads--;

    return NULL;
}

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] qemu_cond_signal() taking a long time to complete.
  2010-10-04 10:40 [Qemu-devel] qemu_cond_signal() taking a long time to complete Arun R Bharadwaj
@ 2010-10-04 13:58 ` Stefan Hajnoczi
  2010-10-04 15:37 ` Anthony Liguori
  2010-10-05  4:26 ` Venkateswararao Jujjuri (JV)
  2 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2010-10-04 13:58 UTC (permalink / raw)
  To: arun; +Cc: mohan, sripathik, jvrao, qemu-devel, aneesh.kumar

On Mon, Oct 4, 2010 at 11:40 AM, Arun R Bharadwaj
<arun@linux.vnet.ibm.com> wrote:
> I suspect this is the reason why
> I am seeing performance hit with the threading model, because this
> time is much more than the time needed to complete the entire
> v9fs_read call in the non threading model case.

If the cost of the thread pool is in the same range or higher than the
cost of the operation itself, then the thread pool is not a good
choice.

I think 15 microseconds overhead sounds high but not massively.  Can
you profile it in more detail or collect lock statistics (hold time,
contention, etc)?

> ---
>
> /* Code to sumbit work to the queue */
> void submit_threadlet_to_queue(ThreadletQueue *queue, ThreadletWork
> *work)
> {
>    qemu_mutex_lock(&(queue->lock));
>    if (queue->idle_threads == 0 && queue->cur_threads <
> queue->max_threads) {
>        spawn_threadlet(queue);
>    }
>    QTAILQ_INSERT_TAIL(&(queue->request_list), work, node);
>    qemu_cond_signal(&(queue->cond));
>    qemu_mutex_unlock(&(queue->lock));
> }
>
> /* Worker thread code */
> static void *threadlet_worker(void *data)
> {
>    ThreadletQueue *queue = data;
>
>    while (1) {
>        ThreadletWork *work;
>        int ret = 0;
>        qemu_mutex_lock(&(queue->lock));

Move this above the while() loop.

>
>        while (QTAILQ_EMPTY(&(queue->request_list)) &&
>               (ret != ETIMEDOUT)) {
>            ret = qemu_cond_timedwait(&(queue->cond),
>                                         &(queue->lock), 10*100000);
>        }
>
>        assert(queue->idle_threads != 0);
>        if (QTAILQ_EMPTY(&(queue->request_list))) {
>            if (queue->cur_threads > queue->min_threads) {
>                /* We retain the minimum number of threads */
>                break;

Note we're breaking here with queue->lock still held!

>            }
>        } else {
>            work = QTAILQ_FIRST(&(queue->request_list));
>            QTAILQ_REMOVE(&(queue->request_list), work, node);
>
>            queue->idle_threads--;
>            qemu_mutex_unlock(&(queue->lock));
>
>            /* execute the work function */
>            work->func(work);
>
>            qemu_mutex_lock(&(queue->lock));
>            queue->idle_threads++;
>        }
>        qemu_mutex_unlock(&(queue->lock));

Move this down before return NULL.

>    }
>
>    queue->idle_threads--;
>    queue->cur_threads--;
>
>    return NULL;
> }

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] qemu_cond_signal() taking a long time to complete.
  2010-10-04 10:40 [Qemu-devel] qemu_cond_signal() taking a long time to complete Arun R Bharadwaj
  2010-10-04 13:58 ` Stefan Hajnoczi
@ 2010-10-04 15:37 ` Anthony Liguori
  2010-10-04 15:49   ` Stefan Hajnoczi
  2010-10-05  4:26 ` Venkateswararao Jujjuri (JV)
  2 siblings, 1 reply; 6+ messages in thread
From: Anthony Liguori @ 2010-10-04 15:37 UTC (permalink / raw)
  To: arun; +Cc: mohan, sripathik, jvrao, qemu-devel, aneesh.kumar

On 10/04/2010 05:40 AM, Arun R Bharadwaj wrote:
> Hi,
>
> I am working on introducing threading model into Qemu. This introduces
> the Threadlets infrastructure which allows subsystems to offload possibly
> blocking work to a queue to be processed by a pool of threads asynchrnonously.
> Threadlets are useful when there are operations that can be performed
> outside the context of the VCPU/IO threads inorder to free these latter
> to service any other guest requests. As of now we have converted a few
> v9fs calls like v9fs_read, v9fs_write etc to this model to test the
> working nature of this model.
>
> I observed that performance is degrading in the threading model for the
> following reason:
>
> Taking the example of v9fs_read call: We submit the blocking work in
> v9fs_read to a queue and do a qemu_cond_signal(). the work will be picked
> up by a worker thread which is waiting on the condition variable to go
> true. I measured the time taken to execute the v9fs_read call; in the
> case without the threading model, it takes around 15microseconds to
> complete, while in the threading model, it takes around 30microsends
> to complete. Most of this extra time (around 22microsends) is spent in
> completing the qemu_cond_signal() call. I suspect this is the reason why
> I am seeing performance hit with the threading model, because this
> time is much more than the time needed to complete the entire
> v9fs_read call in the non threading model case.
>
> I need advice on how to proceed from this situation. Pasting relevant
> code snippets below.
>
> thanks
> arun.
> ---
>
> /* Code to sumbit work to the queue */
> void submit_threadlet_to_queue(ThreadletQueue *queue, ThreadletWork
> *work)
> {
>      qemu_mutex_lock(&(queue->lock));
>      if (queue->idle_threads == 0&&  queue->cur_threads<
> queue->max_threads) {
>          spawn_threadlet(queue);
>      }
>      QTAILQ_INSERT_TAIL(&(queue->request_list), work, node);
>      qemu_cond_signal(&(queue->cond));
>      qemu_mutex_unlock(&(queue->lock));
> }
>    

Try moving qemu_cond_signal past qemu_mutex_unlock().

Regards,

Anthony Liguori

> /* Worker thread code */
> static void *threadlet_worker(void *data)
> {
>      ThreadletQueue *queue = data;
>
>      while (1) {
>          ThreadletWork *work;
>          int ret = 0;
>          qemu_mutex_lock(&(queue->lock));
>
>          while (QTAILQ_EMPTY(&(queue->request_list))&&
>                 (ret != ETIMEDOUT)) {
>              ret = qemu_cond_timedwait(&(queue->cond),
>                                           &(queue->lock), 10*100000);
>          }
>
>          assert(queue->idle_threads != 0);
>          if (QTAILQ_EMPTY(&(queue->request_list))) {
>              if (queue->cur_threads>  queue->min_threads) {
>                  /* We retain the minimum number of threads */
>                  break;
>              }
>          } else {
>              work = QTAILQ_FIRST(&(queue->request_list));
>              QTAILQ_REMOVE(&(queue->request_list), work, node);
>
>              queue->idle_threads--;
>              qemu_mutex_unlock(&(queue->lock));
>
>              /* execute the work function */
>              work->func(work);
>
>              qemu_mutex_lock(&(queue->lock));
>              queue->idle_threads++;
>          }
>          qemu_mutex_unlock(&(queue->lock));
>      }
>
>      queue->idle_threads--;
>      queue->cur_threads--;
>
>      return NULL;
> }
>
>    

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] qemu_cond_signal() taking a long time to complete.
  2010-10-04 15:37 ` Anthony Liguori
@ 2010-10-04 15:49   ` Stefan Hajnoczi
  2010-10-05  3:58     ` Venkateswararao Jujjuri (JV)
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2010-10-04 15:49 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: mohan, qemu-devel, sripathik, aneesh.kumar, arun, jvrao

On Mon, Oct 4, 2010 at 4:37 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 10/04/2010 05:40 AM, Arun R Bharadwaj wrote:
>>
>> Hi,
>>
>> I am working on introducing threading model into Qemu. This introduces
>> the Threadlets infrastructure which allows subsystems to offload possibly
>> blocking work to a queue to be processed by a pool of threads
>> asynchrnonously.
>> Threadlets are useful when there are operations that can be performed
>> outside the context of the VCPU/IO threads inorder to free these latter
>> to service any other guest requests. As of now we have converted a few
>> v9fs calls like v9fs_read, v9fs_write etc to this model to test the
>> working nature of this model.
>>
>> I observed that performance is degrading in the threading model for the
>> following reason:
>>
>> Taking the example of v9fs_read call: We submit the blocking work in
>> v9fs_read to a queue and do a qemu_cond_signal(). the work will be picked
>> up by a worker thread which is waiting on the condition variable to go
>> true. I measured the time taken to execute the v9fs_read call; in the
>> case without the threading model, it takes around 15microseconds to
>> complete, while in the threading model, it takes around 30microsends
>> to complete. Most of this extra time (around 22microsends) is spent in
>> completing the qemu_cond_signal() call. I suspect this is the reason why
>> I am seeing performance hit with the threading model, because this
>> time is much more than the time needed to complete the entire
>> v9fs_read call in the non threading model case.
>>
>> I need advice on how to proceed from this situation. Pasting relevant
>> code snippets below.
>>
>> thanks
>> arun.
>> ---
>>
>> /* Code to sumbit work to the queue */
>> void submit_threadlet_to_queue(ThreadletQueue *queue, ThreadletWork
>> *work)
>> {
>>     qemu_mutex_lock(&(queue->lock));
>>     if (queue->idle_threads == 0&&  queue->cur_threads<
>> queue->max_threads) {
>>         spawn_threadlet(queue);
>>     }
>>     QTAILQ_INSERT_TAIL(&(queue->request_list), work, node);
>>     qemu_cond_signal(&(queue->cond));
>>     qemu_mutex_unlock(&(queue->lock));
>> }
>>
>
> Try moving qemu_cond_signal past qemu_mutex_unlock().

There are race conditions here:

1. When a new threadlet is started because there are no idle threads,
qemu_cond_signal() may fire a blank because the threadlet isn't inside
qemu_cond_timedwait() yet.  The result, the work item is deadlocked
until another thread grabs more work off the queue.  If I'm reading
the code correctly this bug is currently present!
2. Moving qemu_cond_signal() outside queue->lock is dangerous for the
same reason: you need to be careful not to qemu_cond_signal() when the
thread isn't inside qemu_cond_timedwait().

Stefan

Stefan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] qemu_cond_signal() taking a long time to complete.
  2010-10-04 15:49   ` Stefan Hajnoczi
@ 2010-10-05  3:58     ` Venkateswararao Jujjuri (JV)
  0 siblings, 0 replies; 6+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2010-10-05  3:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: mohan, qemu-devel, sripathik, aneesh.kumar, arun

On 10/4/2010 8:49 AM, Stefan Hajnoczi wrote:
> On Mon, Oct 4, 2010 at 4:37 PM, Anthony Liguori<anthony@codemonkey.ws>  wrote:
>> On 10/04/2010 05:40 AM, Arun R Bharadwaj wrote:
>>>
>>> Hi,
>>>
>>> I am working on introducing threading model into Qemu. This introduces
>>> the Threadlets infrastructure which allows subsystems to offload possibly
>>> blocking work to a queue to be processed by a pool of threads
>>> asynchrnonously.
>>> Threadlets are useful when there are operations that can be performed
>>> outside the context of the VCPU/IO threads inorder to free these latter
>>> to service any other guest requests. As of now we have converted a few
>>> v9fs calls like v9fs_read, v9fs_write etc to this model to test the
>>> working nature of this model.
>>>
>>> I observed that performance is degrading in the threading model for the
>>> following reason:
>>>
>>> Taking the example of v9fs_read call: We submit the blocking work in
>>> v9fs_read to a queue and do a qemu_cond_signal(). the work will be picked
>>> up by a worker thread which is waiting on the condition variable to go
>>> true. I measured the time taken to execute the v9fs_read call; in the
>>> case without the threading model, it takes around 15microseconds to
>>> complete, while in the threading model, it takes around 30microsends
>>> to complete. Most of this extra time (around 22microsends) is spent in
>>> completing the qemu_cond_signal() call. I suspect this is the reason why
>>> I am seeing performance hit with the threading model, because this
>>> time is much more than the time needed to complete the entire
>>> v9fs_read call in the non threading model case.
>>>
>>> I need advice on how to proceed from this situation. Pasting relevant
>>> code snippets below.
>>>
>>> thanks
>>> arun.
>>> ---
>>>
>>> /* Code to sumbit work to the queue */
>>> void submit_threadlet_to_queue(ThreadletQueue *queue, ThreadletWork
>>> *work)
>>> {
>>>      qemu_mutex_lock(&(queue->lock));
>>>      if (queue->idle_threads == 0&&    queue->cur_threads<
>>> queue->max_threads) {
>>>          spawn_threadlet(queue);
>>>      }
>>>      QTAILQ_INSERT_TAIL(&(queue->request_list), work, node);
>>>      qemu_cond_signal(&(queue->cond));
>>>      qemu_mutex_unlock(&(queue->lock));
>>> }
>>>
>>
>> Try moving qemu_cond_signal past qemu_mutex_unlock().
>
> There are race conditions here:
>
> 1. When a new threadlet is started because there are no idle threads,
> qemu_cond_signal() may fire a blank because the threadlet isn't inside
> qemu_cond_timedwait() yet.  The result, the work item is deadlocked
> until another thread grabs more work off the queue.  If I'm reading

Since it is timedwait, it is either the time expiry or more work is added to the 
queue.

- JV


> the code correctly this bug is currently present!
> 2. Moving qemu_cond_signal() outside queue->lock is dangerous for the
> same reason: you need to be careful not to qemu_cond_signal() when the
> thread isn't inside qemu_cond_timedwait().
>
> Stefan
>
> Stefan

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] qemu_cond_signal() taking a long time to complete.
  2010-10-04 10:40 [Qemu-devel] qemu_cond_signal() taking a long time to complete Arun R Bharadwaj
  2010-10-04 13:58 ` Stefan Hajnoczi
  2010-10-04 15:37 ` Anthony Liguori
@ 2010-10-05  4:26 ` Venkateswararao Jujjuri (JV)
  2 siblings, 0 replies; 6+ messages in thread
From: Venkateswararao Jujjuri (JV) @ 2010-10-05  4:26 UTC (permalink / raw)
  To: arun; +Cc: mohan, sripathik, qemu-devel, aneesh.kumar

On 10/4/2010 3:40 AM, Arun R Bharadwaj wrote:
> Hi,
>
> I am working on introducing threading model into Qemu. This introduces
> the Threadlets infrastructure which allows subsystems to offload possibly
> blocking work to a queue to be processed by a pool of threads asynchrnonously.
> Threadlets are useful when there are operations that can be performed
> outside the context of the VCPU/IO threads inorder to free these latter
> to service any other guest requests. As of now we have converted a few
> v9fs calls like v9fs_read, v9fs_write etc to this model to test the
> working nature of this model.

My numbers are little different.

I had a patch with only read() happening through threadlet infrastructure.
Everything else is with old code i.e through vCPU thread.

With SMP=1,
Three threads of  dd of=/dev/null if=<file> bs=4k count=100000
Read on threadlet code gave - 21.6 MB/s
Everything on vCPU              - 24.2 MB/s

Not much difference.

With SMP=4
Read on threadlet code gave - 32 MB/s
Everything on vCPU               - 28 MB/s

Here threadlet code is better... still not by much.

In either case, Threadlet code is giving better system time.

vCPU model is taking around 36 sec system time

Typical everything in vCPU model:
real	0m44.885s
user	0m0.130s
sys	0m36.237s

Typical read in threadlet model:
real	0m37.354s
user	0m0.082s
sys	0m6.006s

Thanks,
JV


>
> I observed that performance is degrading in the threading model for the
> following reason:
>
> Taking the example of v9fs_read call: We submit the blocking work in
> v9fs_read to a queue and do a qemu_cond_signal(). the work will be picked
> up by a worker thread which is waiting on the condition variable to go
> true. I measured the time taken to execute the v9fs_read call; in the
> case without the threading model, it takes around 15microseconds to
> complete, while in the threading model, it takes around 30microsends
> to complete. Most of this extra time (around 22microsends) is spent in
> completing the qemu_cond_signal() call. I suspect this is the reason why
> I am seeing performance hit with the threading model, because this
> time is much more than the time needed to complete the entire
> v9fs_read call in the non threading model case.
>
> I need advice on how to proceed from this situation. Pasting relevant
> code snippets below.
>
> thanks
> arun.
> ---
>
> /* Code to sumbit work to the queue */
> void submit_threadlet_to_queue(ThreadletQueue *queue, ThreadletWork
> *work)
> {
>      qemu_mutex_lock(&(queue->lock));
>      if (queue->idle_threads == 0&&  queue->cur_threads<
> queue->max_threads) {
>          spawn_threadlet(queue);
>      }
>      QTAILQ_INSERT_TAIL(&(queue->request_list), work, node);
>      qemu_cond_signal(&(queue->cond));
>      qemu_mutex_unlock(&(queue->lock));
> }
>
> /* Worker thread code */
> static void *threadlet_worker(void *data)
> {
>      ThreadletQueue *queue = data;
>
>      while (1) {
>          ThreadletWork *work;
>          int ret = 0;
>          qemu_mutex_lock(&(queue->lock));
>
>          while (QTAILQ_EMPTY(&(queue->request_list))&&
>                 (ret != ETIMEDOUT)) {
>              ret = qemu_cond_timedwait(&(queue->cond),
>                                           &(queue->lock), 10*100000);
>          }
>
>          assert(queue->idle_threads != 0);
>          if (QTAILQ_EMPTY(&(queue->request_list))) {
>              if (queue->cur_threads>  queue->min_threads) {
>                  /* We retain the minimum number of threads */
>                  break;
>              }
>          } else {
>              work = QTAILQ_FIRST(&(queue->request_list));
>              QTAILQ_REMOVE(&(queue->request_list), work, node);
>
>              queue->idle_threads--;
>              qemu_mutex_unlock(&(queue->lock));
>
>              /* execute the work function */
>              work->func(work);
>
>              qemu_mutex_lock(&(queue->lock));
>              queue->idle_threads++;
>          }
>          qemu_mutex_unlock(&(queue->lock));
>      }
>
>      queue->idle_threads--;
>      queue->cur_threads--;
>
>      return NULL;
> }
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2010-10-05  4:26 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-10-04 10:40 [Qemu-devel] qemu_cond_signal() taking a long time to complete Arun R Bharadwaj
2010-10-04 13:58 ` Stefan Hajnoczi
2010-10-04 15:37 ` Anthony Liguori
2010-10-04 15:49   ` Stefan Hajnoczi
2010-10-05  3:58     ` Venkateswararao Jujjuri (JV)
2010-10-05  4:26 ` Venkateswararao Jujjuri (JV)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).