public inbox for linux-kernel@vger.kernel.org
* request based device mapper in Linux
@ 2011-07-20  8:26 Lukas Hejtmanek
  2011-07-21 11:11 ` Kiyoshi Ueda
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Hejtmanek @ 2011-07-20  8:26 UTC (permalink / raw)
  To: k-ueda; +Cc: agk, linux-kernel

Hi,

I encounter serious problems with your commit
cec47e3d4a861e1d942b3a580d0bbef2700d2bb2, which introduced request-based
device mapper in Linux.

I have a machine with 80 SATA disks connected to two LSI SAS 2.0 controllers
(mpt2sas driver).

All disks are configured as multipath devices in failover mode:

defaults {
        udev_dir        /dev
        polling_interval 10
        selector        "round-robin 0"
        path_grouping_policy    failover
        path_checker    directio
        rr_min_io       100
        no_path_retry   queue
        user_friendly_names no
}

If I run the following command, ksoftirqd eats 100% CPU as soon as all
available memory is used for buffers.

for i in `seq 0 79`; do dd if=/dev/dm-$i of=/dev/null bs=1M count=10000 & done

top looks like this:

Mem:     48390M total,    45741M used,     2649M free,    43243M buffers
Swap:        0M total,        0M used,        0M free,     1496M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                       
   12 root      20   0     0    0    0 R   96  0.0   0:38.78 ksoftirqd/4                                   
17263 root      20   0  9432 1752  616 R   14  0.0   0:03.19 dd                                            
17275 root      20   0  9432 1756  616 D   14  0.0   0:03.16 dd                                            
17271 root      20   0  9432 1756  616 D   10  0.0   0:02.60 dd                                            
17258 root      20   0  9432 1756  616 D    7  0.0   0:02.67 dd                                            
17260 root      20   0  9432 1756  616 D    7  0.0   0:02.47 dd                                            
17262 root      20   0  9432 1752  616 D    7  0.0   0:02.38 dd                                            
17264 root      20   0  9432 1756  616 D    7  0.0   0:02.42 dd                                            
17267 root      20   0  9432 1756  616 D    7  0.0   0:02.35 dd                                            
17268 root      20   0  9432 1756  616 D    7  0.0   0:02.45 dd                                            
17274 root      20   0  9432 1756  616 D    7  0.0   0:02.47 dd                                            
17277 root      20   0  9432 1756  616 D    7  0.0   0:02.53 dd                                            
17261 root      20   0  9432 1756  616 D    7  0.0   0:02.36 dd                                            
17265 root      20   0  9432 1756  616 R    7  0.0   0:02.47 dd                                            
17266 root      20   0  9432 1756  616 R    7  0.0   0:02.44 dd                                            
17269 root      20   0  9432 1756  616 D    7  0.0   0:02.62 dd                                            
17270 root      20   0  9432 1756  616 D    7  0.0   0:02.46 dd                                            
17272 root      20   0  9432 1756  616 D    7  0.0   0:02.36 dd                                            
17273 root      20   0  9432 1756  616 D    7  0.0   0:02.46 dd                                            
17276 root      20   0  9432 1752  616 D    7  0.0   0:02.36 dd                                            
17278 root      20   0  9432 1752  616 D    7  0.0   0:02.44 dd                                            
17259 root      20   0  9432 1752  616 D    6  0.0   0:02.37 dd 


It looks like device mapper produces long SG lists and end_clone_bio() has
something like quadratic complexity.

The problem can be worked around using:
for i in /sys/block/dm-*; do echo 128 > $i/queue/max_sectors_kb; done

to shorten the SG lists.

I use the SLES 2.6.32.36-0.5-default kernel.

Using iostat -x, I can see there are about 25000 rrqm/s while there are only
180 r/s, so it looks like each request contains more than 100 bios, which
makes serious trouble for the ksoftirqd callbacks.

Without the mentioned workaround, I got only 600MB/s as the sum of all dd readers.
With the workaround, I got about 2.8GB/s as the sum of all dd readers.

-- 
Lukáš Hejtmánek

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: request based device mapper in Linux
  2011-07-20  8:26 request based device mapper in Linux Lukas Hejtmanek
@ 2011-07-21 11:11 ` Kiyoshi Ueda
  2011-07-21 13:26   ` Lukas Hejtmanek
  0 siblings, 1 reply; 16+ messages in thread
From: Kiyoshi Ueda @ 2011-07-21 11:11 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: agk, linux-kernel

Hi Lukas,

Lukas Hejtmanek wrote:
> Hi,
> 
> I encounter serious problems with your commit
> cec47e3d4a861e1d942b3a580d0bbef2700d2bb2, which introduced request-based
> device mapper in Linux.
> 
> I have a machine with 80 SATA disks connected to two LSI SAS 2.0 controllers
> (mpt2sas driver).
> 
> All disks are configured as multipath devices in failover mode:
> 
> defaults {
>         udev_dir        /dev
>         polling_interval 10
>         selector        "round-robin 0"
>         path_grouping_policy    failover
>         path_checker    directio
>         rr_min_io       100
>         no_path_retry   queue
>         user_friendly_names no
> }
> 
> If I run the following command, ksoftirqd eats 100% CPU as soon as all
> available memory is used for buffers.
> 
> for i in `seq 0 79`; do dd if=/dev/dm-$i of=/dev/null bs=1M count=10000 & done
> 
> top looks like this:
> 
> Mem:     48390M total,    45741M used,     2649M free,    43243M buffers
> Swap:        0M total,        0M used,        0M free,     1496M cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                       
>    12 root      20   0     0    0    0 R   96  0.0   0:38.78 ksoftirqd/4                                   
> 17263 root      20   0  9432 1752  616 R   14  0.0   0:03.19 dd                                            
> 17275 root      20   0  9432 1756  616 D   14  0.0   0:03.16 dd                                            
> 17271 root      20   0  9432 1756  616 D   10  0.0   0:02.60 dd                                            
> 17258 root      20   0  9432 1756  616 D    7  0.0   0:02.67 dd                                            
> 17260 root      20   0  9432 1756  616 D    7  0.0   0:02.47 dd                                            
> 17262 root      20   0  9432 1752  616 D    7  0.0   0:02.38 dd                                            
> 17264 root      20   0  9432 1756  616 D    7  0.0   0:02.42 dd                                            
> 17267 root      20   0  9432 1756  616 D    7  0.0   0:02.35 dd                                            
> 17268 root      20   0  9432 1756  616 D    7  0.0   0:02.45 dd                                            
> 17274 root      20   0  9432 1756  616 D    7  0.0   0:02.47 dd                                            
> 17277 root      20   0  9432 1756  616 D    7  0.0   0:02.53 dd                                            
> 17261 root      20   0  9432 1756  616 D    7  0.0   0:02.36 dd                                            
> 17265 root      20   0  9432 1756  616 R    7  0.0   0:02.47 dd                                            
> 17266 root      20   0  9432 1756  616 R    7  0.0   0:02.44 dd                                            
> 17269 root      20   0  9432 1756  616 D    7  0.0   0:02.62 dd                                            
> 17270 root      20   0  9432 1756  616 D    7  0.0   0:02.46 dd                                            
> 17272 root      20   0  9432 1756  616 D    7  0.0   0:02.36 dd                                            
> 17273 root      20   0  9432 1756  616 D    7  0.0   0:02.46 dd                                            
> 17276 root      20   0  9432 1752  616 D    7  0.0   0:02.36 dd                                            
> 17278 root      20   0  9432 1752  616 D    7  0.0   0:02.44 dd                                            
> 17259 root      20   0  9432 1752  616 D    6  0.0   0:02.37 dd 
> 
> 
> It looks like device mapper produces long SG lists and end_clone_bio() has
> something like quadratic complexity.
> 
> The problem can be worked around using:
> for i in /sys/block/dm-*; do echo 128 > $i/queue/max_sectors_kb; done
> 
> to shorten the SG lists.
> 
> I use the SLES 2.6.32.36-0.5-default kernel.
> 
> Using iostat -x, I can see there are about 25000 rrqm/s while there are only
> 180 r/s, so it looks like each request contains more than 100 bios, which
> makes serious trouble for the ksoftirqd callbacks.

I don't understand why you are saying request-based device-mapper causes
serious trouble.
BIO merging is done in the block layer.  Don't you see the same thing
if you use sd devices (/dev/sdX)?

Also, what do you see with the latest upstream kernel (say, 3.0-rc7)?

If you see your problem only with request-based device-mapper, please
elaborate on the points below:
    - end_clone_bio() has something like quadratic complexity.
        * What do you mean by "quadratic complexity"?
    - each request contains more than 100 bios, which makes serious
      trouble for the ksoftirqd callbacks.
        * What do you mean by "serious trouble for the ksoftirqd callbacks"?

> Without the mentioned workaround, I got only 600MB/s as the sum of all dd readers.
> With the workaround, I got about 2.8GB/s as the sum of all dd readers.

Thanks,
Kiyoshi Ueda



* Re: request based device mapper in Linux
  2011-07-21 11:11 ` Kiyoshi Ueda
@ 2011-07-21 13:26   ` Lukas Hejtmanek
  2011-07-22  6:56     ` Kiyoshi Ueda
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Hejtmanek @ 2011-07-21 13:26 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: agk, linux-kernel

On Thu, Jul 21, 2011 at 08:11:32PM +0900, Kiyoshi Ueda wrote:
> I don't understand why you are saying request-based device-mapper makes
> serious troubles.
> BIO merging is done in the block layer.  Don't you see the same thing
> if you use sd devices (/dev/sdX)?

No, if I use sd devices directly, it does not overload ksoftirqd. The problem
with overloading ksoftirqd clearly has its roots in the request-based code in
the dm layer, in particular in the dm_softirq_done() call.

> Also, what do you see with the latest upstream kernel (say, 3.0-rc7)?

I didn't check with that kernel; I don't see many changes in the
request-based code in dm.

> If you see your problem only with request-based device-mapper, please
> elaborate on the points below:
>     - end_clone_bio() has something like quadratic complexity.
>         * What do you mean by "quadratic complexity"?

end_clone_bio() calls blk_update_request(), which calls
__blk_recalc_rq_segments(), which has this code:
for_each_bio(bio) {
        bio_for_each_segment(bv, bio, i) {

In total, it seems that the *whole* bio list is traversed again and again as
some parts are done and some are not, which leads to complexity O(n^2) with
respect to the number of bios and segments. But this is just a wild guess;
the real problem could be elsewhere.

However, oprofile and sysprof show that ksoftirqd spends most of its time in
__blk_recalc_rq_segments().

>     - each request contains more than 100 bios, which makes serious
>       trouble for the ksoftirqd callbacks.
>         * What do you mean by "serious trouble for the ksoftirqd callbacks"?

"Serious trouble" means that ksoftirqd eats 100% CPU and slows down I/O
significantly (from 2.8GB/s to 500MB/s).

-- 
Lukáš Hejtmánek


* Re: request based device mapper in Linux
  2011-07-21 13:26   ` Lukas Hejtmanek
@ 2011-07-22  6:56     ` Kiyoshi Ueda
  2011-07-22  8:19       ` Lukas Hejtmanek
  0 siblings, 1 reply; 16+ messages in thread
From: Kiyoshi Ueda @ 2011-07-22  6:56 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: agk, linux-kernel

Hi Lukas,

Lukas Hejtmanek wrote:
> On Thu, Jul 21, 2011 at 08:11:32PM +0900, Kiyoshi Ueda wrote:
>> I don't understand why you are saying request-based device-mapper causes
>> serious trouble.
>> BIO merging is done in the block layer.  Don't you see the same thing
>> if you use sd devices (/dev/sdX)?
> 
> No, if I use sd devices directly, it does not overload ksoftirqd. The problem
> with overloading ksoftirqd clearly has its roots in the request-based code in
> the dm layer, in particular in the dm_softirq_done() call.
> 
>> If you see your problem only with request-based device-mapper, please
>> elaborate on the points below:
>>     - end_clone_bio() has something like quadratic complexity.
>>         * What do you mean by "quadratic complexity"?
> 
> end_clone_bio() calls blk_update_request(), which calls
> __blk_recalc_rq_segments(), which has this code:
> for_each_bio(bio) {
>         bio_for_each_segment(bv, bio, i) {
> 
> In total, it seems that the *whole* bio list is traversed again and again as
> some parts are done and some are not, which leads to complexity O(n^2) with
> respect to the number of bios and segments. But this is just a wild guess;
> the real problem could be elsewhere.
> 
> However, oprofile and sysprof show that ksoftirqd spends most of its time in
> __blk_recalc_rq_segments().

Thank you very much for the detailed explanation.
Now I understand what you mentioned, except for the following:

> ksoftirqd eats 100% CPU as soon as all available memory is used for buffers.

If the slowdown is caused only by lack of CPU power, memory usage
should not matter here.
Do you still see 100% CPU in ksoftirqd (or the slowdown) if you use
a transfer size smaller than the memory size? (e.g. 500[MB] for each dd)

> The problem with overloading ksoftirqd clearly has its roots in the request-based
> code in the dm layer, in particular in the dm_softirq_done() call.

Is that your actual trace result?
end_clone_bio(), which seems to take much time, is called
in the context of SCSI's softirq.
So if you really see that dm_softirq_done() is taking time,
there may be other problems, too.


>>     - each request contains more than 100 bios, which makes serious
>>       trouble for the ksoftirqd callbacks.
>>         * What do you mean by "serious trouble for the ksoftirqd callbacks"?
> 
> "Serious trouble" means that ksoftirqd eats 100% CPU and slows down I/O
> significantly (from 2.8GB/s to 500MB/s).

OK, at least request-based device-mapper eats more CPU/memory resources
than bio-based device-mapper due to its design, which clones not only the
request but also its bios.
So I think you need more CPUs in environments like this with lots of devices,
or you may be able to work around it by splitting each request into shorter
pieces, as you mentioned.
(How many CPUs do you have, and how fast are they?
 I just tried, but no such phenomenon can be seen in an environment
 of 10 (FC) devices and 1 CPU (Xeon(R) E5205 1.86[GHz]).)

# I will be on vacation all of next week, so I won't be able to respond
# until 8/1.  Sorry about that.

Thanks,
Kiyoshi Ueda


* Re: request based device mapper in Linux
  2011-07-22  6:56     ` Kiyoshi Ueda
@ 2011-07-22  8:19       ` Lukas Hejtmanek
  2011-07-23  7:28         ` Jun'ichi Nomura
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Hejtmanek @ 2011-07-22  8:19 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: agk, linux-kernel

On Fri, Jul 22, 2011 at 03:56:40PM +0900, Kiyoshi Ueda wrote:
> > In total, it seems that the *whole* bio list is traversed again and again as
> > some parts are done and some are not, which leads to complexity O(n^2) with
> > respect to the number of bios and segments. But this is just a wild guess;
> > the real problem could be elsewhere.
> > 
> > However, oprofile and sysprof show that ksoftirqd spends most of its time in
> > __blk_recalc_rq_segments().
> 
> Thank you very much for the detailed explanation.
> Now I understand what you mentioned, except for the following:
> 
> > ksoftirqd eats 100% CPU as soon as all available memory is used for buffers.
> 
> If the slowdown is caused only by lack of CPU power, memory usage
> should not matter here.
> Do you still see 100% CPU in ksoftirqd (or the slowdown) if you use
> a transfer size smaller than the memory size? (e.g. 500[MB] for each dd)

If the data transferred by dd fits into memory (e.g., I transfer just those
500MB in each dd), I have no problem; ksoftirqd mostly sleeps.

The problem arises as soon as there is no more free memory for buffers
(the buffers reported by 'top', for instance).

> > The problem with overloading ksoftirqd clearly has its roots in the request-based
> > code in the dm layer, in particular in the dm_softirq_done() call.
> 
> Is that your actual trace result?
> end_clone_bio(), which seems to take much time, is called
> in the context of SCSI's softirq.
> So if you really see that dm_softirq_done() is taking time,
> there may be other problems, too.

The real trace is hard to catch. I have traces from oprofile and sysprof,
but the call sequence is randomized, as the calls are too fast for
sample-based profilers.

This is a trace from sysprof, but it is not in exact order (the percentages
are not accurate and are across the whole system; ksoftirqd runs at 100% CPU
according to top):

  [ksoftirqd/0]                        0.00%  33.45%
    - - kernel - -                     0.00%  33.45%
      __blk_recalc_rq_segments        16.61%  16.61%
      _spin_unlock_irqrestore          6.17%   6.17%
      kmem_cache_free                  2.21%   2.21%
      blk_update_request               1.78%   1.80%
      end_buffer_async_read            1.40%   1.40%
      mempool_free                     0.85%   0.91%
      end_clone_bio                    0.80%   0.82%
      end_bio_bh_io_sync               0.07%   0.79%
      req_bio_endio                    0.41%   0.41%
      bio_put                          0.32%   0.33%
      bio_free                         0.31%   0.31%
      unlock_page                      0.20%   0.21%
      bio_endio                        0.14%   0.20%
      disk_map_sector_rcu              0.19%   0.19%
      __wake_up_bit                    0.15%   0.15%
      dm_rq_bio_destructor             0.09%   0.11%
      __wake_page_waiters              0.10%   0.10%
      blk_recalc_rq_segments           0.09%   0.10%
      child_rip                        0.00%   0.08%
      mempool_free_slab                0.03%   0.07%
      page_waitqueue                   0.06%   0.06%
      bio_fs_destructor                0.05%   0.05%
      multipath_end_io                 0.05%   0.05%
      scsi_finish_command              0.04%   0.04%
      dm_softirq_done                  0.03%   0.04%
      blk_done_softirq                 0.02%   0.03%
      scsi_softirq_done                0.03%   0.03%
      add_disk_randomness              0.03%   0.03%
      __wake_up                        0.03%   0.03%
      blk_rq_unprep_clone              0.02%   0.02%
      blk_end_request_all              0.02%   0.02%
      __sg_free_table                  0.02%   0.02%
      scsi_handle_queue_ramp_up        0.01%   0.01%
      kref_get                         0.01%   0.01%
      scsi_decide_disposition          0.01%   0.01%
      scsi_next_command                0.01%   0.01%
      rq_completed                     0.01%   0.01%
      blk_end_bidi_request             0.01%   0.01%
      kobject_get                      0.01%   0.01%
      __inet_lookup_established        0.01%   0.01%
      kref_put                         0.01%   0.01%
      scsi_pool_free_command           0.01%   0.01%
      scsi_run_queue                   0.01%   0.01%
      sd_done                          0.01%   0.01%
      add_timer_randomness             0.01%   0.01%
 
> >>     - each request contains more than 100 bios, which makes serious
> >>       trouble for the ksoftirqd callbacks.
> >>         * What do you mean by "serious trouble for the ksoftirqd callbacks"?
> > 
> > "Serious trouble" means that ksoftirqd eats 100% CPU and slows down I/O
> > significantly (from 2.8GB/s to 500MB/s).
> 
> OK, at least request-based device-mapper eats more CPU/memory resources
> than bio-based device-mapper due to its design, which clones not only the
> request but also its bios.
> So I think you need more CPUs in environments like this with lots of devices,
> or you may be able to work around it by splitting each request into shorter
> pieces, as you mentioned.
> (How many CPUs do you have, and how fast are they?
>  I just tried, but no such phenomenon can be seen in an environment
>  of 10 (FC) devices and 1 CPU (Xeon(R) E5205 1.86[GHz]).)

I have an E5640 @ 2.67GHz with 16 cores (8 real cores with HT).

10 devices are not enough; I cannot reproduce it with just 10 devices.
At least 20 are necessary.
 
-- 
Lukáš Hejtmánek


* Re: request based device mapper in Linux
  2011-07-22  8:19       ` Lukas Hejtmanek
@ 2011-07-23  7:28         ` Jun'ichi Nomura
  2011-07-24 22:16           ` Lukas Hejtmanek
  0 siblings, 1 reply; 16+ messages in thread
From: Jun'ichi Nomura @ 2011-07-23  7:28 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: Kiyoshi Ueda, agk, linux-kernel

Hi,

On 07/22/11 17:19, Lukas Hejtmanek wrote:
> This is a trace from sysprof, but it is not in exact order (the percentages
> are not accurate and are across the whole system; ksoftirqd runs at 100% CPU
> according to top):
> 
>   [ksoftirqd/0]                        0.00%  33.45%
>     - - kernel - -                     0.00%  33.45%
>       __blk_recalc_rq_segments        16.61%  16.61%
>       _spin_unlock_irqrestore          6.17%   6.17%
>       kmem_cache_free                  2.21%   2.21%
>       blk_update_request               1.78%   1.80%
>       end_buffer_async_read            1.40%   1.40%
...
>> (How many CPUs do you have, and how fast are they?
>>  I just tried, but no such phenomenon can be seen in an environment
>>  of 10 (FC) devices and 1 CPU (Xeon(R) E5205 1.86[GHz]).)
> 
> I have an E5640 @ 2.67GHz with 16 cores (8 real cores with HT).
> 
> 10 devices are not enough; I cannot reproduce it with just 10 devices.
> At least 20 are necessary.

How fast is the single disk performance?
Could you check /proc/interrupts and /proc/softirqs and
see how they are distributed among CPUs?
As for the memory usage, what happens if you add 'iflag=direct' to dd?

Also, is it possible for you to try the attached patch?
I would like to know whether it changes the phenomenon you see.
This patch should reduce the number of calls to recalc segments.
If it is the root cause, the patch should fix your case.
The patch is generated for 3.0 but should be easily applicable to
other versions of request-based dm.

As Kiyoshi suggested, it is important to know whether this
problem occurs with the latest kernel.
So if you could try 3.0, it would be very helpful.

# And sorry, I will not be able to respond to e-mail during the next week.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation


--- linux-3.0/drivers/md/dm.c.orig	2011-07-23 11:04:54.487100496 +0900
+++ linux-3.0/drivers/md/dm.c	2011-07-23 15:30:14.748606235 +0900
@@ -70,6 +70,7 @@ struct dm_rq_target_io {
 	struct mapped_device *md;
 	struct dm_target *ti;
 	struct request *orig, clone;
+	unsigned int done_bytes;
 	int error;
 	union map_info info;
 };
@@ -705,23 +706,8 @@ static void end_clone_bio(struct bio *cl
 
 	/*
 	 * I/O for the bio successfully completed.
-	 * Notice the data completion to the upper layer.
 	 */
-
-	/*
-	 * bios are processed from the head of the list.
-	 * So the completing bio should always be rq->bio.
-	 * If it's not, something wrong is happening.
-	 */
-	if (tio->orig->bio != bio)
-		DMERR("bio completion is going in the middle of the request");
-
-	/*
-	 * Update the original request.
-	 * Do not use blk_end_request() here, because it may complete
-	 * the original request before the clone, and break the ordering.
-	 */
-	blk_update_request(tio->orig, 0, nr_bytes);
+	tio->done_bytes += nr_bytes;
 }
 
 /*
@@ -850,6 +836,16 @@ static void dm_done(struct request *clon
 	struct dm_rq_target_io *tio = clone->end_io_data;
 	dm_request_endio_fn rq_end_io = tio->ti->type->rq_end_io;
 
+	/*
+	 * Update the original request.
+	 * Do not use blk_end_request() here, because it may complete
+	 * the original request before the clone, and break the ordering.
+	 */
+	if (tio->done_bytes) {
+		blk_update_request(tio->orig, 0, tio->done_bytes);
+		tio->done_bytes = 0;
+	}
+
 	if (mapped && rq_end_io)
 		r = rq_end_io(tio->ti, clone, error, &tio->info);
 
@@ -1507,6 +1503,7 @@ static struct request *clone_rq(struct r
 	tio->md = md;
 	tio->ti = NULL;
 	tio->orig = rq;
+	tio->done_bytes = 0;
 	tio->error = 0;
 	memset(&tio->info, 0, sizeof(tio->info));
 


* Re: request based device mapper in Linux
  2011-07-23  7:28         ` Jun'ichi Nomura
@ 2011-07-24 22:16           ` Lukas Hejtmanek
  2011-08-01  9:31             ` Kiyoshi Ueda
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Hejtmanek @ 2011-07-24 22:16 UTC (permalink / raw)
  To: Jun'ichi Nomura; +Cc: Kiyoshi Ueda, agk, linux-kernel

On Sat, Jul 23, 2011 at 04:28:06PM +0900, Jun'ichi Nomura wrote:
> How fast is the single disk performance?

About 120MB/s if I run dd on a single disk only.

> Could you check /proc/interrupts and /proc/softirqs and
> see how they are distributed among CPUs?

I don't see anything strange, except that function call interrupts hit a
single CPU (the one running that ksoftirqd).

> As for the memory usage, what happens if you add 'iflag=direct' to dd?

Then ksoftirqd does not eat a significant amount of CPU.


> Also, is it possible for you to try the attached patch?
> I would like to know whether it changes the phenomenon you see.
> This patch should reduce the number of calls to recalc segments.
> If it is the root cause, the patch should fix your case.
> The patch is generated for 3.0 but should be easily applicable to
> other versions of request-based dm.

The patch fixed the problem. I didn't test whether it broke anything
else, though.

> As Kiyoshi suggested, it is important to know whether this
> problem occurs with the latest kernel.
> So if you could try 3.0, it would be very helpful.

So, do I still need to check a recent kernel, given that the patch works for me?

-- 
Lukáš Hejtmánek


* Re: request based device mapper in Linux
  2011-07-24 22:16           ` Lukas Hejtmanek
@ 2011-08-01  9:31             ` Kiyoshi Ueda
  2011-09-08 13:27               ` Lukas Hejtmanek
  0 siblings, 1 reply; 16+ messages in thread
From: Kiyoshi Ueda @ 2011-08-01  9:31 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: Jun'ichi Nomura, agk, linux-kernel

Hi Lukas,

Lukas Hejtmanek wrote:
> On Sat, Jul 23, 2011 at 04:28:06PM +0900, Jun'ichi Nomura wrote:
>> As Kiyoshi suggested, it is important to know whether this
>> problem occurs with the latest kernel.
>> So if you could try 3.0, it would be very helpful.
> 
> So, do I still need to check a recent kernel, given that the patch works for me?

Yes, please.
(Unfortunately, my storage broke today :-<)

CPU usage seems to be one of the reasons in your case, but I
still cannot understand why memory pressure is needed.
I want to know whether your problem also occurs on 3.0 without the patch.

Thanks,
Kiyoshi Ueda


* Re: request based device mapper in Linux
  2011-08-01  9:31             ` Kiyoshi Ueda
@ 2011-09-08 13:27               ` Lukas Hejtmanek
  2011-09-15 18:49                 ` Mike Snitzer
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Hejtmanek @ 2011-09-08 13:27 UTC (permalink / raw)
  To: Kiyoshi Ueda; +Cc: Jun'ichi Nomura, agk, linux-kernel

On Mon, Aug 01, 2011 at 06:31:02PM +0900, Kiyoshi Ueda wrote:
> Hi Lukas,
> 
> Lukas Hejtmanek wrote:
> > On Sat, Jul 23, 2011 at 04:28:06PM +0900, Jun'ichi Nomura wrote:
> >> As Kiyoshi suggested, it is important to know whether this
> >> problem occurs with the latest kernel.
> >> So if you could try 3.0, it would be very helpful.
> > 
> > So, do I still need to check a recent kernel, given that the patch works for me?
> 
> Yes, please.
> (Unfortunately, my storage broke today :-<)
> 
> CPU usage seems to be one of the reasons in your case, but I
> still cannot understand why memory pressure is needed.
> I want to know whether your problem also occurs on 3.0 without the patch.

Well, I tried the 3.0.3 kernel. It is not as bad as 2.6.32.36, but it is not
as good as the same kernel with your patch.

CPU load is caused by ksoftirqd (30-40%) and kworker (30%).

Overall speed is about 2.5GB/s. With your patch I can reach 2.8-2.9GB/s.

-- 
Lukáš Hejtmánek


* Re: request based device mapper in Linux
  2011-09-08 13:27               ` Lukas Hejtmanek
@ 2011-09-15 18:49                 ` Mike Snitzer
  2011-09-16 14:08                   ` Lukas Hejtmanek
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Snitzer @ 2011-09-15 18:49 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: Kiyoshi Ueda, Jun'ichi Nomura, agk, linux-kernel

2011/9/8 Lukas Hejtmanek <xhejtman@ics.muni.cz>:
> On Mon, Aug 01, 2011 at 06:31:02PM +0900, Kiyoshi Ueda wrote:
>> Hi Lukas,
>>
>> Lukas Hejtmanek wrote:
>> > On Sat, Jul 23, 2011 at 04:28:06PM +0900, Jun'ichi Nomura wrote:
>> >> As Kiyoshi suggested, it is important to know whether this
>> >> problem occurs with the latest kernel.
>> >> So if you could try 3.0, it would be very helpful.
>> >
>> > So, do I still need to check a recent kernel, given that the patch works for me?
>>
>> Yes, please.
>> (Unfortunately, my storage broke today :-<)
>>
>> CPU usage seems to be one of the reasons in your case, but I
>> still cannot understand why memory pressure is needed.
>> I want to know whether your problem also occurs on 3.0 without the patch.
>
> Well, I tried the 3.0.3 kernel. It is not as bad as 2.6.32.36, but it is not
> as good as the same kernel with your patch.
>
> CPU load is caused by ksoftirqd (30-40%) and kworker (30%).
>
> Overall speed is about 2.5GB/s. With your patch I can reach 2.8-2.9GB/s.

Does "with your patch" mean against 3.0.3 or the original 2.6.32 kernel?

If you apply the patch to the 3.0.3 kernel do you regain the performance?

Thanks,
Mike


* Re: request based device mapper in Linux
  2011-09-15 18:49                 ` Mike Snitzer
@ 2011-09-16 14:08                   ` Lukas Hejtmanek
  2011-09-19  5:50                     ` Jun'ichi Nomura
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Hejtmanek @ 2011-09-16 14:08 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Kiyoshi Ueda, Jun'ichi Nomura, agk, linux-kernel

On Thu, Sep 15, 2011 at 02:49:52PM -0400, Mike Snitzer wrote:
> Does "with your patch" mean against 3.0.3 or the original 2.6.32 kernel?
> 
> If you apply the patch to the 3.0.3 kernel do you regain the performance?

Unfortunately, the patch does not help with the 3.0.3 kernel.

top looks like this:
   63 root      20   0     0    0    0 S   38  0.0   0:16.65 kswapd1                                      
   18 root      20   0     0    0    0 R   32  0.0   0:14.38 kworker/4:0                                  
   19 root      20   0     0    0    0 R   30  0.0   0:20.88 ksoftirqd/4                                  
  153 root      20   0     0    0    0 R   29  0.0   0:14.46 kworker/4:2                                  
   62 root      20   0     0    0    0 S   22  0.0   0:09.44 kswapd0                                      
16726 root      20   0  9484 1628  588 D   11  0.0   0:03.94 dd                                           
16721 root      20   0  9484 1628  588 D   11  0.0   0:03.87 dd                                           
16723 root      20   0  9484 1624  588 D   11  0.0   0:04.02 dd                                           
16725 root      20   0  9484 1628  588 D   11  0.0   0:03.89 dd                                           
16727 root      20   0  9484 1628  588 D   11  0.0   0:03.91 dd                                           
16728 root      20   0  9484 1624  588 D   11  0.0   0:03.89 dd                                           
16729 root      20   0  9484 1624  588 D   11  0.0   0:03.89 dd                                           
16724 root      20   0  9484 1624  588 D   11  0.0   0:03.93 dd                                           
16674 root      20   0  9484 1624  588 R    6  0.0   0:01.99 dd                                           
16685 root      20   0  9484 1624  588 D    6  0.0   0:02.00 dd                                           
16654 root      20   0  9484 1628  588 D    5  0.0   0:01.55 dd                                           
16662 root      20   0  9484 1588  588 D    5  0.0   0:01.53 dd                                           
16665 root      20   0  9484 1628  588 D    5  0.0   0:01.57 dd 

with patched 2.6.32, it looks like:
17330 root      20   0  9432 1752  616 D    4  0.0   0:00.34 dd                                           
17335 root      20   0  9432 1756  616 D    4  0.0   0:00.35 dd                                           
17316 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17318 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17322 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17323 root      20   0  9432 1756  616 D    4  0.0   0:00.33 dd                                           
17324 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17325 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17326 root      20   0  9432 1756  616 D    4  0.0   0:00.33 dd                                           
17327 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17328 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17329 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17333 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17336 root      20   0  9432 1756  616 D    4  0.0   0:00.35 dd                                           
17344 root      20   0  9432 1756  616 D    4  0.0   0:00.32 dd                                           
17317 root      20   0  9432 1756  616 D    4  0.0   0:00.33 dd                                           
17319 root      20   0  9432 1752  616 D    4  0.0   0:00.34 dd                                           
17320 root      20   0  9432 1756  616 D    4  0.0   0:00.33 dd                                           
17321 root      20   0  9432 1756  616 D    4  0.0   0:00.33 dd                                           
17331 root      20   0  9432 1756  616 R    4  0.0   0:00.33 dd                                           
17332 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd                                           
17334 root      20   0  9432 1756  616 D    4  0.0   0:00.34 dd 

unpatched 2.6.32 looks like:
    4 root      20   0     0    0    0 R  100  0.0   0:25.60 ksoftirqd/0                                  
19325 root      20   0  9432 1756  616 D   48  0.0   0:12.51 dd                                           
19294 root      20   0  9432 1756  616 D   11  0.0   0:01.09 dd                                           
19323 root      20   0  9432 1756  616 D    4  0.0   0:01.02 dd                                           
19295 root      20   0  9432 1756  616 D    2  0.0   0:02.73 dd                                           
19305 root      20   0  9432 1756  616 D    2  0.0   0:00.53 dd                                           
19327 root      20   0  9432 1760  616 D    2  0.0   0:00.65 dd                                           
19328 root      20   0  9432 1752  616 D    2  0.0   0:00.60 dd                                           
19336 root      20   0  9432 1752  616 D    2  0.0   0:00.49 dd                                           
19350 root      20   0  9432 1756  616 D    2  0.0   0:00.55 dd                                           
19366 root      20   0  9432 1752  616 D    2  0.0   0:00.50 dd                                           
19292 root      20   0  9432 1756  616 D    2  0.0   0:00.65 dd                                           
19300 root      20   0  9432 1756  616 D    2  0.0   0:00.51 dd                                           
19303 root      20   0  9432 1752  616 D    2  0.0   0:00.59 dd                                           
19307 root      20   0  9432 1756  616 D    2  0.0   0:00.71 dd                                           
19318 root      20   0  9432 1756  616 R    2  0.0   0:00.46 dd                                           
19321 root      20   0  9432 1756  616 D    2  0.0   0:00.47 dd                                           
19332 root      20   0  9432 1756  616 D    2  0.0   0:00.49 dd                                           
19334 root      20   0  9432 1760  616 D    2  0.0   0:00.49 dd                                           
19339 root      20   0  9432 1756  616 D    2  0.0   0:00.47 dd                                           
19348 root      20   0  9432 1756  616 D    2  0.0   0:00.48 dd                                           
19355 root      20   0  9432 1756  616 D    2  0.0   0:00.47 dd             

-- 
Lukáš Hejtmánek


* Re: request based device mapper in Linux
  2011-09-16 14:08                   ` Lukas Hejtmanek
@ 2011-09-19  5:50                     ` Jun'ichi Nomura
  2011-09-29 20:57                       ` Lukas Hejtmanek
  0 siblings, 1 reply; 16+ messages in thread
From: Jun'ichi Nomura @ 2011-09-19  5:50 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: Mike Snitzer, Kiyoshi Ueda, agk, linux-kernel

Hi Lukas,

On 09/16/11 23:08, Lukas Hejtmanek wrote:
> On Thu, Sep 15, 2011 at 02:49:52PM -0400, Mike Snitzer wrote:
>> Does "with your patch" mean against 3.0.3 or the original 2.6.32 kernel?
>>
>> If you apply the patch to the 3.0.3 kernel do you regain the performance?
> 
> Unfortunately, the patch does not help with the 3.0.3 kernel.

Thanks for the additional testing.

So:

  2.6.32.36
    no-multipath:        2.8 GB/s
    multipath:           600 MB/s
    multipath w/ patch:  2.8-2.9 GB/s

  3.0.3
    no-multipath:        ??
    multipath:           2.5 GB/s
    multipath w/ patch:  2.5 GB/s(?)

Have you tried 3.0.3 without multipath?

> top looks like this:
>    63 root      20   0     0    0    0 S   38  0.0   0:16.65 kswapd1
>    18 root      20   0     0    0    0 R   32  0.0   0:14.38 kworker/4:0
>    19 root      20   0     0    0    0 R   30  0.0   0:20.88 ksoftirqd/4
>   153 root      20   0     0    0    0 R   29  0.0   0:14.46 kworker/4:2
>    62 root      20   0     0    0    0 S   22  0.0   0:09.44 kswapd0

dm-multipath itself does not use workqueue for normal I/Os.
So you might be hitting a different problem.

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation


* Re: request based device mapper in Linux
  2011-09-19  5:50                     ` Jun'ichi Nomura
@ 2011-09-29 20:57                       ` Lukas Hejtmanek
  2011-10-05  8:13                         ` Jun'ichi Nomura
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Hejtmanek @ 2011-09-29 20:57 UTC (permalink / raw)
  To: Jun'ichi Nomura; +Cc: Mike Snitzer, Kiyoshi Ueda, agk, linux-kernel

On Mon, Sep 19, 2011 at 02:50:08PM +0900, Jun'ichi Nomura wrote:
>   2.6.32.36
>     no-multipath:        2.8 GB/s
>     multipath:           600 MB/s
>     multipath w/ patch:  2.8-2.9 GB/s
> 
>   3.0.3
>     no-multipath:        ??
>     multipath:           2.5 GB/s
>     multipath w/ patch:  2.5 GB/s(?)
> 
> Have you tried 3.0.3 without multipath?

Yes, 3 GB/s, and only kswapd0 and kswapd1 are running; no kworker or ksoftirqd.
 
> > top looks like this:
> >    63 root      20   0     0    0    0 S   38  0.0   0:16.65 kswapd1
> >    18 root      20   0     0    0    0 R   32  0.0   0:14.38 kworker/4:0
> >    19 root      20   0     0    0    0 R   30  0.0   0:20.88 ksoftirqd/4
> >   153 root      20   0     0    0    0 R   29  0.0   0:14.46 kworker/4:2
> >    62 root      20   0     0    0    0 S   22  0.0   0:09.44 kswapd0
> 
> dm-multipath itself does not use workqueue for normal I/Os.
> So you might be hitting a different problem.

-- 
Lukáš Hejtmánek


* Re: request based device mapper in Linux
  2011-09-29 20:57                       ` Lukas Hejtmanek
@ 2011-10-05  8:13                         ` Jun'ichi Nomura
  2011-10-05 10:35                           ` Lukas Hejtmanek
  0 siblings, 1 reply; 16+ messages in thread
From: Jun'ichi Nomura @ 2011-10-05  8:13 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: Mike Snitzer, Kiyoshi Ueda, agk, linux-kernel

Hi Lukas,

On 09/30/11 05:57, Lukas Hejtmanek wrote:
> On Mon, Sep 19, 2011 at 02:50:08PM +0900, Jun'ichi Nomura wrote:
>>   2.6.32.36
>>     no-multipath:        2.8 GB/s
>>     multipath:           600 MB/s
>>     multipath w/ patch:  2.8-2.9 GB/s
>>
>>   3.0.3
>>     no-multipath:        ??
>>     multipath:           2.5 GB/s
>>     multipath w/ patch:  2.5 GB/s(?)
>>
>> Have you tried 3.0.3 without multipath?
> 
> Yes, 3 GB/s, and only kswapd0 and kswapd1 are running; no kworker or ksoftirqd.

Hmm.. did you find any difference in your profile this time?

I'm trying to reproduce it myself but no success so far
(perhaps disks are not fast enough to saturate CPU on my test machine).

As ksoftirqd in top implies, your CPU4 gets too many I/O completions;
'rq_affinity = 2' for both dm and SCSI devices might be a solution.
It'll distribute block completion softirqs to the submitters and possibly
reduce the load on the first CPU in the socket.
(See the commit below. It's a new feature of 3.1, not available in 3.0.)

  commit 5757a6d76cdf6dda2a492c09b985c015e86779b1
  Author: Dan Williams <dan.j.williams@intel.com>
  Date:   Sat Jul 23 20:44:25 2011 +0200

    block: strict rq_affinity
    
    Some systems benefit from completions always being steered to the strict
    requester cpu rather than the looser "per-socket" steering that
    blk_cpu_to_group() attempts by default. This is because the first
    CPU in the group mask ends up being completely overloaded with work,
    while the others (including the original submitter) has power left
    to spare.
    
    Allow the strict mode to be set by writing '2' to the sysfs control
    file. This is identical to the scheme used for the nomerges file,
    where '2' is a more aggressive setting than just being turned on.
    
    echo 2 > /sys/block/<bdev>/queue/rq_affinity
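Applied to a setup like this one, the knob would be set on every dm and SCSI disk queue in one pass. A minimal sketch (mine, not from the thread): it assumes only the standard /sys/block sysfs layout, and the base directory is a parameter so the loop can be exercised against a test tree; on a live system it would be /sys/block, run as root, and the setting does not persist across reboots.

```shell
# set_strict_affinity DIR - write 2 to queue/rq_affinity for every dm-*
# and sd* device directory below DIR (DIR is /sys/block on a live system).
set_strict_affinity() {
    for dev in "$1"/dm-* "$1"/sd*; do
        q="$dev/queue/rq_affinity"
        [ -w "$q" ] || continue   # skip devices that lack the knob
        echo 2 > "$q"
    done
}

# On a live system (as root):
#   set_strict_affinity /sys/block
```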

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation


* Re: request based device mapper in Linux
  2011-10-05  8:13                         ` Jun'ichi Nomura
@ 2011-10-05 10:35                           ` Lukas Hejtmanek
  2011-10-06  5:11                             ` Jun'ichi Nomura
  0 siblings, 1 reply; 16+ messages in thread
From: Lukas Hejtmanek @ 2011-10-05 10:35 UTC (permalink / raw)
  To: Jun'ichi Nomura; +Cc: Mike Snitzer, Kiyoshi Ueda, agk, linux-kernel

On Wed, Oct 05, 2011 at 05:13:36PM +0900, Jun'ichi Nomura wrote:
> > Yes, 3 GB/s, and only kswapd0 and kswapd1 are running; no kworker or ksoftirqd.
> 
> Hmm.. did you find any difference in your profile this time?

Not sure what you mean.

> I'm trying to reproduce it myself but no success so far
> (perhaps disks are not fast enough to saturate CPU on my test machine).

Hmm, I have 80 SAS 2.0 disks and two E5640 @ 2.67 GHz CPUs.
 
> As ksoftirqd in top implies, your CPU4 gets too many I/O completions;
> 'rq_affinity = 2' for both dm and SCSI devices might be a solution.
> It'll distribute block completion softirqs to the submitters and possibly
> reduce the load on the first CPU in the socket.
> (See the commit below. It's a new feature of 3.1, not available in 3.0.)

So what next? Should I try the 3.1 kernel with this patch applied?

-- 
Lukáš Hejtmánek


* Re: request based device mapper in Linux
  2011-10-05 10:35                           ` Lukas Hejtmanek
@ 2011-10-06  5:11                             ` Jun'ichi Nomura
  0 siblings, 0 replies; 16+ messages in thread
From: Jun'ichi Nomura @ 2011-10-06  5:11 UTC (permalink / raw)
  To: Lukas Hejtmanek; +Cc: Mike Snitzer, Kiyoshi Ueda, agk, linux-kernel

Hi Lukas,

On 10/05/11 19:35, Lukas Hejtmanek wrote:
> On Wed, Oct 05, 2011 at 05:13:36PM +0900, Jun'ichi Nomura wrote:
>>> Yes, 3 GB/s, and only kswapd0 and kswapd1 are running; no kworker or ksoftirqd.
>>
>> Hmm.. did you find any difference in your profile this time?
> 
> not sure what do you mean.

With the SLES 2.6.32.36-0.5-default kernel, you found (using
sysprof/oprofile) that ksoftirqd spent most of its time in
__blk_recalc_rq_segments. That's why my patch was effective there.

My question is whether you see such a difference between
no-multipath and multipath in the profile data for 3.0.3 (without my patch).

>> I'm trying to reproduce it myself but no success so far
>> (perhaps disks are not fast enough to saturate CPU on my test machine).
> 
> hmm, I have 80 SAS 2.0 disks and two E5640 @ 2.67GHz cpus. 
>  
>> As ksoftirqd in top implies, your CPU4 gets too many I/O completions;
>> 'rq_affinity = 2' for both dm and SCSI devices might be a solution.
>> It'll distribute block completion softirqs to the submitters and possibly
>> reduce the load on the first CPU in the socket.
>> (See the commit below. It's a new feature of 3.1, not available in 3.0.)
> 
> So what next? Should I try 3.1 kernel with this patch applied?

Please try 3.1 (without my patch) + 'rq_affinity = 2',
on both multipath and no-multipath.

If you still see a performance difference and ksoftirqd
spends most of its time in __blk_recalc_rq_segments,
try 3.1 + my patch + 'rq_affinity = 2'.
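One quick way to see whether block completions really pile up on one CPU is the BLOCK row of /proc/softirqs. A small helper for that (my sketch, not something from the thread; it assumes only the standard /proc/softirqs column layout):

```shell
# show_block_softirqs FILE - print the per-CPU BLOCK softirq counts from a
# /proc/softirqs-style file; one heavily skewed column indicates that block
# completions are being steered to a single CPU.
show_block_softirqs() {
    awk '/^[ \t]*BLOCK:/ { for (i = 2; i <= NF; i++) printf "CPU%d %s\n", i - 2, $i }' "$1"
}

# On a live system:
#   show_block_softirqs /proc/softirqs
```

Sampling it twice a few seconds apart during the dd run and diffing the per-CPU counts shows where the completion work lands; with rq_affinity = 2 the counts should spread across the submitting CPUs instead of concentrating on the first CPU of each socket.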

Thanks,
-- 
Jun'ichi Nomura, NEC Corporation


end of thread, other threads:[~2011-10-06  5:12 UTC | newest]

Thread overview: 16+ messages
2011-07-20  8:26 request based device mapper in Linux Lukas Hejtmanek
2011-07-21 11:11 ` Kiyoshi Ueda
2011-07-21 13:26   ` Lukas Hejtmanek
2011-07-22  6:56     ` Kiyoshi Ueda
2011-07-22  8:19       ` Lukas Hejtmanek
2011-07-23  7:28         ` Jun'ichi Nomura
2011-07-24 22:16           ` Lukas Hejtmanek
2011-08-01  9:31             ` Kiyoshi Ueda
2011-09-08 13:27               ` Lukas Hejtmanek
2011-09-15 18:49                 ` Mike Snitzer
2011-09-16 14:08                   ` Lukas Hejtmanek
2011-09-19  5:50                     ` Jun'ichi Nomura
2011-09-29 20:57                       ` Lukas Hejtmanek
2011-10-05  8:13                         ` Jun'ichi Nomura
2011-10-05 10:35                           ` Lukas Hejtmanek
2011-10-06  5:11                             ` Jun'ichi Nomura
