From: Andrea Righi <andrea@betterlinux.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-kernel@vger.kernel.org, jaxboe@fusionio.com,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 0/8][V2] blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()
Date: Wed, 29 Jun 2011 18:05:32 +0200
Message-ID: <20110629160532.GA1255@thinkpad>
In-Reply-To: <20110628170624.GA12949@redhat.com>

On Tue, Jun 28, 2011 at 01:06:24PM -0400, Vivek Goyal wrote:
> On Tue, Jun 28, 2011 at 06:21:38PM +0200, Andrea Righi wrote:
> > On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> > > Hi,
> > > 
> > > This is V2 of the patches. First version is posted here.
> > > 
> > > https://lkml.org/lkml/2011/6/3/375
> > > 
> > > There are no changes from first version except that I have rebased it to
> > > for-3.1/core branch of Jens's block tree.
> > > 
> > > I have been trying to find ways to solve two problems with block IO controller
> > > cgroups.
> > > 
> > > - Current throttling logic in IO controller does not throttle buffered WRITES.
> > >   Well it does throttle all the WRITEs at device and by that time buffered
> > >   WRITE have lost the submitter's context and most of the IO comes in flusher
> > >   thread's context at device. Hence currently buffered write throttling is
> > >   not supported.
> > > 
> > > - All WRITEs are throttled at device level and this can easily lead to
> > >   filesystem serialization.
> > > 
> > >   One simple example is that if a process writes some pages to cache and
> > >   then does fsync(), and process gets throttled then it locks up the
> > >   filesystem. With ext4, I noticed that even a simple "ls" does not make
> > >   progress. The reason boils down to the fact that filesystems are not
> > >   aware of cgroups and one of the things which get serialized is journalling
> > >   in ordered mode.
> > > 
> > >   So even if we do something to carry submitter's cgroup information
> > >   to device and do throttling there, it will lead to serialization of
> > >   filesystems and is not a good idea.
> > > 
> > > So how to go about fixing it. There seem to be two options.
> > > 
> > > - Throttling should still be done at device level. Make filesystems aware
> > >   of cgroups so that multiple transactions can make progress in parallel
> > >   (per cgroup) and there are no shared resources across cgroups in
> > >   filesystems which can lead to serialization.
> > > 
> > > - Throttle WRITEs while they are entering the cache and not after that.
> > >   Something like balance_dirty_pages(). Direct IO is still throttled
> > >   at device level. That way, we can avoid these journalling related
> > >   serialization issues w.r.t. throttling.
> > 
> > I think that O_DIRECT WRITEs can hit the same serialization problem if
> > we throttle them at device level.
> 
> I think it can but number of cases probably comes down significantly. One
> of the main problems seems to be sync related variants sync/fsync etc.
> And I think we do not make any guarantees for inflight requests
> (not completed yet).
> 
> So it will boil down to how dependent these sync primitives are on
> inflight direct WRITEs. I did basic testing with ext4 and it looked fine.
> On XFS, sync gets blocked behind inflight direct writes. Last time I
> raised that issue and looks like Christoph has plans to do something
> about it.
> 
> So currently my understanding is that dependency on direct writes might
> not be a major issue in practice. (Until and unless there is more to
> it I am not aware about).
> 
> > 
> > Have you tried to do some tests? (i.e. create multiple cgroups with very
> > low I/O limit doing parallel O_DIRECT WRITEs, and try to run at the same
> > time "ls" or other simple commands from the root cgroup or unlimited
> > cgroup).
> 
> I did. On ext4, I created a cgroup with limit 1 byte per second and
> started a direct write and did "ls", "sync" and some directory traversal
> operations in same directory and it seems to work.

Confirmed. Everything seems to work fine on my side as well.

Tested-by: Andrea Righi <andrea@betterlinux.com>

FYI, this is the script I used for testing, in case you're interested.
I tested both with O_DIRECT=1 and O_DIRECT=0.

-Andrea

---
#!/bin/bash
#
# blkio.throttle unit test
#
# This script creates many cgroups and spawns many parallel IO workers inside
# each cgroup.

# cgroupfs mount point
CGROUP_MP=/sys/fs/cgroup/blkio
# temporary directory used to generate IO
TMPDIR=/tmp

# how many cgroups?
CGROUPS=16
# how many IO workers per cgroup?
WORKERS=16
# max IO bandwidth of each cgroup
BW_MAX=$((1 * 1024 * 1024))
# max IO operations per second of each cgroup
IOPS_MAX=0

# IO block size
IO_BLOCK_SIZE=$((1 * 1024 * 1024))
# how many blocks to read/write (for each worker)
IO_BLOCK_NUM=4
# how many times each worker has to repeat the IO operation?
IO_COUNT=16

# enable O_DIRECT?
O_DIRECT=0

# timeout after which a task is considered blocked for too long and a
# message is dumped in the kernel log (set to 0 to disable this check)
HUNG_TASK_TIMEOUT=60

cleanup_handler() {
	pkill sleep
	pkill dd
	echo "terminating..."
	sleep 10
	rmdir $CGROUP_MP/grp_*
	rm -rf $TMPDIR/grp_*
	sleep 1
	exit 1
}

worker() {
	out=$1

	if [ "z$O_DIRECT" = "z1" ]; then
		out_flags=oflag=direct
		in_flags=iflag=direct
	else
		out_flags=
		in_flags=
	fi
	sleep 5
	for i in `seq 1 $IO_COUNT`; do
		dd if=/dev/zero of=$out \
			bs=$IO_BLOCK_SIZE count=$IO_BLOCK_NUM \
			$out_flags 2>/dev/null
	done
	for i in `seq 1 $IO_COUNT`; do
		dd if=$out of=/dev/null \
			bs=$IO_BLOCK_SIZE count=$IO_BLOCK_NUM \
			$in_flags 2>/dev/null
	done
	rm -f $out
	unset out
}

spawn_workers() {
	grp=$1
	device=`df $TMPDIR | sed '1d' | awk '{print $1}' | sed 's/[0-9]$//'`
	devnum=`grep $(basename $device)$ /proc/partitions | awk '{print $1":"$2}'`

	mkdir $CGROUP_MP/$grp

	echo $devnum $BW_MAX > $CGROUP_MP/$grp/blkio.throttle.read_bps_device
	echo $devnum $BW_MAX > $CGROUP_MP/$grp/blkio.throttle.write_bps_device

	echo $devnum $IOPS_MAX > $CGROUP_MP/$grp/blkio.throttle.read_iops_device
	echo $devnum $IOPS_MAX > $CGROUP_MP/$grp/blkio.throttle.write_iops_device

	mkdir -p $TMPDIR/$grp
	for i in `seq 1 $WORKERS`; do
		worker $TMPDIR/$grp/zero$i &
		echo $! > $CGROUP_MP/$grp/tasks
	done
	for i in `seq 1 $WORKERS`; do
		wait
	done
	rmdir $TMPDIR/$grp
	rmdir $CGROUP_MP/$grp
	unset grp
}

# mount cgroupfs
mount -t cgroup -o blkio none $CGROUP_MP

# set hung task check timeout (helps to catch system-wide lockups)
echo $HUNG_TASK_TIMEOUT > /proc/sys/kernel/hung_task_timeout_secs

# invalidate page cache
sync
echo 3 > /proc/sys/vm/drop_caches

# show expected bandwidth
bw=$(($CGROUPS * $BW_MAX / 1024 / 1024))
space=$(($CGROUPS * $WORKERS * $IO_BLOCK_SIZE * $IO_BLOCK_NUM / 1024 / 1024))
echo -ne "\n\n"
echo creating $CGROUPS cgroups, $WORKERS tasks per cgroup, bw=$BW_MAX
echo required disk space: $space MiB
echo expected average bandwidth: $bw MiB/s
echo -ne "\n\n"

# trap SIGINT and SIGTERM to quit cleanly
trap cleanup_handler SIGINT SIGTERM

# run workers
for i in `seq 1 $CGROUPS`; do
	spawn_workers grp_$i &
done

# wait for the workers to complete
for i in `seq 1 $CGROUPS`; do
	wait
done

echo "test completed."
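
As a rough sanity check on the script's parameters, the time one throttled
phase (all writes, or all reads) should take in a single cgroup can be
estimated by dividing the bytes moved by the cgroup's bandwidth limit. The
sketch below is not part of the original script: the function name
estimate_phase_seconds is made up for this example, and it assumes every byte
of IO is actually charged against the limit (buffered reads may be served from
the page cache and never reach the throttle, so the read phase can finish
sooner).

```bash
#!/bin/bash
# Sketch only, not part of the test script above. Arguments mirror the
# script's WORKERS, IO_COUNT, IO_BLOCK_NUM, IO_BLOCK_SIZE and BW_MAX.
# Assumption: all IO issued by the workers is charged against BW_MAX.
estimate_phase_seconds() {
	local workers=$1 io_count=$2 block_num=$3 block_size=$4 bw_max=$5
	# total bytes moved by all workers of one cgroup in one phase
	local bytes=$((workers * io_count * block_num * block_size))
	# round up so that a partial second still counts as a full one
	echo $(( (bytes + bw_max - 1) / bw_max ))
}

# defaults used in the script: 16 workers, 16 passes, 4 blocks of 1 MiB,
# 1 MiB/s per cgroup
estimate_phase_seconds 16 16 4 $((1024 * 1024)) $((1024 * 1024))
```

With the defaults this prints 1024, i.e. about 17 minutes per phase, which is
well above HUNG_TASK_TIMEOUT=60. A long total runtime is therefore expected
and is not by itself a sign of a lockup; what matters is whether any single
task stays blocked longer than the hung task timeout.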
