linux-fsdevel.vger.kernel.org archive mirror
From: Andrea Righi <andrea@betterlinux.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux-kernel@vger.kernel.org, jaxboe@fusionio.com,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 0/8][V2] blk-throttle: Throttle buffered WRITEs in balance_dirty_pages()
Date: Wed, 29 Jun 2011 18:05:32 +0200
Message-ID: <20110629160532.GA1255@thinkpad>
In-Reply-To: <20110628170624.GA12949@redhat.com>

On Tue, Jun 28, 2011 at 01:06:24PM -0400, Vivek Goyal wrote:
> On Tue, Jun 28, 2011 at 06:21:38PM +0200, Andrea Righi wrote:
> > On Tue, Jun 28, 2011 at 11:35:01AM -0400, Vivek Goyal wrote:
> > > Hi,
> > > 
> > > This is V2 of the patches. First version is posted here.
> > > 
> > > https://lkml.org/lkml/2011/6/3/375
> > > 
> > > There are no changes from first version except that I have rebased it to
> > > for-3.1/core branch of Jens's block tree.
> > > 
> > > I have been trying to find ways to solve two problems with block IO controller
> > > cgroups.
> > > 
> > > - Current throttling logic in IO controller does not throttle buffered WRITES.
> > >   Well it does throttle all the WRITEs at device and by that time buffered
> > >   WRITE have lost the submitter's context and most of the IO comes in flusher
> > >   thread's context at device. Hence currently buffered write throttling is
> > >   not supported.
> > > 
> > > - All WRITEs are throttled at device level and this can easily lead to
> > >   filesystem serialization.
> > > 
> > >   One simple example is that if a process writes some pages to cache and
> > >   then does fsync(), and process gets throttled then it locks up the
> > >   filesystem. With ext4, I noticed that even a simple "ls" does not make
> > >   progress. The reason boils down to the fact that filesystems are not
> > >   aware of cgroups and one of the things which get serialized is journalling
> > >   in ordered mode.
> > > 
> > >   So even if we do something to carry submitter's cgroup information
> > >   to device and do throttling there, it will lead to serialization of
> > >   filesystems and is not a good idea.
> > > 
> > > So how to go about fixing it. There seem to be two options.
> > > 
> > > - Throttling should still be done at device level. Make filesystems aware
> > >   of cgroups so that multiple transactions can make progress in parallel
> > >   (per cgroup) and there are no shared resources across cgroups in
> > >   filesystems which can lead to serialization.
> > > 
> > > - Throttle WRITEs while they are entering the cache and not after that.
> > >   Something like balance_dirty_pages(). Direct IO is still throttled
> > >   at device level. That way, we can avoid these journalling related
> > >   serialization issues w.r.t. throttling.
> > 
> > I think that O_DIRECT WRITEs can hit the same serialization problem if
> > we throttle them at device level.
> 
> I think it can but number of cases probably comes down significantly. One
> of the main problems seems to be sync related variants sync/fsync etc.
> And I think we do not make any guarantees for inflight requests
> (not completed yet).
> 
> So it will boil down to how dependent these sync primitives are on
> inflight direct WRITEs. I did basic testing with ext4 and it looked fine.
> On XFS, sync gets blocked behind inflight direct writes. Last time I
> raised that issue and looks like Christoph has plans to do something
> about it.
> 
> So currently my understanding is that dependency on direct writes might
> not be a major issue in practice. (Until and unless there is more to
> it I am not aware about).
> 
> > 
> > Have you tried to do some tests? (i.e. create multiple cgroups with very
> > low I/O limit doing parallel O_DIRECT WRITEs, and try to run at the same
> > time "ls" or other simple commands from the root cgroup or unlimited
> > cgroup).
> 
> I did. On ext4, I created a cgroup with a limit of 1 byte per second,
> started a direct write, and did "ls", "sync" and some directory traversal
> operations in the same directory, and it seems to work.

Confirmed. Everything seems to work fine on my side as well.
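
For reference, the single-cgroup check described above can be sketched
roughly as follows. This is only an illustration under assumptions, not
the exact commands used in this thread: it assumes the cgroup v1 blkio
layout, and the function names, the group name "slow", the device number
8:0 and the /mnt/test path are all hypothetical.

```shell
#!/bin/bash
# Sketch: create one blkio cgroup with a write bandwidth limit.
# $1 cgroup mount point, $2 group name, $3 major:minor, $4 limit in bytes/s
setup_throttled_group() {
	local mp=$1 grp=$2 dev=$3 bps=$4
	mkdir -p "$mp/$grp" || return 1
	# rule format is "major:minor bytes_per_second"
	echo "$dev $bps" > "$mp/$grp/blkio.throttle.write_bps_device"
}

# Sketch: run a direct write from inside the group.
# $1 cgroup directory, $2 output file
throttled_direct_write() {
	local dir=$1 out=$2
	(
		# move this subshell into the group, then exec dd so that
		# dd keeps the same PID and therefore the same cgroup
		echo "$BASHPID" > "$dir/tasks"
		exec dd if=/dev/zero of="$out" bs=4k count=16 oflag=direct 2>/dev/null
	)
}

# Example (needs root and a mounted blkio controller; all values assumed):
#   setup_throttled_group /sys/fs/cgroup/blkio slow 8:0 1
#   throttled_direct_write /sys/fs/cgroup/blkio/slow /mnt/test/zero &
#   ls /mnt/test; sync
```

If "ls" and "sync" from an unthrottled shell return promptly while the dd
is still stuck behind the 1 byte/s limit, the filesystem is not being
serialized behind the throttled writer.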

Tested-by: Andrea Righi <andrea@betterlinux.com>

FYI, I've used the following script to test it if you're interested.
I tested both with O_DIRECT=1 and O_DIRECT=0.

-Andrea

---
#!/bin/bash
#
# blkio.throttle unit test
#
# This script creates many cgroups and spawns many parallel IO workers inside
# each cgroup.

# cgroupfs mount point
CGROUP_MP=/sys/fs/cgroup/blkio
# temporary directory used to generate IO
TMPDIR=/tmp

# how many cgroups?
CGROUPS=16
# how many IO workers per cgroup?
WORKERS=16
# max IO bandwidth of each cgroup
BW_MAX=$((1 * 1024 * 1024))
# max IO operations per second of each cgroup
IOPS_MAX=0

# IO block size
IO_BLOCK_SIZE=$((1 * 1024 * 1024))
# how many blocks to read/write (for each worker)
IO_BLOCK_NUM=4
# how many times each worker has to repeat the IO operation?
IO_COUNT=16

# enable O_DIRECT?
O_DIRECT=0

# timeout to consider a task blocked for too much time and dump a
# message in the kernel log (set to 0 to disable this check)
HUNG_TASK_TIMEOUT=60

cleanup_handler() {
	pkill sleep
	pkill dd
	echo "terminating..."
	sleep 10
	rmdir $CGROUP_MP/grp_*
	rm -rf $TMPDIR/grp_*
	sleep 1
	exit 1
}

worker() {
	out=$1

	if [ "z$O_DIRECT" = "z1" ]; then
		out_flags=oflag=direct
		in_flags=iflag=direct
	else
		out_flags=
		in_flags=
	fi
	sleep 5
	for i in `seq 1 $IO_COUNT`; do
		dd if=/dev/zero of=$out \
			bs=$IO_BLOCK_SIZE count=$IO_BLOCK_NUM \
			$out_flags 2>/dev/null
	done
	for i in `seq 1 $IO_COUNT`; do
		dd if=$out of=/dev/null \
			bs=$IO_BLOCK_SIZE count=$IO_BLOCK_NUM \
			$in_flags 2>/dev/null
	done
	rm -f $out
	unset out
}

spawn_workers() {
	grp=$1
	# whole-disk device backing $TMPDIR (assumes sdXN-style partition names)
	device=`df $TMPDIR | sed '1d' | awk '{print $1}' | sed 's/[0-9]*$//'`
	devnum=`grep $(basename $device)$ /proc/partitions | awk '{print $1":"$2}'`

	mkdir $CGROUP_MP/$grp

	echo $devnum $BW_MAX > $CGROUP_MP/$grp/blkio.throttle.read_bps_device
	echo $devnum $BW_MAX > $CGROUP_MP/$grp/blkio.throttle.write_bps_device

	echo $devnum $IOPS_MAX > $CGROUP_MP/$grp/blkio.throttle.read_iops_device
	echo $devnum $IOPS_MAX > $CGROUP_MP/$grp/blkio.throttle.write_iops_device

	mkdir -p $TMPDIR/$grp
	for i in `seq 1 $WORKERS`; do
		worker $TMPDIR/$grp/zero$i &
		echo $! > $CGROUP_MP/$grp/tasks
	done
	# wait without arguments waits for all background workers
	wait
	rmdir $TMPDIR/$grp
	rmdir $CGROUP_MP/$grp
	unset grp
}

# mount cgroupfs
mount -t cgroup -o blkio none $CGROUP_MP

# set hung task check timeout (help to catch system-wide lockups)
echo $HUNG_TASK_TIMEOUT > /proc/sys/kernel/hung_task_timeout_secs

# invalidate page cache
sync
echo 3 > /proc/sys/vm/drop_caches

# show expected bandwidth
bw=$(($CGROUPS * $BW_MAX / 1024 / 1024))
space=$(($CGROUPS * $WORKERS * $IO_BLOCK_SIZE * $IO_BLOCK_NUM / 1024 / 1024))
echo -ne "\n\n"
echo creating $CGROUPS cgroups, $WORKERS tasks per cgroup, bw=$BW_MAX
echo required disk space: $space MiB
echo expected average bandwidth: $bw MiB/s
echo -ne "\n\n"

# trap SIGINT and SIGTERM to quit cleanly
trap cleanup_handler SIGINT SIGTERM

# run workers
for i in `seq 1 $CGROUPS`; do
	spawn_workers grp_$i &
done

# wait for the completion of all the workers
wait

echo "test completed."
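
A rough way to judge whether throttling is actually taking effect is to
compare the observed run time with a lower bound derived from the
parameters. The sketch below is not part of the script above; the helper
name estimate_phase_secs is hypothetical, and it assumes that every byte
written by the workers of one cgroup is charged against that cgroup's
BW_MAX. The values mirror the script defaults.

```shell
#!/bin/bash
# Values copied from the test script defaults.
WORKERS=16
BW_MAX=$((1 * 1024 * 1024))
IO_BLOCK_SIZE=$((1 * 1024 * 1024))
IO_BLOCK_NUM=4
IO_COUNT=16

# Lower bound, in seconds, for one phase (all writes or all reads) of one
# cgroup, assuming all of its IO is limited to BW_MAX bytes per second.
estimate_phase_secs() {
	local bytes=$((WORKERS * IO_BLOCK_SIZE * IO_BLOCK_NUM * IO_COUNT))
	echo $((bytes / BW_MAX))
}

echo "per-cgroup lower bound: $(estimate_phase_secs) s per phase"
```

With the defaults this gives 1024 seconds for the write phase. A run that
finishes much faster suggests the writes were not throttled. Buffered reads
served from the page cache never reach the device, so the read phase can
legitimately be shorter than this bound.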
