From: Mathieu Avila <mathieu.avila@seanodes.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?
Date: Tue, 20 Feb 2007 11:59:04 +0100
Message-ID: <20070220115904.0ce08ac0@mathieu.toulouse>
In-Reply-To: <45DA1DEA.3080302@redhat.com>
On Mon, 19 Feb 2007 17:00:10 -0500,
Wendy Cheng <wcheng@redhat.com> wrote:
> Mathieu Avila wrote:
> > Hello all,
> > But precisely, in that situation, there are multiple times when
> > gfs_writepage cannot perform its duty, because of the transaction
> > lock.
> Yes, we did have this problem in the past with direct IO and SYNC
> flag.
I understand that in that case the data are written with the
get_transaction lock taken, so gfs_writepage never writes the pages,
and you go beyond the dirty_ratio limit if you write too many pages.
How did you get rid of it?
> > [snip]
> > we've experienced it using the test program "bonnie++" whose
> > purpose is to test a FS performance. Bonnie++ makes multiple files
> > of 1GB when it is asked to run long multi-Go writes. There is no
> > problem with 5 GB (5 files of 1 GB) but many machines in the
> > cluster are OOM killed with 10GB bonnies....
> >
> I would like to know more about your experiments. So these
> bonnie++(s) are run on each cluster node with independent file sets ?
>
Exactly. I run "bonnie++ -s 10G" on each node, in different directories
of the same GFS file system. To make the problem happen reliably and
more quickly, I tune pdflush so that it is almost disabled:
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 50000 > /proc/sys/vm/dirty_writeback_centisecs
..., so that only /proc/sys/vm/dirty_ratio comes into play. Without
this tuning the problem is harder to reproduce, but still reproducible.
The problem happens only when the writeback of the dirty pages of the
2nd file starts, once the 1st file is done. We still have to determine
why. So you can get the same problem with a program that:
- opens 2 files,
- writes 1 GB to the 1st one,
- writes indefinitely to the second file.
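The three steps above could be sketched in C roughly as follows. This is
not the program we actually ran, only a minimal illustration: the names
fill_file and run_repro, the 1 MiB buffer and the filler byte are all
assumptions made for the example.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `total` bytes of filler to fd. A negative
 * total means "keep writing until write() fails". Returns the number of
 * bytes written, or -1 on any write error (EINTR is not retried). */
long long fill_file(int fd, long long total)
{
    static char buf[1 << 20];            /* 1 MiB of filler per write() */
    long long done = 0;

    memset(buf, 0xAB, sizeof(buf));
    while (total < 0 || done < total) {
        size_t want = sizeof(buf);
        if (total >= 0 && (long long)want > total - done)
            want = (size_t)(total - done);
        ssize_t n = write(fd, buf, want);
        if (n <= 0)
            return -1;
        done += n;
    }
    return done;
}

/* Hypothetical reproducer: open both files, write first_bytes to the
 * first, then second_bytes to the second (negative = indefinitely).
 * Returns 0 on success and -1 on error; with a negative second_bytes
 * it only returns once a write fails, so it never returns 0. */
int run_repro(const char *first, const char *second,
              long long first_bytes, long long second_bytes)
{
    int fd1, fd2, rc = -1;

    if (first_bytes < 0)
        return -1;
    fd1 = open(first, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    fd2 = open(second, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd1 >= 0 && fd2 >= 0 &&
        fill_file(fd1, first_bytes) == first_bytes &&
        fill_file(fd2, second_bytes) >= 0)
        rc = 0;
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return rc;
}
```

Calling run_repro("file1", "file2", 1LL << 30, -1) from main() in a
directory of the GFS mount would correspond to the description: 1 GB in
the first file, then unbounded writes to the second.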
We use a particular block device that doesn't read/write at the same
speed on all nodes. I don't know if this helps.
> > Keep a counter of pages in gfs_inode whose value represents those
> > not written in gfs_writepage, and at the end of do_do_write_buf,
> > call "balance_dirty_pages_ratelimited(file->f_mapping);" as many
> > times. The counter is possibly shared by multiple processes, but we
> > are assured that there is no transaction at that moment so pages
> > can be flushed, if "balance_dirty_pages_ratelimited" determines
> > that it must reclaim dirty pages. Otherwise performance is not
> > affected.
> In general, this approach looks ok if we do have this flushing
> problem. However, GFS flush code has been embedded in glock code so I
> would think it would be better to do this within glock code.
These are points we do not understand.
- I understand that the flushing is done in the glock code (when the
lock is demoted or taken by another node, isn't it?). Why isn't it
possible to let the kernel decide which pages to flush when it needs
to? For example, in this particular case, it is not a good idea to
flush pages only when the lock is lost, because the kernel itself
needs to flush pages.
- Why doesn't gfs_writepage return an error when the page cannot be
flushed?
- The balance_dirty_pages_ratelimited function is called inside
get_transaction/set_transaction (i.e. between gfs_trans_begin and
gfs_trans_end), therefore gfs_writepage should never work. Do you know
if there is some kind of asynchronous (kiocb?) write, so that
gfs_writepage is called later in the same process? If not, what could
make gfs_writepage run sometimes inside a transaction and sometimes
not?
- Ext3 should be affected as well, but it isn't. Is that because the
transaction lock is held for a much shorter period of time, so that
dirty pages that are not flushed while the lock is held are
successfully flushed later?
- Some other file systems in the kernel, NTFS and ReiserFS, make
explicit calls to balance_dirty_pages_ratelimited:
http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L284
http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L2101
http://lxr.linux.no/source/fs/reiserfs/file.c?v=2.6.11;a=x86_64#L1351
But they also redefine some generic functions from the kernel. Maybe
they have a strong reason to do so?
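To make the third point and the counter proposal quoted earlier more
concrete, here is a toy user-space model in C. It is not GFS or kernel
code: every name in it (toy_inode, toy_writepage, toy_flush_skipped) is
made up for the illustration, and the call to
balance_dirty_pages_ratelimited is replaced by a direct retry of the
skipped pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names are GFS or kernel symbols. */
struct toy_inode {
    bool in_transaction;   /* stands in for the transaction lock being held */
    size_t dirty_pages;    /* pages dirtied but not yet written back */
    size_t skipped_pages;  /* writepage attempts refused inside a transaction */
};

/* Models one writepage attempt: inside a transaction the page cannot be
 * written, so the attempt is only counted. Returns true if a page was
 * written back. */
bool toy_writepage(struct toy_inode *ip)
{
    if (ip->dirty_pages == 0)
        return false;
    if (ip->in_transaction) {
        ip->skipped_pages++;
        return false;
    }
    ip->dirty_pages--;
    return true;
}

/* Models the end of the write path, after the transaction is closed:
 * retry once per skipped attempt and reset the counter. Returns the
 * number of pages actually flushed; does nothing inside a transaction. */
size_t toy_flush_skipped(struct toy_inode *ip)
{
    size_t flushed = 0;

    if (ip->in_transaction)
        return 0;
    while (ip->skipped_pages > 0) {
        ip->skipped_pages--;
        if (toy_writepage(ip))
            flushed++;
    }
    return flushed;
}
```

In this model, attempts made while in_transaction is set never reduce
dirty_pages, which is the behaviour we see; the retry after the
transaction is the only place where the backlog can shrink.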
> -- Wendy
Thank you for your answer,
--
Mathieu