From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mathieu Avila
Date: Tue, 20 Feb 2007 11:59:04 +0100
Subject: [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?
In-Reply-To: <45DA1DEA.3080302@redhat.com>
References: <20070219160050.14eb6c10@mathieu.toulouse> <45DA1DEA.3080302@redhat.com>
Message-ID: <20070220115904.0ce08ac0@mathieu.toulouse>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Mon, 19 Feb 2007 17:00:10 -0500, Wendy Cheng wrote:

> Mathieu Avila wrote:
> > Hello all,
> > But precisely, in that situation, there are multiple times when
> > gfs_writepage cannot perform its duty, because of the transaction
> > lock.
> Yes, we did have this problem in the past with direct IO and SYNC
> flag.

I understand that in that case data are written with the
get_transaction lock held, so that gfs_writepage never writes the
pages, and you go beyond the dirty_ratio limit if you write too many
pages. How did you get rid of it?

> > [snip]
> > we've experienced it using the test program "bonnie++" whose
> > purpose is to test FS performance. Bonnie++ makes multiple 1 GB
> > files when it is asked to run long multi-GB writes. There is no
> > problem with 5 GB (5 files of 1 GB) but many machines in the
> > cluster are OOM-killed with 10 GB bonnies...
> >
> I would like to know more about your experiments. So these
> bonnie++(s) are run on each cluster node with independent file sets?
>

Exactly. I run "bonnie++ -s 10G" on each node, in different
directories of the same GFS file system.
To make the problem happen reliably and more quickly, I tune pdflush
so that it is effectively disabled:

echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 50000 > /proc/sys/vm/dirty_writeback_centisecs

..., so that only /proc/sys/vm/dirty_ratio comes into play.
Not doing this makes the problem harder to reproduce (but it is still
reproducible).

The problem happens only when the writeback of the dirty pages of the
2nd file starts, once writeback of the 1st file is done. We still have
to determine why.
So you can get the same problem with a program that (see the rough
sketch below):
- opens 2 files,
- writes 1 GB to the 1st one,
- writes indefinitely into the 2nd one.

We use a particular block device that doesn't read/write at the same
speed on all nodes. I don't know if this can help.
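
Something along these lines should show it (a minimal sketch, not the
exact program we used; the file names and the 1 MB chunk size are
arbitrary):

/*
 * Reproducer sketch: fill a first file with 1 GB of data, then keep
 * dirtying pages in a second file indefinitely. Run it in a directory
 * on the GFS mount, with pdflush tuned down as described above.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)         /* 1 MB per write() */

int main(void)
{
        char *buf = malloc(CHUNK);
        int fd1, fd2;
        long i;

        if (!buf)
                return 1;
        memset(buf, 'x', CHUNK);

        fd1 = open("file1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        fd2 = open("file2", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd1 < 0 || fd2 < 0) {
                perror("open");
                return 1;
        }

        /* 1024 chunks of 1 MB = 1 GB written to the first file. */
        for (i = 0; i < 1024; i++)
                if (write(fd1, buf, CHUNK) != CHUNK) {
                        perror("write file1");
                        return 1;
                }

        /* Then write indefinitely to the second file. */
        for (;;)
                if (write(fd2, buf, CHUNK) != CHUNK) {
                        perror("write file2");
                        return 1;
                }
}
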
> > Keep a counter of pages in gfs_inode whose value represents those
> > not written in gfs_writepage, and at the end of do_do_write_buf,
> > call "balance_dirty_pages_ratelimited(file->f_mapping);" as many
> > times. The counter is possibly shared by multiple processes, but we
> > are assured that there is no transaction at that moment so pages
> > can be flushed, if "balance_dirty_pages_ratelimited" determines
> > that it must reclaim dirty pages. Otherwise performance is not
> > affected.
> In general, this approach looks ok if we do have this flushing
> problem. However, GFS flush code has been embedded in glock code so I
> would think it would be better to do this within glock code.

These are the points we do not understand:

- I understand that the flushing is done in the glock code (when the
lock is demoted or taken by another node, isn't it?). Why isn't it
possible to let the kernel decide which pages to flush when it needs
to? In this particular case, flushing the pages only when the lock is
lost is not a good idea; the kernel needs to flush pages before that.

- Why doesn't gfs_writepage return an error when the page cannot be
flushed?

- The balance_dirty_pages_ratelimited function is called with the
transaction set (get_transaction/set_transaction, i.e. between
gfs_trans_begin and gfs_trans_end), therefore gfs_writepage should
never work. Do you know if there is some kind of asynchronous
(kiocb?) write, so that gfs_writepage is called later in the same
process? If not, what could make gfs_writepage happen sometimes
inside a transaction, and sometimes not?

- Ext3 should be affected as well, but it isn't. Is that because its
transaction lock is held for a much shorter time, so that dirty pages
that are not flushed while the lock is held will be successfully
flushed later?

- Some other file systems in the kernel, NTFS and ReiserFS, make
explicit calls to balance_dirty_pages_ratelimited:
http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L284
http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L2101
http://lxr.linux.no/source/fs/reiserfs/file.c?v=2.6.11;a=x86_64#L1351
But they also redefine some generic functions from the kernel. Maybe
they have a strong reason to do so?

> -- Wendy

Thank you for your answer,

--
Mathieu