From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mathieu Avila
Date: Mon, 19 Feb 2007 16:00:50 +0100
Subject: [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?
Message-ID: <20070219160050.14eb6c10@mathieu.toulouse>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Hello all,

I need advice about a bug in GFS that may also affect other filesystems
(such as ext3).

The problem:

It is possible that the function "ops_address.c :: gfs_writepage" does not
write the page it is asked to, because the transaction lock is held. This
is valid, and in such a case it should return an error code so that the
kernel knows the page could not be written. But this function does not
return an error code; instead, it returns 0. I've looked at ext3, and it
does the same. This is valid and there is no corruption, as the page is
"redirtied" so that it will be flushed later. Returning an error code is
not a solution, because it is possible that no page is flushable at all,
and 'sync' would also misinterpret the error code as an I/O error. There
may be other implications, too.

The problem appears when the filesystem is under heavy stress. I've made a
test program that opens one file, writes 1 GB (at least more than the
system's total memory), then opens a second file and writes as much data
as it can. When the number of dirty pages goes beyond
/proc/sys/vm/dirty_ratio, some pages must be flushed synchronously, so
that the writer is blocked from writing and the system does not starve of
free clean pages to use. But precisely in that situation, there are many
moments when gfs_writepage cannot perform its duty because of the
transaction lock. The pages that we need to flush cannot be flushed, but
the point is that they are accounted as flushed anyway. The process is
then authorized to continue writing data, and more and more pages are
marked dirty.
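The accounting mistake described above can be shown with a toy user-space
model (illustration only, not kernel code; the names and the page counts
are made up for the example): a writepage that silently redirties under the
lock still returns 0, so its caller believes progress was made.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static int dirty_pages = 0;

/* Returns 0 ("success") even when the page could not be written,
 * mirroring what gfs_writepage does while the transaction lock is held. */
static int toy_writepage(bool transaction_locked)
{
	if (transaction_locked)
		return 0;	/* page silently stays dirty */
	dirty_pages--;		/* page really written out */
	return 0;
}

int main(void)
{
	int believed_cleaned = 0;

	dirty_pages = 100;
	/* The flusher tries to clean 50 pages while the lock is held. */
	for (int i = 0; i < 50; i++)
		if (toy_writepage(true) == 0)
			believed_cleaned++;

	/* The caller thinks 50 pages were cleaned; none actually were. */
	assert(believed_cleaned == 50);
	assert(dirty_pages == 100);
	printf("believed cleaned: %d, actually dirty: %d\n",
	       believed_cleaned, dirty_pages);
	return 0;
}

This is why the writer is allowed to keep dirtying pages past dirty_ratio.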
Fortunately, in most cases pdflush comes to the rescue: it starts flushing
pages so that the program recovers free pages to dirty. Sometimes the
system is stable, if the pages sent by pdflush are written quickly enough
by the block device. But if we make pdflush less aggressive and raise
dirty_ratio, it is possible to trigger the OOM killer on a machine and
then be completely blocked. In practice, we have experienced this with the
test program "bonnie++", whose purpose is to benchmark filesystem
performance. Bonnie++ creates multiple 1 GB files when it is asked to run
long multi-GB writes. There is no problem with 5 GB (5 files of 1 GB), but
many machines in the cluster are OOM-killed with 10 GB bonnies... Setting
more aggressive parameters for dirty_ratio and pdflush is not a complete
solution (although the problem then happens much later, or not at all),
and it kills performance.

Proposed solution:

Keep a counter in gfs_inode of the pages that were not written in
gfs_writepage, and at the end of do_do_write_buf call
"balance_dirty_pages_ratelimited(file->f_mapping);" that many times. The
counter may be shared by multiple processes, but at that point we are
assured there is no transaction in progress, so the pages can be flushed
if "balance_dirty_pages_ratelimited" determines that it must reclaim dirty
pages. Otherwise performance is not affected.

About ext3: on a normal local disk with ext3 the problem does not happen,
although it should be affected as well. I suppose that is because ext3
does not hold the transaction lock as long, and a local disk is fast
enough that pdflush keeps the dirty-page count low.

Patch below. Beware, it's quite ugly.
credits: Oliver.Cozette at seanodes.com and Mathieu.Avila at seanodes.com

--
Mathieu Avila

Index: cluster/gfs-kernel/src/gfs/ops_file.c
===================================================================
--- cluster/gfs-kernel/src/gfs/ops_file.c~
+++ cluster/gfs-kernel/src/gfs/ops_file.c
@@ -25,7 +25,7 @@
 #include
 #include
 #include
-
+#include
 #include "gfs.h"
 #include "bmap.h"
 #include "dio.h"
@@ -824,6 +824,22 @@
 		goto fail_ipres;
 	}
 
+	do
+	{
+		int pages_not_written = atomic_read(&ip->pages_not_written);
+		int result;
+		if (pages_not_written <= 0)
+		{
+			break;
+		}
+		result = cmpxchg(&ip->pages_not_written.counter,
+				 pages_not_written, (pages_not_written - 1));
+		if (result == pages_not_written)
+		{
+			balance_dirty_pages_ratelimited(file->f_mapping);
+		}
+	}
+	while (1);
+
Index: cluster/gfs-kernel/src/gfs/incore.h
===================================================================
--- cluster/gfs-kernel/src/gfs/incore.h~
+++ cluster/gfs-kernel/src/gfs/incore.h
@@ -617,6 +617,8 @@
 	unsigned int i_greedy;		/* The amount of time to be greedy */
 	unsigned long i_last_pfault;	/* The time of the last page fault */
+
+	atomic_t pages_not_written;	/* Due to journal locking, how many pages were not written when "gfs_writepage" was called */
 };
 
 /*
Index: cluster/gfs-kernel/src/gfs/ops_address.c
===================================================================
--- cluster/gfs-kernel/src/gfs/ops_address.c~
+++ cluster/gfs-kernel/src/gfs/ops_address.c
@@ -179,6 +183,7 @@
 		return -EIO;
 	}
 	if (get_transaction) {
+		atomic_inc(&ip->pages_not_written);
 		redirty_page_for_writepage(wbc, page);
 		unlock_page(page);
 		return 0;
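For what it's worth, the drain loop in the ops_file.c hunk can be sketched
in user space (a minimal illustration with C11 atomics; the counter value,
the stub, and the function names here are invented for the example, while
the kernel patch uses atomic_t and cmpxchg): each successful
compare-exchange accounts for exactly one skipped page, so concurrent
writers never balance the same page twice.

#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical stand-ins for the per-inode counter of pages that
 * gfs_writepage could not flush, and for balance_dirty_pages_ratelimited(). */
static atomic_int pages_not_written = 3;
static int balance_calls = 0;

static void balance_dirty_pages_stub(void)
{
	balance_calls++;	/* in the kernel: throttle/flush dirty pages */
}

/* Drain the counter, paying one balancing call per skipped page. */
static void drain_unwritten_pages(void)
{
	for (;;) {
		int seen = atomic_load(&pages_not_written);
		if (seen <= 0)
			break;
		/* Only the thread that wins the race decrements the counter
		 * and pays the cost, so the work is not duplicated. */
		if (atomic_compare_exchange_weak(&pages_not_written,
						 &seen, seen - 1))
			balance_dirty_pages_stub();
	}
}

int main(void)
{
	drain_unwritten_pages();
	assert(atomic_load(&pages_not_written) == 0);
	assert(balance_calls == 3);
	printf("balanced %d times\n", balance_calls);
	return 0;
}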