fio --direct=1 and Linux page cache effects

* fio --direct=1 and Linux page cache effects
From: Ken Raeburn @ 2012-12-05  0:23 UTC
  To: fio

(tl;dr summary: Sometimes the Linux page cache slows down fio --direct=1
tests despite the direct I/O, and whether or not it happens appears to
depend on a race condition involving multiple fio job threads and udev, as
well as a quirk of Linux page-cache behavior. As a performance testing
program, it should be consistent.)

We've been running some fio tests against a Linux device-mapper driver
we're working on, and we've found a curious bimodal distribution of
performance values with "fio --direct=1 --rw=randwrite --ioengine=libaio
--numjobs=2 --thread --norandommap ..." directly to the device, on the 3.2
kernel (using the Debian "Squeeze" distro).

One of our developers found that in the "slow" cases, the kernel is
spending more time in __lookup in the kernel radix tree library (used by
the page cache) than in the "fast" cases, even though we're using direct
I/O.

After digging into the fio and kernel code a while, and sacrificing a
couple chickens at the altar of SystemTap [1], this is my current
understanding of the situation:

During the do_io call, the target file (in my case, a block device) is
opened by each "job" thread. There's a hash table keyed by filename. In
most runs, a job thread either will find the filename in the hash already,
or will not find it and after opening the file will add an entry to the
hash table.

Once in a while, though, a thread will not find the filename in the hash
table, but after it opens the file and tries to update the hash table, it
finds the filename is now present, thanks to the other job thread. This
causes the generic_open_file code to close the file and try again, this
next time finding the filename in the hash table.
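
A minimal sketch of the check-then-act pattern described above, not fio's actual code. The table and the names `lookup_name`, `insert_name`, and `open_with_retry` are hypothetical, and the second thread is simulated by a flag so the outcome is deterministic. It only illustrates why losing the race costs one extra open/close of the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 8

/* Hypothetical stand-in for fio's filename hash: a tiny fixed table. */
struct name_table {
    const char *names[MAX_NAMES];
    size_t count;
};

static bool lookup_name(const struct name_table *t, const char *name)
{
    for (size_t i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return true;
    return false;
}

/* Returns false if the name is already present or the table is full. */
static bool insert_name(struct name_table *t, const char *name)
{
    if (lookup_name(t, name) || t->count == MAX_NAMES)
        return false;
    t->names[t->count++] = name;
    return true;
}

/*
 * Simulated open path for one job thread. If rival_wins is true, another
 * thread inserts the same name after our lookup but before our insert.
 * Returns how many times the device was opened and then closed unused;
 * each such close is the event that would wake udev/blkid.
 */
static int open_with_retry(struct name_table *t, const char *name,
                           bool rival_wins)
{
    int wasted_closes = 0;

    assert(name != NULL);
    for (;;) {
        if (lookup_name(t, name))
            return wasted_closes;   /* reuse the entry found in the table */

        /* the device would be opened here, then published below */
        if (rival_wins) {
            insert_name(t, name);   /* the other thread gets there first */
            rival_wins = false;
        }
        if (insert_name(t, name))
            return wasted_closes;   /* we own the entry */

        wasted_closes++;            /* lost the race: close and retry */
    }
}
```

Calling `open_with_retry(&t, "/dev/mapper/zzz", true)` on an empty table reports one wasted close, while the uncontended call reports none.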

However, the closing of a file opened with read/write access triggers udev
to run blkid and sometimes udisks-part-id. These run quickly, but open and
read the device without using O_DIRECT. This causes some pages (about 25)
to be inserted into the page cache, and the page count in the file's
associated mapping structure is incremented. Those entries are only
discarded when the individual pages are overwritten (unlikely to happen
for all the pages under randwrite and norandommap unless we write far more
than the device size), or all at once when the open-handle count on the
file goes to zero (which won't be until the test finishes), or memory
pressure gets too high, etc.

As fio runs its test, the kernel function generic_file_direct_write is
called, and if page cache mappings exist (mapping->nrpages > 0), it calls
into the page cache code to invalidate any mappings associated with the
pages being written. On our test machines, the cost seems to be on the
microsecond scale per invalidation call, but it adds up; at 1GB/s using
4KB pages, for example, we would invalidate 256K pages per second.
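
The arithmetic behind that figure can be written out as a small sketch. The function names are hypothetical, 1GB is taken as 2^30 bytes, and the 1us per-call cost used in the usage note is only an illustrative value for "microsecond scale", not a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Invalidation calls per second if every direct write of block_bytes
 * triggers one call. */
static uint64_t invalidations_per_sec(uint64_t bytes_per_sec,
                                      uint64_t block_bytes)
{
    assert(block_bytes > 0);
    return bytes_per_sec / block_bytes;
}

/* Fraction of one CPU-second consumed at a given per-call cost (in us). */
static double invalidation_cpu_fraction(uint64_t calls_per_sec,
                                        double usec_per_call)
{
    return (double)calls_per_sec * usec_per_call / 1e6;
}
```

With `invalidations_per_sec(1ULL << 30, 4096)` the result is 262144 calls per second (256K); at an assumed 1us per call that is roughly a quarter of one CPU's time.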

For an easy-to-describe example that highlights the problem, I tested with
the device-mapper dm-zero module (dmsetup create zzz --table "0
4000000000000 zero"; fio --filename=/dev/mapper/zzz ...) which involves no
external I/O hardware, to see what just the kernel's behavior is. This
device ignores data on writing and zero-fills on read, just like
/dev/zero, except it's a block device and thus can interact with the page
cache.

I did a set of runs with a 4KB block size[2], and got upwards of 3800MB/s
when the race condition didn't trigger; the few times when it did, the
write rate was under 2400MB/s, a drop of over 35%. (Since that's with two
threads, the individual threads are doing 1900MB/s vs 1200MB/s, or
2.2us/block vs 3.4us/block.) A couple of 10x larger test runs got similar
results, though both fast and slow "modes" were a little bit faster.
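
For reference, the per-block times and the percentage drop follow from the throughput numbers as sketched below. The helper names are hypothetical, and MB is assumed to mean 10^6 bytes.

```c
#include <assert.h>

/* Time per block in microseconds for one thread, given its throughput
 * in MB/s (MB assumed to be 10^6 bytes) and the block size in bytes. */
static double usec_per_block(double mb_per_sec, double block_bytes)
{
    assert(mb_per_sec > 0.0);
    double seconds = block_bytes / (mb_per_sec * 1e6);
    return seconds * 1e6;
}

/* Relative drop from the fast rate to the slow rate, in percent. */
static double percent_drop(double fast, double slow)
{
    assert(fast > 0.0);
    return (fast - slow) / fast * 100.0;
}
```

`usec_per_block(1900, 4096)` comes out near 2.2 and `usec_per_block(1200, 4096)` near 3.4, and `percent_drop(3800, 2400)` is a little under 37.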

With a 1MB block size, fewer invalidation calls happened but they operated
on larger ranges of addresses in each call, and the difference was down
around 2%. The results were in two very tightly grouped clusters, though,
so the 2% isn't just random run-to-run variance.

Why each invalidation call should be that expensive with so few pages
mapped, I don't know, but it appears to be costly, if you've got a device
that should get GB/s-range performance. (There may be lock contention
issues exacerbating the problem, since the mapping is shared.)

Not using --norandommap seems to consistently avoid the problem, probably
because it calls smalloc in each thread, which allocates and clears the
random-map memory while holding a global lock; that may stagger processing
in the threads enough to avoid the race most of the time, but probably
doesn't guarantee it. Using one job, or not using --thread, would avoid
the problem because there wouldn't be two threads competing over one
instance of the hash table.

I have a few ideas to try to make the behavior more consistent. I tried
using a global lock in filesetup.c:generic_open_file for file opening and
hash table updating; it seems to eliminate the "slow" results, but it
seems rather hackish, so I'm still looking at how to fix the issue.
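
A rough sketch of the global-lock idea, assuming a single process-wide pthread mutex. This is not the change that was tried in fio; `open_lock`, `opened`, and `open_once_locked` are hypothetical names, and the real open() call is only marked by a comment.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define MAX_FILES 8

/* One lock covers both the lookup and the insert, so no thread can slip
 * in between them. */
static pthread_mutex_t open_lock = PTHREAD_MUTEX_INITIALIZER;
static const char *opened[MAX_FILES];
static int nr_opened;

/*
 * Look up name and, if absent, open and record it, all under one lock.
 * Returns true if this caller performed the open, false if it reused an
 * existing entry. No caller ever opens and then has to close again.
 */
static bool open_once_locked(const char *name)
{
    bool found = false;
    bool did_open = false;

    pthread_mutex_lock(&open_lock);
    for (int i = 0; i < nr_opened; i++) {
        if (strcmp(opened[i], name) == 0) {
            found = true;
            break;
        }
    }
    if (!found) {
        assert(nr_opened < MAX_FILES);
        /* the real open() of the device would happen here */
        opened[nr_opened++] = name;
        did_open = true;
    }
    pthread_mutex_unlock(&open_lock);

    return did_open;
}
```

The cost is that all opens are serialized, which matters little when opens are rare compared with the I/O that follows.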

(Thanks to Michael Sclafani at Permabit for his help in digging into
this.)

Oh yes: I'm also seeing another blkid run triggered at the start of the
fio invocation, I believe caused by opening and closing the device in
order to ascertain its size. There's a delay of 0.1s or so before the
actual test starts, which seems to be long enough for blkid to complete,
but I don't see anything that ensures that it actually has completed.
There may be another difficult-to-hit race condition there.

Ken

[1] I log calls to __blkdev_get, __blkdev_put, blkdev_close, blkdev_open
for the target device with process ids and names, and bdev->bd_openers
values, and periodically report mapping->nrpages with the saved
filp->f_mapping value from blkdev_open, so I can see when the blkid and
fio open/close sequences overlap, and when page cache mappings are
retained. The script is available at http://pastebin.com/gM3kURHp for now.

[2] My full command line, including some irrelevant options inherited from
our original test case: .../fio --bs=4096 --rw=randwrite
--name=generic_job_name --filename=/dev/mapper/zzz --numjobs=2
--size=26843545600 --thread --norandommap --group_reporting
--gtod_reduce=1 --unlink=0 --direct=1 --rwmixread=70 --iodepth=1024
--iodepth_batch_complete=16 --iodepth_batch_submit=16 --ioengine=libaio
--scramble_buffers=1 --offset=0 --offset_increment=53687091200

* Re: fio --direct=1 and Linux page cache effects
From: Jens Axboe @ 2012-12-06 21:06 UTC
  To: Ken Raeburn; +Cc: fio

On 2012-12-05 01:23, Ken Raeburn wrote:
> 
> (tl;dr summary: Sometimes the Linux page cache slows down fio --direct=1
> tests despite the direct I/O, and whether or not it happens appears to
> depend on a race condition involving multiple fio job threads and udev, as
> well as a quirk of Linux page-cache behavior. As a performance testing
> program, it should be consistent.)

Thanks for this nice analysis! For most workloads, adding a global lock
for the duration of the file open is not an issue. So while it seems
like a hack, I don't necessarily think it's a bad solution to the issue.

This isn't the first time where blkid has caused confusing behaviour or
issues for folks. Another approach would be to just disable this
behaviour in the system. But it'd be better if fio could eliminate any
side effects of it at least, as the bi-modal behaviour can be extremely
annoying to find and diagnose (as I'm sure you found above too).

In other words, let me know if you find a great solution for this. If
not, I think we should just do the global file open lock for now.

-- 
Jens Axboe


* Re: fio --direct=1 and Linux page cache effects
From: Jens Axboe @ 2012-12-07  9:00 UTC
  To: Ken Raeburn; +Cc: fio

On 2012-12-06 22:06, Jens Axboe wrote:
> Thanks for this nice analysis! For most workloads, adding a global lock
> for the duration of the file open is not an issue. So while it seems
> like a hack, I don't necessarily think it's a bad solution to the issue.
> 
> This isn't the first time where blkid has caused confusing behaviour or
> issues for folks. Another approach would be to just disable this
> behaviour in the system. But it'd be better if fio could eliminate any
> side effects of it at least, as the bi-modal behaviour can be extremely
> annoying to find and diagnose (as I'm sure you found above too).
> 
> In other words, let me know if you find a great solution for this. If
> not, I think we should just do the global file open lock for now.

Another idea would be to ensure that we do the bdev cache invalidation
at a safe point. But that is done at open time right now, which should
invalidate any cache mappings for the device. If the file close triggers
blkid and there are no further opens of the device, then those mappings
would remain.

Would the below work for you?

diff --git a/filesetup.c b/filesetup.c
index f4e1adc..78fde39 100644
--- a/filesetup.c
+++ b/filesetup.c
@@ -547,7 +547,19 @@ open_again:
 		td_verror(td, __e, buf);
 	}
 
-	if (!from_hash && f->fd != -1) {
+	/*
+	 * If we are hitting this due to an attempted alias insert, we
+	 * could have stale mappings in the cache. This is commonly a
+	 * problem on Linux, where blkid will trigger cached IO to the
+	 * device at close time. Because of that, invalidate cache if
+	 * we found it in hash.
+	 */
+	if (from_hash) {
+		if (td->o.invalidate_cache) {
+			int fio_unused ret;
+			ret = file_invalidate_cache(td, f);
+		}
+	} else if (f->fd != -1) {
 		if (add_file_hash(f)) {
 			int fio_unused ret;
 

-- 
Jens Axboe


* Re: fio --direct=1 and Linux page cache effects
From: Ken Raeburn @ 2012-12-11  1:00 UTC
  To: Jens Axboe; +Cc: fio

Jens Axboe <axboe@kernel.dk> writes:

> Another idea would be to ensure that we do the bdev cache invalidation
> at a safe point. But that is done at open time right now, which should
> invalidate any cache mappings for the device. If the file close triggers
> blkid and there are no further opens of the device, then those mappings
> would remain.
>
> Would the below work for you?

It looks like it's not enough; the BLKFLSBUF ioctl winds up happening
before blkid runs. I added some instrumentation to the patch and saw
that one thread took the "f->fd != -1" path and then, after
add_file_hash failed, came back to the "from_hash" path. But all of that
seems to finish before blkid gets going, so the pages it pulls in still
won't get invalidated.
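
For readers unfamiliar with the ioctl, here is a minimal standalone sketch of issuing BLKFLSBUF against a block device. It is not fio's invalidation code; `flush_bdev_cache` is a hypothetical helper, and the call normally needs root privileges and a real block device to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKFLSBUF */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Ask the kernel to flush and drop cached pages for the block device at
 * path. Returns 0 on success or an errno value on failure.
 */
static int flush_bdev_cache(const char *path)
{
    int err = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    if (ioctl(fd, BLKFLSBUF) < 0)
        err = errno;
    close(fd);
    return err;
}
```

As the message above notes, a flush only helps if it runs after blkid has finished reading; one issued earlier leaves the later pages in place.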

Ken

* Re: fio --direct=1 and Linux page cache effects
From: Jens Axboe @ 2012-12-11 13:29 UTC
  To: Ken Raeburn; +Cc: fio

On 2012-12-11 02:00, Ken Raeburn wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> Another idea would be to ensure that we do the bdev cache invalidation
>> at a safe point. But that is done at open time right now, which should
>> invalidate any cache mappings for the device. If the file close triggers
>> blkid and there are no further opens of the device, then those mappings
>> would remain.
>>
>> Would the below work for you?
> 
> It looks like it's not enough; the BLKFLSBUF ioctl winds up happening
> before blkid runs. I added some instrumentation to the patch and saw
> that one thread took the "f->fd != -1" path and then, after
> add_file_hash failed, came back to the "from_hash" path. But all of that
> seems to finish before blkid gets going, so the pages it pulls in still
> won't get invalidated.

You are right, blkid likely won't be done by then, so it's still down to
timing whether it'll help or not. This is pretty annoying.

This issue is due to the file being opened for write. But unfortunately
we cannot always open for read, as fcntl() won't allow changing the file
access mode flags.
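
The fcntl() limitation can be checked with a small sketch like the one below. The helper name `access_mode_after_setfl` is hypothetical; the behavior shown is the documented Linux one, where access-mode bits passed to F_SETFL are ignored.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path read-only, try to upgrade it to read/write with F_SETFL, and
 * report the access mode the descriptor ends up with (O_RDONLY, O_WRONLY
 * or O_RDWR), or -1 on error.
 */
static int access_mode_after_setfl(const char *path)
{
    int flags;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Access-mode bits in the argument are ignored by F_SETFL. */
    (void)fcntl(fd, F_SETFL, O_RDWR);

    flags = fcntl(fd, F_GETFL);
    close(fd);
    return flags < 0 ? -1 : (flags & O_ACCMODE);
}
```

On Linux, `access_mode_after_setfl("/dev/null")` still reports O_RDONLY, so a descriptor cannot be opened read-only first and switched to writing later.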

So how about the below. Basically DON'T close the fd, defer that until
we really close the file. This will keep one extra fd open until the
original is closed, but I don't see that as being an issue.

diff --git a/file.h b/file.h
index 3024c54..11695e2 100644
--- a/file.h
+++ b/file.h
@@ -65,6 +65,7 @@ struct fio_file {
 
 	void *file_data;
 	int fd;
+	int shadow_fd;
 #ifdef WIN32
 	HANDLE hFile;
 	HANDLE ioCP;
diff --git a/filesetup.c b/filesetup.c
index 3462a03..170572d 100644
--- a/filesetup.c
+++ b/filesetup.c
@@ -434,6 +434,12 @@ int generic_close_file(struct thread_data fio_unused *td, struct fio_file *f)
 		ret = errno;
 
 	f->fd = -1;
+
+	if (f->shadow_fd != -1) {
+		close(f->shadow_fd);
+		f->shadow_fd = -1;
+	}
+
 	return ret;
 }
 
@@ -552,9 +558,22 @@ open_again:
 			int fio_unused ret;
 
 			/*
-			 * OK to ignore, we haven't done anything with it
+			 * Stash away descriptor for later close. This is to
+			 * work-around a "feature" on Linux, where a close of
+			 * an fd that has been opened for write will trigger
+			 * udev to call blkid to check partitions, fs id, etc.
+			 * That pollutes the device cache, which can slow down
+			 * unbuffered accesses.
 			 */
-			ret = generic_close_file(td, f);
+			if (f->shadow_fd == -1)
+				f->shadow_fd = f->fd;
+			else {
+				/*
+			 	 * OK to ignore, we haven't done anything
+				 * with it
+				 */
+				ret = generic_close_file(td, f);
+			}
 			goto open_again;
 		}
 	}
@@ -1029,6 +1048,7 @@ int add_file(struct thread_data *td, const char *fname)
 	}
 
 	f->fd = -1;
+	f->shadow_fd = -1;
 	fio_file_reset(f);
 
 	if (td->files_size <= td->files_index) {

-- 
Jens Axboe


* Re: fio --direct=1 and Linux page cache effects
From: Ken Raeburn @ 2012-12-11 20:10 UTC
  To: Jens Axboe; +Cc: fio

On 12/11/12 08:29, Jens Axboe wrote:
> You are right, blkid likely won't be done by then, so it's still down 
> to timing whether it'll help or not. This is pretty annoying. This 
> issue is due to the file being opened for write. But unfortunately we 
> cannot open for read always, as fcntl() won't allow change of file 
> access mode flags. So how about the below. Basically DON'T close the 
> fd, defer that until we really close the file. This will keep one 
> extra fd open until the original is closed, but I don't see that as 
> being an issue.

This patch seems to be working just fine. I have reproduced the case 
where it uses the shadow_fd field, but blkid doesn't run and the page 
cache entries don't get loaded, and in my initial tests with dm-zero, it 
looks like the bimodal performance distribution is gone.
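
The deferred-close idea can be sketched in isolation as follows. This is a simplified illustration rather than the committed patch: `struct sketch_file` and the `sketch_*` functions are hypothetical, and unlike the patch the sketch resets `fd` to -1 when the descriptor is parked.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical, stripped-down stand-in for fio's struct fio_file. */
struct sketch_file {
    int fd;         /* descriptor used for I/O, -1 when closed */
    int shadow_fd;  /* parked descriptor awaiting a deferred close */
};

static void sketch_init(struct sketch_file *f)
{
    f->fd = -1;
    f->shadow_fd = -1;
}

/*
 * Called when an open turned out to be redundant. Instead of closing
 * f->fd right away, park it so the close (and the udev event it causes)
 * is deferred. Returns 1 if the descriptor was parked, 0 if one was
 * already parked and this one was closed immediately.
 */
static int sketch_park_fd(struct sketch_file *f)
{
    assert(f->fd != -1);
    if (f->shadow_fd == -1) {
        f->shadow_fd = f->fd;
        f->fd = -1;
        return 1;
    }
    close(f->fd);
    f->fd = -1;
    return 0;
}

/* Final close: release the working descriptor and any parked one. */
static void sketch_close(struct sketch_file *f)
{
    if (f->fd != -1) {
        close(f->fd);
        f->fd = -1;
    }
    if (f->shadow_fd != -1) {
        close(f->shadow_fd);
        f->shadow_fd = -1;
    }
}
```

The parked descriptor stays open for the whole test, so the close that would trigger blkid only happens once the run is over.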

Thanks!

Ken


* Re: fio --direct=1 and Linux page cache effects
From: Jens Axboe @ 2012-12-12  7:24 UTC
  To: Ken Raeburn; +Cc: fio

On 2012-12-11 21:10, Ken Raeburn wrote:
> On 12/11/12 08:29, Jens Axboe wrote:
>> You are right, blkid likely won't be done by then, so it's still down 
>> to timing whether it'll help or not. This is pretty annoying. This 
>> issue is due to the file being opened for write. But unfortunately we 
>> cannot open for read always, as fcntl() won't allow change of file 
>> access mode flags. So how about the below. Basically DON'T close the 
>> fd, defer that until we really close the file. This will keep one 
>> extra fd open until the original is closed, but I don't see that as 
>> being an issue.
> 
> This patch seems to be working just fine. I have reproduced the case 
> where it uses the shadow_fd field, but blkid doesn't run and the page 
> cache entries don't get loaded, and in my initial tests with dm-zero, it 
> looks like the bimodal performance distribution is gone.
> 
> Thanks!

Excellent, I think this is probably as good as it is going to get. I
have committed the fix. Thanks a lot for reporting it, especially in
such great detail. When a problem is fully understood, fixing it is then
the smallest part of the effort.

-- 
Jens Axboe

