* Disabling in-memory write cache for x86-64 in Linux II
@ 2013-10-25  7:25 Artem S. Tashkinov
  2013-10-25  8:18 ` Linus Torvalds
  2013-10-25 10:49 ` NeilBrown
  0 siblings, 2 replies; 21+ messages in thread
From: Artem S. Tashkinov @ 2013-10-25  7:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: torvalds, linux-fsdevel, axboe, linux-mm

Hello!

On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel built for the i686 (with PAE) and x86-64 architectures. What's really troubling me is that the x86-64 kernel has the following problem:

When I copy large files to any storage device, be it my HDD with ext4 partitions or a flash drive with FAT32 partitions, the kernel first caches them in memory entirely, then flushes them some time later (quite unpredictably, though) or immediately upon invoking "sync".

How can I disable this memory cache altogether (or at least minimize caching)? When running the i686 kernel with the same configuration I don't observe this effect - files get written out almost immediately (for instance, "sync" takes less than a second, whereas on x86-64 it can take a dozen _minutes_ depending on file size and storage performance).

I'm _not_ talking about disabling the write cache on my storage itself (hdparm -W 0 /dev/XXX) - firstly, that command is detrimental to the performance of my PC; secondly, it won't help in this instance.

Swap is totally disabled, and usually my memory is entirely free.

My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531

Please advise.

Best regards,

Artem

^ permalink raw reply	[flat|nested] 21+ messages in thread
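For context, the behaviour described above is governed by the kernel's vm.dirty_* writeback tunables rather than by the device's own write cache; a minimal sketch of how to inspect them and watch dirty pages build up during a copy (the 1-second interval is arbitrary):

    # Current writeback tunables (the ratios are % of dirtyable memory;
    # the *_bytes variants, when non-zero, override the ratios)
    sysctl vm.dirty_ratio vm.dirty_background_ratio \
           vm.dirty_bytes vm.dirty_background_bytes \
           vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

    # Watch dirty/writeback pages accumulate and drain while copying a large file
    watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
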
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 7:25 Disabling in-memory write cache for x86-64 in Linux II Artem S. Tashkinov @ 2013-10-25 8:18 ` Linus Torvalds 2013-11-05 0:50 ` Andreas Dilger 2013-11-05 6:32 ` Figo.zhang 2013-10-25 10:49 ` NeilBrown 1 sibling, 2 replies; 21+ messages in thread From: Linus Torvalds @ 2013-10-25 8:18 UTC (permalink / raw) To: Artem S. Tashkinov, Wu Fengguang, Andrew Morton Cc: Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote: > > On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel > built for the i686 (with PAE) and x86-64 architectures. What's really troubling me > is that the x86-64 kernel has the following problem: > > When I copy large files to any storage device, be it my HDD with ext4 partitions > or flash drive with FAT32 partitions, the kernel first caches them in memory entirely > then flushes them some time later (quite unpredictably though) or immediately upon > invoking "sync". Yeah, I think we default to a 10% "dirty background memory" (and allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB of dirty memory for writeout before we even start writing, and twice that before we start *waiting* for it. On 32-bit x86, we only count the memory in the low 1GB (really actually up to about 890MB), so "10% dirty" really means just about 90MB of buffering (and a "hard limit" of ~180MB of dirty). And that "up to 3.2GB of dirty memory" is just crazy. Our defaults come from the old days of less memory (and perhaps servers that don't much care), and the fact that x86-32 ends up having much lower limits even if you end up having more memory. You can easily tune it: echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes or similar. But you're right, we need to make the defaults much saner. Wu? Andrew? Comments? Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
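If the values Linus suggests work out on a given machine, they can be made persistent with sysctl rather than raw echoes; a sketch assuming a sysctl.d-style setup (the file name is arbitrary; note that setting the *_bytes variants makes the kernel ignore the corresponding *_ratio settings):

    # /etc/sysctl.d/90-writeback.conf
    # start background writeback at 16MB of dirty data, throttle writers at 48MB
    vm.dirty_background_bytes = 16777216
    vm.dirty_bytes = 50331648

    # apply without rebooting
    sysctl -p /etc/sysctl.d/90-writeback.conf
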
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 8:18 ` Linus Torvalds @ 2013-11-05 0:50 ` Andreas Dilger 2013-11-05 4:12 ` Dave Chinner 2013-11-05 6:32 ` Figo.zhang 1 sibling, 1 reply; 21+ messages in thread From: Andreas Dilger @ 2013-11-05 0:50 UTC (permalink / raw) To: Artem S. Tashkinov Cc: Wu Fengguang, Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm On Oct 25, 2013, at 2:18 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote: >> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 >> kernel built for the i686 (with PAE) and x86-64 architectures. What’s >> really troubling me is that the x86-64 kernel has the following problem: >> >> When I copy large files to any storage device, be it my HDD with ext4 >> partitions or flash drive with FAT32 partitions, the kernel first >> caches them in memory entirely then flushes them some time later >> (quite unpredictably though) or immediately upon invoking "sync". > > Yeah, I think we default to a 10% "dirty background memory" (and > allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB > of dirty memory for writeout before we even start writing, and twice > that before we start *waiting* for it. > > On 32-bit x86, we only count the memory in the low 1GB (really > actually up to about 890MB), so "10% dirty" really means just about > 90MB of buffering (and a "hard limit" of ~180MB of dirty). > > And that "up to 3.2GB of dirty memory" is just crazy. Our defaults > come from the old days of less memory (and perhaps servers that don't > much care), and the fact that x86-32 ends up having much lower limits > even if you end up having more memory. I think the “delay writes for a long time” is a holdover from the days when e.g. /tmp was on a disk and compilers had lousy IO patterns, then they deleted the file. Today, /tmp is always in RAM, and IMHO the “write and delete” workload tested by dbench is not worthwhile optimizing for. With Lustre, we’ve long taken the approach that if there is enough dirty data on a file to make a decent write (which is around 8MB today even for very fast storage) then there isn’t much point to hold back for more data before starting the IO. Any decent allocator will be able to grow allocated extents to handle following data, or allocate a new extent. At 4-8MB extents, even very seek-impaired media could do 400-800MB/s (likely much faster than the underlying storage anyway). This also avoids wasting (tens of?) seconds of idle disk bandwidth. If the disk is already busy, then the IO will be delayed anyway. If it is not busy, then why aggregate GB of dirty data in memory before flushing it? Something simple like “start writing at 16MB dirty on a single file” would probably avoid a lot of complexity at little real-world cost. That shouldn’t throttle dirtying memory above 16MB, but just start writeout much earlier than it does today. Cheers, Andreas -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
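There is no mainline knob for Andreas's per-file "start writing at 16MB" policy, but an application can approximate the effect from userspace by forcing IO out in fixed-size chunks; a rough sketch using dd (the file names and the 8MB block size are arbitrary):

    # O_DIRECT: each 8MB chunk goes straight to the device, so writeout starts
    # immediately instead of gigabytes of dirty data piling up in the page cache
    dd if=bigfile of=/mnt/target/bigfile bs=8M oflag=direct

    # Buffered variant: let the page cache absorb the copy, but have dd fsync
    # at the end so "finished" means "on media", not "cached"
    dd if=bigfile of=/mnt/target/bigfile bs=8M conv=fsync
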
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-11-05 0:50 ` Andreas Dilger @ 2013-11-05 4:12 ` Dave Chinner 2013-11-07 13:48 ` Jan Kara 0 siblings, 1 reply; 21+ messages in thread From: Dave Chinner @ 2013-11-05 4:12 UTC (permalink / raw) To: Andreas Dilger Cc: Artem S. Tashkinov, Wu Fengguang, Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > On Oct 25, 2013, at 2:18 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov <t.artem@lycos.com> wrote: > >> > >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 > >> kernel built for the i686 (with PAE) and x86-64 architectures. What’s > >> really troubling me is that the x86-64 kernel has the following problem: > >> > >> When I copy large files to any storage device, be it my HDD with ext4 > >> partitions or flash drive with FAT32 partitions, the kernel first > >> caches them in memory entirely then flushes them some time later > >> (quite unpredictably though) or immediately upon invoking "sync". > > > > Yeah, I think we default to a 10% "dirty background memory" (and > > allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB > > of dirty memory for writeout before we even start writing, and twice > > that before we start *waiting* for it. > > > > On 32-bit x86, we only count the memory in the low 1GB (really > > actually up to about 890MB), so "10% dirty" really means just about > > 90MB of buffering (and a "hard limit" of ~180MB of dirty). > > > > And that "up to 3.2GB of dirty memory" is just crazy. Our defaults > > come from the old days of less memory (and perhaps servers that don't > > much care), and the fact that x86-32 ends up having much lower limits > > even if you end up having more memory. > > I think the “delay writes for a long time” is a holdover from the > days when e.g. /tmp was on a disk and compilers had lousy IO > patterns, then they deleted the file. Today, /tmp is always in > RAM, and IMHO the “write and delete” workload tested by dbench > is not worthwhile optimizing for. > > With Lustre, we’ve long taken the approach that if there is enough > dirty data on a file to make a decent write (which is around 8MB > today even for very fast storage) then there isn’t much point to > hold back for more data before starting the IO. Agreed - write-through caching is much better for high throughput streaming data environments than write back caching that can leave the devices unnecessarily idle. However, most systems are not running in high-throughput streaming data environments... :/ > Any decent allocator will be able to grow allocated extents to > handle following data, or allocate a new extent. At 4-8MB extents, > even very seek-impaired media could do 400-800MB/s (likely much > faster than the underlying storage anyway). True, but this makes the assumption that the filesystem you are using is optimising purely for write throughput and your storage is not seek limited on reads. That's simply not an assumption we can allow the generic writeback code to make. In more detail, if we simply implement "we have 8 MB of dirty pages on a single file, write it" we can maximise write throughput by allocating sequentially on disk for each subsquent write. The problem with this comes when you are writing multiple files at a time, and that leads to this pattern on disk: ABC...ABC....ABC....ABC.... 
And the result is a) fragmented files b) a large number of seeks during sequential read operations and c) filesystems that age and degrade rapidly under workloads that concurrently write files with different life times (i.e. due to free space fragmention). In some situations this is acceptable, but the performance degradation as the filesystem ages that this sort of allocation causes in most environments is not. I'd say that >90% of filesystems out there would suffer accelerated aging as a result of doing writeback in this manner by default. > This also avoids wasting (tens of?) seconds of idle disk bandwidth. > If the disk is already busy, then the IO will be delayed anyway. > If it is not busy, then why aggregate GB of dirty data in memory > before flushing it? There are plenty of workloads out there where delaying IO for a few seconds can result in writeback that is an order of magnitude faster. Similarly, I've seen other workloads where the writeback delay results in files that can be *read* orders of magnitude faster.... > Something simple like “start writing at 16MB dirty on a single file” > would probably avoid a lot of complexity at little real-world cost. > That shouldn’t throttle dirtying memory above 16MB, but just start > writeout much earlier than it does today. That doesn't solve the "slow device, large file" problem. We can write data into the page cache at rates of over a GB/s, so it's irrelevant to a device that can write at 5MB/s whether we start writeback immediately or a second later when there is 500MB of dirty pages in memory. AFAIK, the only way to avoid that problem is to use write-through caching for such devices - where they throttle to the IO rate at very low levels of cached data. Realistically, there is no "one right answer" for all combinations of applications, filesystems and hardware, but writeback caching is the best *general solution* we've got right now. However, IMO users should not need to care about tuning BDI dirty ratios or even have to understand what a BDI dirty ratio is to select the rigth caching method for their devices and/or workload. The difference between writeback and write through caching is easy to explain and AFAICT those two modes suffice to solve the problems being discussed here. Further, if two modes suffice to solve the problems, then we should be able to easily define a trigger to automatically switch modes. /me notes that if we look at random vs sequential IO and the impact that has on writeback duration, then it's very similar to suddenly having a very slow device. IOWs, fadvise(RANDOM) could be used to switch an *inode* to write through mode rather than writeback mode to solve the problem aggregating massive amounts of random write IO in the page cache... So rather than treating this as a "one size fits all" type of problem, let's step back and: a) define 2-3 different caching behaviours we consider optimal for the majority of workloads/hardware we care about. b) determine optimal workloads for each caching behaviour. c) develop reliable triggers to detect when we should switch between caching behaviours. e.g: a) write back caching - what we have now write through caching - extremely low dirty threshold before writeback starts, enough to optimise for, say, stripe width of the underlying storage. b) write back caching: - general purpose workload write through caching: - slow device, write large file, sync - extremely high bandwidth devices, multi-stream sequential IO - random IO. 
c) write back caching: - default - fadvise(NORMAL, SEQUENTIAL, WILLNEED) write through caching: - fadvise(NOREUSE, DONTNEED, RANDOM) - random IO - sequential IO, BDI write bandwidth <<< dirty threshold - sequential IO, BDI write bandwidth >>> dirty threshold I think that covers most of the issues and use cases that have been discussed in this thread. IMO, this is the level at which we need to solve the problem (i.e. architectural), not at the level of "let's add sysfs variables so we can tweak bdi ratios". Indeed, the above implies that we need the caching behaviour to be a property of the address space, not just a property of the backing device. IOWs, the implementation needs to trickle down from a coherent high level design - that will define the knobs that we need to expose to userspace. We should not be adding new writeback behaviours by adding knobs to sysfs without first having some clue about whether we are solving the right problem and solving it in a sane manner... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
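The allocation interleaving Dave describes can be observed directly on an existing filesystem; a hedged sketch using filefrag from e2fsprogs (the file names, sizes and the choice of two concurrent writers are arbitrary, and whether interleaving actually shows up depends on the filesystem's allocator and delayed allocation):

    # Write two files concurrently in small buffered chunks...
    dd if=/dev/zero of=fileA bs=64k count=4000 &
    dd if=/dev/zero of=fileB bs=64k count=4000 &
    wait
    sync
    # ...then see how many extents each file ended up with
    filefrag -v fileA fileB
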
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-11-05 4:12 ` Dave Chinner @ 2013-11-07 13:48 ` Jan Kara 2013-11-11 3:22 ` Dave Chinner 0 siblings, 1 reply; 21+ messages in thread From: Jan Kara @ 2013-11-07 13:48 UTC (permalink / raw) To: Dave Chinner Cc: Andreas Dilger, Artem S. Tashkinov, Wu Fengguang, Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm On Tue 05-11-13 15:12:45, Dave Chinner wrote: > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > Something simple like “start writing at 16MB dirty on a single file” > > would probably avoid a lot of complexity at little real-world cost. > > That shouldn’t throttle dirtying memory above 16MB, but just start > > writeout much earlier than it does today. > > That doesn't solve the "slow device, large file" problem. We can > write data into the page cache at rates of over a GB/s, so it's > irrelevant to a device that can write at 5MB/s whether we start > writeback immediately or a second later when there is 500MB of dirty > pages in memory. AFAIK, the only way to avoid that problem is to > use write-through caching for such devices - where they throttle to > the IO rate at very low levels of cached data. Agreed. > Realistically, there is no "one right answer" for all combinations > of applications, filesystems and hardware, but writeback caching is > the best *general solution* we've got right now. > > However, IMO users should not need to care about tuning BDI dirty > ratios or even have to understand what a BDI dirty ratio is to > select the rigth caching method for their devices and/or workload. > The difference between writeback and write through caching is easy > to explain and AFAICT those two modes suffice to solve the problems > being discussed here. Further, if two modes suffice to solve the > problems, then we should be able to easily define a trigger to > automatically switch modes. > > /me notes that if we look at random vs sequential IO and the impact > that has on writeback duration, then it's very similar to suddenly > having a very slow device. IOWs, fadvise(RANDOM) could be used to > switch an *inode* to write through mode rather than writeback mode > to solve the problem aggregating massive amounts of random write IO > in the page cache... I disagree here. Writeback cache is also useful for aggregating random writes and making semi-sequential writes out of them. There are quite some applications which rely on the fact that they can write a file in a rather random manner (Berkeley DB, linker, ...) but the files are written out in one large linear sweep. That is actually the reason why SLES (and I believe RHEL as well) tune dirty_limit even higher than what's the default value. So I think it's rather the other way around: If you can detect the file is being written in a streaming manner, there's not much point in caching too much data for it. And I agree with you that we also have to be careful not to cache too few because otherwise two streaming writes would be interleaved too much. Currently, we have writeback_chunk_size() which determines how much we ask to write from a single inode. So streaming writers are going to be interleaved at this chunk size anyway (currently that number is "measured bandwidth / 2"). So it would make sense to also limit amount of dirty cache for each file with streaming pattern at this number. 
> So rather than treating this as a "one size fits all" type of > problem, let's step back and: > > a) define 2-3 different caching behaviours we consider > optimal for the majority of workloads/hardware we care > about. > b) determine optimal workloads for each caching > behaviour. > c) develop reliable triggers to detect when we > should switch between caching behaviours. > > e.g: > > a) write back caching > - what we have now > write through caching > - extremely low dirty threshold before writeback > starts, enough to optimise for, say, stripe width > of the underlying storage. > > b) write back caching: > - general purpose workload > write through caching: > - slow device, write large file, sync > - extremely high bandwidth devices, multi-stream > sequential IO > - random IO. > > c) write back caching: > - default > - fadvise(NORMAL, SEQUENTIAL, WILLNEED) > write through caching: > - fadvise(NOREUSE, DONTNEED, RANDOM) > - random IO > - sequential IO, BDI write bandwidth <<< dirty threshold > - sequential IO, BDI write bandwidth >>> dirty threshold > > I think that covers most of the issues and use cases that have been > discussed in this thread. IMO, this is the level at which we need to > solve the problem (i.e. architectural), not at the level of "let's > add sysfs variables so we can tweak bdi ratios". > > Indeed, the above implies that we need the caching behaviour to be a > property of the address space, not just a property of the backing > device. Yes, and that would be interesting to implement and not make a mess out of the whole writeback logic because the way we currently do writeback is inherently BDI based. When we introduce some special per-inode limits, flusher threads would have to pick more carefully what to write and what not. We might be forced to go that way eventually anyway because of memcg aware writeback but it's not a simple step. > IOWs, the implementation needs to trickle down from a coherent high > level design - that will define the knobs that we need to expose to > userspace. We should not be adding new writeback behaviours by > adding knobs to sysfs without first having some clue about whether > we are solving the right problem and solving it in a sane manner... Agreed. But the ability to limit amount of dirty pages outstanding against a particular BDI seems as a sane one to me. It's not as flexible and automatic as the approach you suggested but it's much simpler and solves most of problems we currently have. The biggest objection against the sysfs-tunable approach is that most people won't have a clue meaning that the tunable is useless for them. But I wonder if something like: 1) turn on strictlimit by default 2) don't allow dirty cache of BDI to grow over 5s of measured writeback speed won't go a long way into solving our current problems without too much complication... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
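For reference, a coarse per-BDI control already exists in sysfs: every backing device exposes min_ratio and max_ratio, expressed as a percentage of the global dirty threshold; a sketch (the 8:16 device id and the 1% cap are examples, and how early the cap actually bites still depends on the global thresholds):

    # Backing devices are listed by major:minor
    ls /sys/class/bdi/
    cat /sys/class/bdi/8:16/max_ratio      # defaults to 100
    # Cap this device at 1% of the global dirty limit
    echo 1 > /sys/class/bdi/8:16/max_ratio
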
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-11-07 13:48 ` Jan Kara @ 2013-11-11 3:22 ` Dave Chinner 2013-11-11 19:31 ` Jan Kara 0 siblings, 1 reply; 21+ messages in thread From: Dave Chinner @ 2013-11-11 3:22 UTC (permalink / raw) To: Jan Kara Cc: Andreas Dilger, Artem S. Tashkinov, Wu Fengguang, Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote: > On Tue 05-11-13 15:12:45, Dave Chinner wrote: > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > > Something simple like “start writing at 16MB dirty on a single file” > > > would probably avoid a lot of complexity at little real-world cost. > > > That shouldn’t throttle dirtying memory above 16MB, but just start > > > writeout much earlier than it does today. > > > > That doesn't solve the "slow device, large file" problem. We can > > write data into the page cache at rates of over a GB/s, so it's > > irrelevant to a device that can write at 5MB/s whether we start > > writeback immediately or a second later when there is 500MB of dirty > > pages in memory. AFAIK, the only way to avoid that problem is to > > use write-through caching for such devices - where they throttle to > > the IO rate at very low levels of cached data. > Agreed. > > > Realistically, there is no "one right answer" for all combinations > > of applications, filesystems and hardware, but writeback caching is > > the best *general solution* we've got right now. > > > > However, IMO users should not need to care about tuning BDI dirty > > ratios or even have to understand what a BDI dirty ratio is to > > select the rigth caching method for their devices and/or workload. > > The difference between writeback and write through caching is easy > > to explain and AFAICT those two modes suffice to solve the problems > > being discussed here. Further, if two modes suffice to solve the > > problems, then we should be able to easily define a trigger to > > automatically switch modes. > > > > /me notes that if we look at random vs sequential IO and the impact > > that has on writeback duration, then it's very similar to suddenly > > having a very slow device. IOWs, fadvise(RANDOM) could be used to > > switch an *inode* to write through mode rather than writeback mode > > to solve the problem aggregating massive amounts of random write IO > > in the page cache... > I disagree here. Writeback cache is also useful for aggregating random > writes and making semi-sequential writes out of them. There are quite some > applications which rely on the fact that they can write a file in a rather > random manner (Berkeley DB, linker, ...) but the files are written out in > one large linear sweep. That is actually the reason why SLES (and I believe > RHEL as well) tune dirty_limit even higher than what's the default value. Right - but the correct behaviour really depends on the pattern of randomness. The common case we get into trouble with is when no clustering occurs and we end up with small, random IO for gigabytes of cached data. That's the case where write-through caching for random data is better. It's also questionable whether writeback caching for aggregation is faster for random IO on high-IOPS devices or not. Again, I think it woul depend very much on how random the patterns are... 
> So I think it's rather the other way around: If you can detect the file is > being written in a streaming manner, there's not much point in caching too > much data for it. But we're not talking about how much data we cache here - we are considering how much data we allow to get dirty before writing it back. It doesn't matter if we use writeback or write through caching, the page cache footprint for a given workload is likely to be similar, but without any data we can't draw any conclusions here. > And I agree with you that we also have to be careful not > to cache too few because otherwise two streaming writes would be > interleaved too much. Currently, we have writeback_chunk_size() which > determines how much we ask to write from a single inode. So streaming > writers are going to be interleaved at this chunk size anyway (currently > that number is "measured bandwidth / 2"). So it would make sense to also > limit amount of dirty cache for each file with streaming pattern at this > number. My experience says that for streaming IO we typically need at least 5s of cached *dirty* data to even out delays and latencies in the writeback IO pipeline. Hence limiting a file to what we can write in a second given we might only write a file once a second is likely going to result in pipeline stalls... Remember, writeback caching is about maximising throughput, not minimising latency. The "sync latency" problem with caching too much dirty data on slow block devices is really a corner case behaviour and should not compromise the common case for bulk writeback throughput. > > Indeed, the above implies that we need the caching behaviour to be a > > property of the address space, not just a property of the backing > > device. > Yes, and that would be interesting to implement and not make a mess out > of the whole writeback logic because the way we currently do writeback is > inherently BDI based. When we introduce some special per-inode limits, > flusher threads would have to pick more carefully what to write and what > not. We might be forced to go that way eventually anyway because of memcg > aware writeback but it's not a simple step. Agreed, it's not simple, and that's why we need to start working from the architectural level.... > > IOWs, the implementation needs to trickle down from a coherent high > > level design - that will define the knobs that we need to expose to > > userspace. We should not be adding new writeback behaviours by > > adding knobs to sysfs without first having some clue about whether > > we are solving the right problem and solving it in a sane manner... > Agreed. But the ability to limit amount of dirty pages outstanding > against a particular BDI seems as a sane one to me. It's not as flexible > and automatic as the approach you suggested but it's much simpler and > solves most of problems we currently have. That's true, but.... > The biggest objection against the sysfs-tunable approach is that most > people won't have a clue meaning that the tunable is useless for them. .... that's the big problem I see - nobody is going to know how to use it, when to use it, or be able to tell if it's the root cause of some weird performance problem they are seeing. > But I > wonder if something like: > 1) turn on strictlimit by default > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback > speed > > won't go a long way into solving our current problems without too much > complication... Turning on strict limit by default is going to change behaviour quite markedly. 
Again, it's not something I'd want to see done without a bunch of data showing that it doesn't cause regressions for common workloads... Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 21+ messages in thread
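The "measured writeback speed" both of them refer to is the per-BDI bandwidth estimate the writeback code already maintains; it can be read from debugfs, which helps when reasoning about what "N seconds of dirty data" would mean for a particular device; a sketch (the 8:0 device id is an example, and debugfs must be mounted):

    mount -t debugfs none /sys/kernel/debug 2>/dev/null   # if not already mounted
    cat /sys/kernel/debug/bdi/8:0/stats
    # Lines of interest include BdiWriteBandwidth (the running estimate, in kB/s)
    # and BdiDirtyThresh / DirtyThresh / BackgroundThresh (the current limits)
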
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-11-11 3:22 ` Dave Chinner @ 2013-11-11 19:31 ` Jan Kara 0 siblings, 0 replies; 21+ messages in thread From: Jan Kara @ 2013-11-11 19:31 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, Andreas Dilger, Artem S. Tashkinov, Wu Fengguang, Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm On Mon 11-11-13 14:22:11, Dave Chinner wrote: > On Thu, Nov 07, 2013 at 02:48:06PM +0100, Jan Kara wrote: > > On Tue 05-11-13 15:12:45, Dave Chinner wrote: > > > On Mon, Nov 04, 2013 at 05:50:13PM -0700, Andreas Dilger wrote: > > > Realistically, there is no "one right answer" for all combinations > > > of applications, filesystems and hardware, but writeback caching is > > > the best *general solution* we've got right now. > > > > > > However, IMO users should not need to care about tuning BDI dirty > > > ratios or even have to understand what a BDI dirty ratio is to > > > select the rigth caching method for their devices and/or workload. > > > The difference between writeback and write through caching is easy > > > to explain and AFAICT those two modes suffice to solve the problems > > > being discussed here. Further, if two modes suffice to solve the > > > problems, then we should be able to easily define a trigger to > > > automatically switch modes. > > > > > > /me notes that if we look at random vs sequential IO and the impact > > > that has on writeback duration, then it's very similar to suddenly > > > having a very slow device. IOWs, fadvise(RANDOM) could be used to > > > switch an *inode* to write through mode rather than writeback mode > > > to solve the problem aggregating massive amounts of random write IO > > > in the page cache... > > I disagree here. Writeback cache is also useful for aggregating random > > writes and making semi-sequential writes out of them. There are quite some > > applications which rely on the fact that they can write a file in a rather > > random manner (Berkeley DB, linker, ...) but the files are written out in > > one large linear sweep. That is actually the reason why SLES (and I believe > > RHEL as well) tune dirty_limit even higher than what's the default value. > > Right - but the correct behaviour really depends on the pattern of > randomness. The common case we get into trouble with is when no > clustering occurs and we end up with small, random IO for gigabytes > of cached data. That's the case where write-through caching for > random data is better. > > It's also questionable whether writeback caching for aggregation is > faster for random IO on high-IOPS devices or not. Again, I think it > woul depend very much on how random the patterns are... I agree usefulness of writeback caching for random IO very much depends on the working set size vs cache size, how random the accesses really are, and HW characteristics. I just wanted to point out there are fairly common workloads & setups where writeback caching for semi-random IO really helps (because you seemed to suggest that random IO implies we should disable writeback cache). > > So I think it's rather the other way around: If you can detect the file is > > being written in a streaming manner, there's not much point in caching too > > much data for it. > > But we're not talking about how much data we cache here - we are > considering how much data we allow to get dirty before writing it > back. Sorry, I was imprecise here. 
I really meant that IMO it doesn't make sense to allow too much dirty data for sequentially written files. > It doesn't matter if we use writeback or write through > caching, the page cache footprint for a given workload is likely to > be similar, but without any data we can't draw any conclusions here. > > > And I agree with you that we also have to be careful not > > to cache too few because otherwise two streaming writes would be > > interleaved too much. Currently, we have writeback_chunk_size() which > > determines how much we ask to write from a single inode. So streaming > > writers are going to be interleaved at this chunk size anyway (currently > > that number is "measured bandwidth / 2"). So it would make sense to also > > limit amount of dirty cache for each file with streaming pattern at this > > number. > > My experience says that for streaming IO we typically need at least > 5s of cached *dirty* data to even out delays and latencies in the > writeback IO pipeline. Hence limiting a file to what we can write in > a second given we might only write a file once a second is likely > going to result in pipeline stalls... I guess this begs for real data. We agree in principle but differ in constants :). > Remember, writeback caching is about maximising throughput, not > minimising latency. The "sync latency" problem with caching too much > dirty data on slow block devices is really a corner case behaviour > and should not compromise the common case for bulk writeback > throughput. Agreed. As a primary goal we want to maximise throughput. But we want to maintain sane latency as well (e.g. because we have a "promise" of "dirty_writeback_centisecs" we have to cycle through dirty inodes reasonably frequently). > > Agreed. But the ability to limit amount of dirty pages outstanding > > against a particular BDI seems as a sane one to me. It's not as flexible > > and automatic as the approach you suggested but it's much simpler and > > solves most of problems we currently have. > > That's true, but.... > > > The biggest objection against the sysfs-tunable approach is that most > > people won't have a clue meaning that the tunable is useless for them. > > .... that's the big problem I see - nobody is going to know how to > use it, when to use it, or be able to tell if it's the root cause of > some weird performance problem they are seeing. > > > But I > > wonder if something like: > > 1) turn on strictlimit by default > > 2) don't allow dirty cache of BDI to grow over 5s of measured writeback > > speed > > > > won't go a long way into solving our current problems without too much > > complication... > > Turning on strict limit by default is going to change behaviour > quite markedly. Again, it's not something I'd want to see done > without a bunch of data showing that it doesn't cause regressions > for common workloads... Agreed. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25  8:18 ` Linus Torvalds
  2013-11-05  0:50   ` Andreas Dilger
@ 2013-11-05  6:32   ` Figo.zhang
  1 sibling, 0 replies; 21+ messages in thread
From: Figo.zhang @ 2013-11-05  6:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Artem S. Tashkinov, Wu Fengguang, Andrew Morton, Linux Kernel Mailing List, linux-fsdevel, Jens Axboe, linux-mm

> Yeah, I think we default to a 10% "dirty background memory" (and
> allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB
> of dirty memory for writeout before we even start writing, and twice
> that before we start *waiting* for it.
>
> On 32-bit x86, we only count the memory in the low 1GB (really
> actually up to about 890MB), so "10% dirty" really means just about
> 90MB of buffering (and a "hard limit" of ~180MB of dirty).

=> On a 32-bit system the page cache can also use high memory, so the 10% "dirty background memory" may also come to about 1.6GB in this case.

^ permalink raw reply	[flat|nested] 21+ messages in thread
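Whether highmem is counted towards the dirty limits on the 32-bit kernel is controlled by a sysctl (only present on CONFIG_HIGHMEM kernels); a sketch to check it on the i686/PAE kernel:

    # How the PAE kernel's memory splits into lowmem and highmem
    grep -E '^(MemTotal|LowTotal|HighTotal):' /proc/meminfo
    # Whether highmem counts towards the dirty thresholds (0 = lowmem only, the default)
    cat /proc/sys/vm/highmem_is_dirtyable
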
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 7:25 Disabling in-memory write cache for x86-64 in Linux II Artem S. Tashkinov 2013-10-25 8:18 ` Linus Torvalds @ 2013-10-25 10:49 ` NeilBrown 2013-10-25 11:26 ` David Lang 1 sibling, 1 reply; 21+ messages in thread From: NeilBrown @ 2013-10-25 10:49 UTC (permalink / raw) To: Artem S. Tashkinov; +Cc: linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm [-- Attachment #1: Type: text/plain, Size: 2094 bytes --] On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov" <t.artem@lycos.com> wrote: > Hello! > > On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel > built for the i686 (with PAE) and x86-64 architectures. What's really troubling me > is that the x86-64 kernel has the following problem: > > When I copy large files to any storage device, be it my HDD with ext4 partitions > or flash drive with FAT32 partitions, the kernel first caches them in memory entirely > then flushes them some time later (quite unpredictably though) or immediately upon > invoking "sync". > > How can I disable this memory cache altogether (or at least minimize caching)? When > running the i686 kernel with the same configuration I don't observe this effect - files get > written out almost immediately (for instance "sync" takes less than a second, whereas > on x86-64 it can take a dozen of _minutes_ depending on a file size and storage > performance). What exactly is bothering you about this? The amount of memory used or the time until data is flushed? If the later, then /proc/sys/vm/dirty_expire_centisecs is where you want to look. This defaults to 30 seconds (3000 centisecs). You could make it smaller (providing you also shrink dirty_writeback_centisecs in a similar ratio) and the VM will flush out data more quickly. NeilBrown > > I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX) > - firstly this command is detrimental to the performance of my PC, secondly, it won't help > in this instance. > > Swap is totally disabled, usually my memory is entirely free. > > My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531 > > Please, advise. > > Best regards, > > Artem > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
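A concrete version of Neil's suggestion, shrinking both timers in roughly the same ratio (the 1s/0.5s values are only illustrative):

    # Defaults: the flusher wakes every 5s, data is written back once it is 30s old
    sysctl vm.dirty_writeback_centisecs vm.dirty_expire_centisecs
    # Write back anything dirty for more than 1s, checking about every 0.5s
    echo 100 > /proc/sys/vm/dirty_expire_centisecs
    echo 50  > /proc/sys/vm/dirty_writeback_centisecs
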
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 10:49 ` NeilBrown @ 2013-10-25 11:26 ` David Lang 2013-10-25 18:26 ` Artem S. Tashkinov 0 siblings, 1 reply; 21+ messages in thread From: David Lang @ 2013-10-25 11:26 UTC (permalink / raw) To: NeilBrown Cc: Artem S. Tashkinov, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm On Fri, 25 Oct 2013, NeilBrown wrote: > On Fri, 25 Oct 2013 07:25:13 +0000 (UTC) "Artem S. Tashkinov" > <t.artem@lycos.com> wrote: > >> Hello! >> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel >> built for the i686 (with PAE) and x86-64 architectures. What's really troubling me >> is that the x86-64 kernel has the following problem: >> >> When I copy large files to any storage device, be it my HDD with ext4 partitions >> or flash drive with FAT32 partitions, the kernel first caches them in memory entirely >> then flushes them some time later (quite unpredictably though) or immediately upon >> invoking "sync". >> >> How can I disable this memory cache altogether (or at least minimize caching)? When >> running the i686 kernel with the same configuration I don't observe this effect - files get >> written out almost immediately (for instance "sync" takes less than a second, whereas >> on x86-64 it can take a dozen of _minutes_ depending on a file size and storage >> performance). > > What exactly is bothering you about this? The amount of memory used or the > time until data is flushed? actually, I think the problem is more the impact of the huge write later on. David Lang > If the later, then /proc/sys/vm/dirty_expire_centisecs is where you want to > look. > This defaults to 30 seconds (3000 centisecs). > You could make it smaller (providing you also shrink > dirty_writeback_centisecs in a similar ratio) and the VM will flush out data > more quickly. > > NeilBrown > > >> >> I'm _not_ talking about disabling write cache on my storage itself (hdparm -W 0 /dev/XXX) >> - firstly this command is detrimental to the performance of my PC, secondly, it won't help >> in this instance. >> >> Swap is totally disabled, usually my memory is entirely free. >> >> My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531 >> >> Please, advise. >> >> Best regards, >> >> Artem >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> Please read the FAQ at http://www.tux.org/lkml/ > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 11:26 ` David Lang
@ 2013-10-25 18:26   ` Artem S. Tashkinov
  2013-10-25 19:40     ` Diego Calleja
  ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Artem S. Tashkinov @ 2013-10-25 18:26 UTC (permalink / raw)
  To: david; +Cc: neilb, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm

Oct 25, 2013 05:26:45 PM, david wrote:
On Fri, 25 Oct 2013, NeilBrown wrote:
>
>> What exactly is bothering you about this? The amount of memory used or the
>> time until data is flushed?
>
> actually, I think the problem is more the impact of the huge write later on.

Exactly. And not being able to use applications which show you IO performance, like Midnight Commander. You might prefer to use "cp -a", but I cannot imagine my life without being able to see the progress of a copying operation. With the current dirty cache there's no way to understand how your storage media actually behaves.

Hopefully this issue won't dissolve into obscurity, and someone will actually come up with a plan (and a patch) for making the dirty write cache behave in a sane manner, considering the fact that there are devices with very different write speeds and requirements. It'd be even better if I could specify the dirty cache size as a mount option (though sane defaults or semi-automatic values based on runtime estimates wouldn't hurt).

Per-device dirty cache seems like a nice idea. I, for one, would like to disable it altogether, or reduce it to an absolute minimum, for things like USB flash drives - because I don't care about multithreaded performance or delayed allocation on such devices. I'm interested in my data reaching my USB stick ASAP, because that's how most people use them.

Regards,

Artem

^ permalink raw reply	[flat|nested] 21+ messages in thread
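For removable FAT media in particular, mount-level options that approximate this already exist; a hedged sketch (the device and mount point are examples, and the exact flushing behaviour depends on the kernel's vfat driver):

    # "flush" asks vfat to push data out early (e.g. when files are closed), so
    # progress and safe-removal times track the stick rather than the page cache
    mount -o flush /dev/sdb1 /mnt/usb

    # "sync" is stricter (synchronous writes) but much slower and harder on cheap flash
    mount -o sync /dev/sdb1 /mnt/usb
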
* Re: Disabling in-memory write cache for x86-64 in Linux II
  2013-10-25 18:26 ` Artem S. Tashkinov
@ 2013-10-25 19:40   ` Diego Calleja
  2013-10-25 23:32     ` Fengguang Wu
  0 siblings, 1 reply; 21+ messages in thread
From: Diego Calleja @ 2013-10-25 19:40 UTC (permalink / raw)
  To: Artem S. Tashkinov
  Cc: david, neilb, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm

On Friday, 25 October 2013 18:26:23, Artem S. Tashkinov wrote:
> Oct 25, 2013 05:26:45 PM, david wrote:
> > actually, I think the problem is more the impact of the huge write later
> > on.
> Exactly. And not being able to use applications which show you IO
> performance like Midnight Commander. You might prefer to use "cp -a" but I
> cannot imagine my life without being able to see the progress of a copying
> operation. With the current dirty cache there's no way to understand how
> you storage media actually behaves.

This is a problem I have also been suffering from for a long time. It's not so much how much dirty data the system caches or when it syncs it, but how unresponsive the desktop becomes when it happens (usually with rsync + large files). Most programs become completely unresponsive, especially if they have a large memory footprint (i.e. the browser). I need to pause rsync and wait until the system writes out all dirty data if I want to do simple things like scrolling, or take any other action that uses I/O; otherwise I need to wait minutes.

I have 16 GB of RAM and, excluding the browser (which usually uses about half a GB) and KDE itself, there are no memory hogs, so it seems like something that shouldn't happen. I can understand I/O operations being laggy when there is other intensive I/O ongoing, but right now the system becomes completely unresponsive. If I am unlucky and Konsole also becomes unresponsive, I need to switch to a VT (which also takes time).

I haven't reported it before, in part because I didn't know how to: "my browser stalls" is not a very useful description, and I didn't know what kind of data I'm supposed to report.

^ permalink raw reply	[flat|nested] 21+ messages in thread
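Until the defaults change, the usual userspace workarounds for the rsync case are to cap how fast it dirties memory and to lower its IO priority; a hedged sketch (the bandwidth figure is arbitrary, and ionice only affects rsync's own IO under CFQ - background writeback by the flusher threads is not reprioritised):

    # Idle IO class plus a ~20MB/s transfer cap (rsync's --bwlimit takes KB/s)
    ionice -c3 rsync --bwlimit=20000 -a /source/ /backup/
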
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 19:40 ` Diego Calleja @ 2013-10-25 23:32 ` Fengguang Wu 2013-11-15 15:48 ` Diego Calleja 0 siblings, 1 reply; 21+ messages in thread From: Fengguang Wu @ 2013-10-25 23:32 UTC (permalink / raw) To: Diego Calleja Cc: Artem S. Tashkinov, david, neilb, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm On Fri, Oct 25, 2013 at 09:40:13PM +0200, Diego Calleja wrote: > El Viernes, 25 de octubre de 2013 18:26:23 Artem S. Tashkinov escribió: > > Oct 25, 2013 05:26:45 PM, david wrote: > > >actually, I think the problem is more the impact of the huge write later > > >on. > > Exactly. And not being able to use applications which show you IO > > performance like Midnight Commander. You might prefer to use "cp -a" but I > > cannot imagine my life without being able to see the progress of a copying > > operation. With the current dirty cache there's no way to understand how > > you storage media actually behaves. > > > This is a problem I also have been suffering for a long time. It's not so much > how much and when the systems syncs dirty data, but how unreponsive the > desktop becomes when it happens (usually, with rsync + large files). Most > programs become completely unreponsive, specially if they have a large memory > consumption (ie. the browser). I need to pause rsync and wait until the > systems writes out all dirty data if I want to do simple things like scrolling > or do any action that uses I/O, otherwise I need to wait minutes. That's a problem. And it's kind of independent of the dirty threshold -- if you are doing large file copies in the background, it will lead to continuous disk writes and stalls anyway -- the large dirty threshold merely delays the write IO time. > I have 16 GB of RAM and excluding the browser (which usually uses about half > of a GB) and KDE itself, there are no memory hogs, so it seem like it's > something that shouldn't happen. I can understand that I/O operations are > laggy when there is some other intensive I/O ongoing, but right now the system > becomes completely unreponsive. If I am unlucky and Konsole also becomes > unreponsive, I need to switch to a VT (which also takes time). > > I haven't reported it before in part because I didn't know how to do it, "my > browser stalls" is not a very useful description and I didn't know what kind > of data I'm supposed to report. What's the kernel you are running? And it's writing to a hard disk? The stalls are most likely caused by either one of 1) write IO starves read IO 2) direct page reclaim blocked when - trying to writeout PG_dirty pages - trying to lock PG_writeback pages Which may be confirmed by running ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 or echo w > /proc/sysrq-trigger # and check dmesg during the stalls. The latter command works more reliably. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
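A small sketch building on the commands above, logging only the tasks stuck in uninterruptible sleep (D state) together with their wchan once a second, which is usually enough to see whether the stalls line up with writeback (the interval and log file name are arbitrary):

    while sleep 1; do
        date
        # keep only processes whose state contains D (blocked on IO)
        ps -eo pid,stat,comm,wchan:32 | awk '$2 ~ /D/'
    done >> dstate.log
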
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 23:32 ` Fengguang Wu @ 2013-11-15 15:48 ` Diego Calleja 0 siblings, 0 replies; 21+ messages in thread From: Diego Calleja @ 2013-11-15 15:48 UTC (permalink / raw) To: Fengguang Wu Cc: Artem S. Tashkinov, david, neilb, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm El Sábado, 26 de octubre de 2013 00:32:25 Fengguang Wu escribió: > What's the kernel you are running? And it's writing to a hard disk? > The stalls are most likely caused by either one of > > 1) write IO starves read IO > 2) direct page reclaim blocked when > - trying to writeout PG_dirty pages > - trying to lock PG_writeback pages > > Which may be confirmed by running > > ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32 > or > echo w > /proc/sysrq-trigger # and check dmesg > > during the stalls. The latter command works more reliably. Sorry for the delay (background: rsync'ing large files from/to a hard disk in a desktop with 16GB of RAM makes the whole desktop unreponsive) I just triggered it today (running 3.12), and run sysrq-w: [ 5547.001505] SysRq : Show Blocked State [ 5547.001509] task PC stack pid father [ 5547.001516] btrfs-transacti D ffff880425d7a8a0 0 193 2 0x00000000 [ 5547.001519] ffff880425eede10 0000000000000002 ffff880425eedfd8 0000000000012e40 [ 5547.001521] ffff880425eedfd8 0000000000012e40 ffff880425d7a8a0 ffffea00104baa80 [ 5547.001523] ffff880425eedd90 ffff880425eedd68 ffff880425eedd70 ffffffff81080edd [ 5547.001525] Call Trace: [ 5547.001530] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001533] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.001535] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.001552] [<ffffffffa008a742>] ? btrfs_run_ordered_operations+0x212/0x2c0 [btrfs] [ 5547.001554] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001556] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.001557] [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.001559] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.001566] [<ffffffffa0072215>] btrfs_commit_transaction+0x265/0x9d0 [btrfs] [ 5547.001569] [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30 [ 5547.001575] [<ffffffffa006982d>] transaction_kthread+0x19d/0x220 [btrfs] [ 5547.001581] [<ffffffffa0069690>] ? free_fs_root+0xc0/0xc0 [btrfs] [ 5547.001583] [<ffffffff81072e70>] kthread+0xc0/0xd0 [ 5547.001585] [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120 [ 5547.001587] [<ffffffff81564bac>] ret_from_fork+0x7c/0xb0 [ 5547.001588] [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120 [ 5547.001590] systemd-journal D ffff880426e19860 0 234 1 0x00000000 [ 5547.001592] ffff880426d77d90 0000000000000002 ffff880426d77fd8 0000000000012e40 [ 5547.001593] ffff880426d77fd8 0000000000012e40 ffff880426e19860 ffffffff8155d7cd [ 5547.001595] 0000000000000001 0000000000000001 0000000000000000 ffffffff81572560 [ 5547.001596] Call Trace: [ 5547.001598] [<ffffffff8155d7cd>] ? retint_restore_args+0xe/0xe [ 5547.001601] [<ffffffff8122b47b>] ? queue_unplugged+0x3b/0xe0 [ 5547.001602] [<ffffffff8122da9b>] ? blk_flush_plug_list+0x1eb/0x230 [ 5547.001604] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.001606] [<ffffffff8155bb88>] schedule_preempt_disabled+0x18/0x30 [ 5547.001607] [<ffffffff8155a2f4>] __mutex_lock_slowpath+0x124/0x1f0 [ 5547.001613] [<ffffffffa0071c9b>] ? 
btrfs_write_marked_extents+0xbb/0xe0 [btrfs] [ 5547.001615] [<ffffffff8155a3d7>] mutex_lock+0x17/0x30 [ 5547.001623] [<ffffffffa00ae06a>] btrfs_sync_log+0x22a/0x690 [btrfs] [ 5547.001630] [<ffffffffa0082f47>] btrfs_sync_file+0x287/0x2e0 [btrfs] [ 5547.001632] [<ffffffff811abb96>] do_fsync+0x56/0x80 [ 5547.001634] [<ffffffff811abe20>] SyS_fsync+0x10/0x20 [ 5547.001635] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.001644] mysqld D ffff8803f0901860 0 643 579 0x00000000 [ 5547.001645] ffff8803f090de18 0000000000000002 ffff8803f090dfd8 0000000000012e40 [ 5547.001647] ffff8803f090dfd8 0000000000012e40 ffff8803f0901860 ffff88016d038000 [ 5547.001648] ffff880426908d00 0000000024119d80 0000000000000000 0000000000000000 [ 5547.001650] Call Trace: [ 5547.001657] [<ffffffffa0074d14>] ? btrfs_submit_bio_hook+0x84/0x1f0 [btrfs] [ 5547.001659] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001660] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.001662] [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.001663] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.001669] [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.001671] [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30 [ 5547.001677] [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs] [ 5547.001680] [<ffffffff8112632e>] ? do_writepages+0x1e/0x40 [ 5547.001686] [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.001693] [<ffffffffa0082e3f>] btrfs_sync_file+0x17f/0x2e0 [btrfs] [ 5547.001694] [<ffffffff811abb96>] do_fsync+0x56/0x80 [ 5547.001696] [<ffffffff811abe43>] SyS_fdatasync+0x13/0x20 [ 5547.001697] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.001701] virtuoso-t D ffff88000310b0c0 0 617 609 0x00000000 [ 5547.001702] ffff8803f4867c20 0000000000000002 ffff8803f4867fd8 0000000000012e40 [ 5547.001704] ffff8803f4867fd8 0000000000012e40 ffff88000310b0c0 ffffffff813ce4af [ 5547.001705] ffffffff81860520 ffff8802d8ad8a00 ffff8803f4867ba0 ffffffff81231a0e [ 5547.001707] Call Trace: [ 5547.001709] [<ffffffff813ce4af>] ? scsi_pool_alloc_command+0x3f/0x80 [ 5547.001712] [<ffffffff81231a0e>] ? __blk_segment_map_sg+0x4e/0x120 [ 5547.001713] [<ffffffff81231b6b>] ? blk_rq_map_sg+0x8b/0x1f0 [ 5547.001716] [<ffffffff812481da>] ? cfq_dispatch_requests+0xba/0xc40 [ 5547.001718] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001721] [<ffffffff81119d70>] ? filemap_fdatawait+0x30/0x30 [ 5547.001722] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.001723] [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0 [ 5547.001725] [<ffffffff81119d7e>] sleep_on_page+0xe/0x20 [ 5547.001727] [<ffffffff81559142>] __wait_on_bit+0x62/0x90 [ 5547.001728] [<ffffffff81119b2f>] wait_on_page_bit+0x7f/0x90 [ 5547.001730] [<ffffffff81073da0>] ? wake_atomic_t_function+0x40/0x40 [ 5547.001732] [<ffffffff81119cbb>] filemap_fdatawait_range+0x11b/0x1a0 [ 5547.001734] [<ffffffff8155cc86>] ? 
_raw_spin_unlock+0x16/0x40 [ 5547.001740] [<ffffffffa0071d47>] btrfs_wait_marked_extents+0x87/0xe0 [btrfs] [ 5547.001747] [<ffffffffa00ae328>] btrfs_sync_log+0x4e8/0x690 [btrfs] [ 5547.001754] [<ffffffffa0082f47>] btrfs_sync_file+0x287/0x2e0 [btrfs] [ 5547.001756] [<ffffffff811abb96>] do_fsync+0x56/0x80 [ 5547.001758] [<ffffffff811abe20>] SyS_fsync+0x10/0x20 [ 5547.001759] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.001761] pool D ffff88040db1c100 0 657 477 0x00000000 [ 5547.001763] ffff8803ee809ba0 0000000000000002 ffff8803ee809fd8 0000000000012e40 [ 5547.001764] ffff8803ee809fd8 0000000000012e40 ffff88040db1c100 0000000000000004 [ 5547.001766] ffff8803ee809ae8 ffffffff8155cc86 ffff8803ee809bd0 ffffffffa005ada4 [ 5547.001767] Call Trace: [ 5547.001769] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.001775] [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs] [ 5547.001776] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001778] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.001779] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001781] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.001783] [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.001784] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.001790] [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.001792] [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30 [ 5547.001798] [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs] [ 5547.001804] [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.001810] [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs] [ 5547.001813] [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30 [ 5547.001815] [<ffffffff81189634>] vfs_create+0xb4/0x120 [ 5547.001817] [<ffffffff8118bcd4>] do_last+0x904/0xea0 [ 5547.001818] [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930 [ 5547.001820] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001822] [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20 [ 5547.001824] [<ffffffff8118c32b>] path_openat+0xbb/0x6b0 [ 5547.001827] [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100 [ 5547.001829] [<ffffffff81085782>] ? account_system_time+0xa2/0x180 [ 5547.001831] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001833] [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90 [ 5547.001834] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.001836] [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130 [ 5547.001839] [<ffffffff8117ce89>] do_sys_open+0x129/0x220 [ 5547.001842] [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230 [ 5547.001844] [<ffffffff8117cf9e>] SyS_open+0x1e/0x20 [ 5547.001845] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.001850] akregator D ffff8803ed1d4100 0 875 1 0x00000000 [ 5547.001851] ffff8803c7f1bba0 0000000000000002 ffff8803c7f1bfd8 0000000000012e40 [ 5547.001853] ffff8803c7f1bfd8 0000000000012e40 ffff8803ed1d4100 0000000000000004 [ 5547.001854] ffff8803c7f1bae8 ffffffff8155cc86 ffff8803c7f1bbd0 ffffffffa005ada4 [ 5547.001856] Call Trace: [ 5547.001858] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.001863] [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs] [ 5547.001865] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001866] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.001868] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001870] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.001871] [<ffffffff8155d006>] ? 
_raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.001873] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.001879] [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.001881] [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30 [ 5547.001886] [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs] [ 5547.001888] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001894] [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.001900] [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs] [ 5547.001902] [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30 [ 5547.001904] [<ffffffff81189634>] vfs_create+0xb4/0x120 [ 5547.001906] [<ffffffff8118bcd4>] do_last+0x904/0xea0 [ 5547.001907] [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930 [ 5547.001909] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001911] [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20 [ 5547.001912] [<ffffffff8118c32b>] path_openat+0xbb/0x6b0 [ 5547.001914] [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100 [ 5547.001916] [<ffffffff81085782>] ? account_system_time+0xa2/0x180 [ 5547.001918] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001920] [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90 [ 5547.001921] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.001923] [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130 [ 5547.001925] [<ffffffff8117ce89>] do_sys_open+0x129/0x220 [ 5547.001927] [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230 [ 5547.001928] [<ffffffff8117cf9e>] SyS_open+0x1e/0x20 [ 5547.001930] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.001931] mpegaudioparse3 D ffff880341d10820 0 5917 1 0x00000000 [ 5547.001933] ffff88030f779ce0 0000000000000002 ffff88030f779fd8 0000000000012e40 [ 5547.001934] ffff88030f779fd8 0000000000012e40 ffff880341d10820 ffffffff81122a28 [ 5547.001936] ffff88043e5ddc00 ffff880400000002 ffff88043e2138d0 0000000000000000 [ 5547.001938] Call Trace: [ 5547.001939] [<ffffffff81122a28>] ? __alloc_pages_nodemask+0x158/0xb00 [ 5547.001941] [<ffffffff8102af55>] ? native_send_call_func_single_ipi+0x35/0x40 [ 5547.001943] [<ffffffff810b31a8>] ? generic_exec_single+0x98/0xa0 [ 5547.001945] [<ffffffff81086a18>] ? __enqueue_entity+0x78/0x80 [ 5547.001947] [<ffffffff8108a837>] ? enqueue_entity+0x197/0x780 [ 5547.001948] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001950] [<ffffffff81119d90>] ? sleep_on_page+0x20/0x20 [ 5547.001951] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.001953] [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0 [ 5547.001954] [<ffffffff81119d9e>] sleep_on_page_killable+0xe/0x40 [ 5547.001956] [<ffffffff8155925d>] __wait_on_bit_lock+0x5d/0xc0 [ 5547.001958] [<ffffffff81119f2a>] __lock_page_killable+0x6a/0x70 [ 5547.001960] [<ffffffff81073da0>] ? wake_atomic_t_function+0x40/0x40 [ 5547.001961] [<ffffffff8111b9e5>] generic_file_aio_read+0x435/0x700 [ 5547.001963] [<ffffffff8117d2ba>] do_sync_read+0x5a/0x90 [ 5547.001965] [<ffffffff8117d85a>] vfs_read+0x9a/0x170 [ 5547.001967] [<ffffffff8117e039>] SyS_read+0x49/0xa0 [ 5547.001968] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.001970] mozStorage #2 D ffff8803b7aa1860 0 920 477 0x00000000 [ 5547.001972] ffff8803b1473d80 0000000000000002 ffff8803b1473fd8 0000000000012e40 [ 5547.001974] ffff8803b1473fd8 0000000000012e40 ffff8803b7aa1860 0000000000000004 [ 5547.001975] ffff8803b1473cc8 ffffffff8155cc86 ffff8803b1473db0 ffffffffa005ada4 [ 5547.001977] Call Trace: [ 5547.001978] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.001984] [<ffffffffa005ada4>] ? 
reserve_metadata_bytes+0x184/0x930 [btrfs] [ 5547.001990] [<ffffffffa0084729>] ? __btrfs_buffered_write+0x3d9/0x490 [btrfs] [ 5547.001992] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.001994] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.001995] [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.001997] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.002003] [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.002004] [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30 [ 5547.002010] [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs] [ 5547.002016] [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.002023] [<ffffffffa007c8a1>] btrfs_setattr+0x101/0x290 [btrfs] [ 5547.002025] [<ffffffff810d675c>] ? rcu_eqs_enter+0x5c/0xa0 [ 5547.002027] [<ffffffff81198a6c>] notify_change+0x1dc/0x360 [ 5547.002029] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.002030] [<ffffffff8117bdcb>] do_truncate+0x6b/0xa0 [ 5547.002032] [<ffffffff8117f8b9>] ? __sb_start_write+0x49/0x100 [ 5547.002033] [<ffffffff8117c12b>] SyS_ftruncate+0x10b/0x160 [ 5547.002035] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.002036] Cache I/O D ffff8803b7aa28a0 0 922 477 0x00000000 [ 5547.002038] ffff8803b1495e18 0000000000000002 ffff8803b1495fd8 0000000000012e40 [ 5547.002039] ffff8803b1495fd8 0000000000012e40 ffff8803b7aa28a0 ffff8803b1495e08 [ 5547.002041] ffff8803b1495db0 ffffffff8111a25a ffff8803b1495e40 ffff8803b1495df0 [ 5547.002043] Call Trace: [ 5547.002045] [<ffffffff8111a25a>] ? find_get_pages_tag+0xea/0x180 [ 5547.002047] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002048] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.002050] [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.002051] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.002057] [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.002059] [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30 [ 5547.002065] [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs] [ 5547.002071] [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.002077] [<ffffffffa0082e3f>] btrfs_sync_file+0x17f/0x2e0 [btrfs] [ 5547.002079] [<ffffffff811abb96>] do_fsync+0x56/0x80 [ 5547.002080] [<ffffffff811abe20>] SyS_fsync+0x10/0x20 [ 5547.002081] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.002083] mozStorage #6 D ffff8803c0cfa8a0 0 982 477 0x00000000 [ 5547.002085] ffff8803a10f5ba0 0000000000000002 ffff8803a10f5fd8 0000000000012e40 [ 5547.002086] ffff8803a10f5fd8 0000000000012e40 ffff8803c0cfa8a0 0000000000000004 [ 5547.002088] ffff8803a10f5ae8 ffffffff8155cc86 ffff8803a10f5bd0 ffffffffa005ada4 [ 5547.002089] Call Trace: [ 5547.002091] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.002096] [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs] [ 5547.002098] [<ffffffff8102b067>] ? native_smp_send_reschedule+0x47/0x60 [ 5547.002100] [<ffffffff8107f7bc>] ? resched_task+0x5c/0x60 [ 5547.002101] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002103] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.002104] [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.002106] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.002112] [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.002113] [<ffffffff81073d20>] ? 
wake_up_atomic_t+0x30/0x30 [ 5547.002119] [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs] [ 5547.002125] [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.002131] [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs] [ 5547.002133] [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30 [ 5547.002134] [<ffffffff81189634>] vfs_create+0xb4/0x120 [ 5547.002136] [<ffffffff8118bcd4>] do_last+0x904/0xea0 [ 5547.002138] [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930 [ 5547.002139] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002141] [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20 [ 5547.002143] [<ffffffff8118c32b>] path_openat+0xbb/0x6b0 [ 5547.002145] [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100 [ 5547.002147] [<ffffffff81085782>] ? account_system_time+0xa2/0x180 [ 5547.002148] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002150] [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90 [ 5547.002152] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.002153] [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130 [ 5547.002155] [<ffffffff8117ce89>] do_sys_open+0x129/0x220 [ 5547.002157] [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230 [ 5547.002159] [<ffffffff8117cf9e>] SyS_open+0x1e/0x20 [ 5547.002160] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.002164] rsync D ffff8802dcde0820 0 5803 5802 0x00000000 [ 5547.002165] ffff8802daeb1a90 0000000000000002 ffff8802daeb1fd8 0000000000012e40 [ 5547.002167] ffff8802daeb1fd8 0000000000012e40 ffff8802dcde0820 ffff880100000002 [ 5547.002169] ffff8802daeb19e0 ffffffff81080edd ffff880308b337e0 0000000000000000 [ 5547.002170] Call Trace: [ 5547.002172] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002173] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002175] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.002177] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002178] [<ffffffff81560e8d>] ? add_preempt_count+0x3d/0x40 [ 5547.002180] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002181] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.002182] [<ffffffff81558f6a>] schedule_timeout+0x11a/0x230 [ 5547.002185] [<ffffffff8105e0c0>] ? detach_if_pending+0x120/0x120 [ 5547.002187] [<ffffffff810a5078>] ? ktime_get_ts+0x48/0xe0 [ 5547.002189] [<ffffffff8155bd2b>] io_schedule_timeout+0x9b/0xf0 [ 5547.002191] [<ffffffff811259a9>] balance_dirty_pages_ratelimited+0x3d9/0xa10 [ 5547.002198] [<ffffffffa0c9ad84>] ? ext4_dirty_inode+0x54/0x60 [ext4] [ 5547.002200] [<ffffffff8111a8c8>] generic_file_buffered_write+0x1b8/0x290 [ 5547.002202] [<ffffffff8111bfd9>] __generic_file_aio_write+0x1a9/0x3b0 [ 5547.002203] [<ffffffff8111c238>] generic_file_aio_write+0x58/0xa0 [ 5547.002208] [<ffffffffa0c8ef79>] ext4_file_write+0x99/0x3e0 [ext4] [ 5547.002210] [<ffffffff810ddaac>] ? acct_account_cputime+0x1c/0x20 [ 5547.002212] [<ffffffff81085782>] ? account_system_time+0xa2/0x180 [ 5547.002213] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002215] [<ffffffff81080edd>] ? 
get_parent_ip+0xd/0x50 [ 5547.002216] [<ffffffff8117d34a>] do_sync_write+0x5a/0x90 [ 5547.002218] [<ffffffff8117d9ed>] vfs_write+0xbd/0x1e0 [ 5547.002220] [<ffffffff8117e0d9>] SyS_write+0x49/0xa0 [ 5547.002221] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.002223] ktorrent D ffff8802e7680820 0 5806 1 0x00000000 [ 5547.002224] ffff8802daf7fba0 0000000000000002 ffff8802daf7ffd8 0000000000012e40 [ 5547.002226] ffff8802daf7ffd8 0000000000012e40 ffff8802e7680820 0000000000000004 [ 5547.002227] ffff8802daf7fae8 ffffffff8155cc86 ffff8802daf7fbd0 ffffffffa005ada4 [ 5547.002229] Call Trace: [ 5547.002230] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.002236] [<ffffffffa005ada4>] ? reserve_metadata_bytes+0x184/0x930 [btrfs] [ 5547.002241] [<ffffffffa004ae49>] ? btrfs_set_path_blocking+0x39/0x80 [btrfs] [ 5547.002246] [<ffffffffa004fe78>] ? btrfs_search_slot+0x498/0x970 [btrfs] [ 5547.002247] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002249] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.002251] [<ffffffff8155d006>] ? _raw_spin_unlock_irqrestore+0x26/0x60 [ 5547.002252] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.002258] [<ffffffffa007170f>] wait_current_trans.isra.17+0xbf/0x120 [btrfs] [ 5547.002260] [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30 [ 5547.002266] [<ffffffffa0072cff>] start_transaction+0x37f/0x570 [btrfs] [ 5547.002268] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.002273] [<ffffffffa0072f0b>] btrfs_start_transaction+0x1b/0x20 [btrfs] [ 5547.002280] [<ffffffffa0080b8b>] btrfs_create+0x3b/0x200 [btrfs] [ 5547.002281] [<ffffffff8120ce3c>] ? security_inode_permission+0x1c/0x30 [ 5547.002283] [<ffffffff81189634>] vfs_create+0xb4/0x120 [ 5547.002285] [<ffffffff8118bcd4>] do_last+0x904/0xea0 [ 5547.002287] [<ffffffff81188cc0>] ? link_path_walk+0x70/0x930 [ 5547.002288] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002290] [<ffffffff8120d0e6>] ? security_file_alloc+0x16/0x20 [ 5547.002292] [<ffffffff8118c32b>] path_openat+0xbb/0x6b0 [ 5547.002293] [<ffffffff810dd64f>] ? __acct_update_integrals+0x7f/0x100 [ 5547.002295] [<ffffffff81085782>] ? account_system_time+0xa2/0x180 [ 5547.002297] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002299] [<ffffffff8118d7ca>] do_filp_open+0x3a/0x90 [ 5547.002300] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.002302] [<ffffffff81199e47>] ? __alloc_fd+0xa7/0x130 [ 5547.002304] [<ffffffff8117ce89>] do_sys_open+0x129/0x220 [ 5547.002306] [<ffffffff8100e795>] ? syscall_trace_enter+0x135/0x230 [ 5547.002307] [<ffffffff8117cf9e>] SyS_open+0x1e/0x20 [ 5547.002309] [<ffffffff81564e5f>] tracesys+0xdd/0xe2 [ 5547.002311] kworker/u16:0 D ffff88035c5ac920 0 6043 2 0x00000000 [ 5547.002313] Workqueue: writeback bdi_writeback_workfn (flush-8:32) [ 5547.002315] ffff88036c9cb898 0000000000000002 ffff88036c9cbfd8 0000000000012e40 [ 5547.002316] ffff88036c9cbfd8 0000000000012e40 ffff88035c5ac920 ffff8804281de048 [ 5547.002318] ffff88036c9cb7e8 ffffffff81080edd 0000000000000001 ffff88036c9cb800 [ 5547.002319] Call Trace: [ 5547.002321] [<ffffffff81080edd>] ? get_parent_ip+0xd/0x50 [ 5547.002323] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.002324] [<ffffffff8155cc86>] ? _raw_spin_unlock+0x16/0x40 [ 5547.002326] [<ffffffff8122b47b>] ? queue_unplugged+0x3b/0xe0 [ 5547.002328] [<ffffffff8155b719>] schedule+0x29/0x70 [ 5547.002329] [<ffffffff8155b9bf>] io_schedule+0x8f/0xe0 [ 5547.002331] [<ffffffff8122b8aa>] get_request+0x1aa/0x780 [ 5547.002332] [<ffffffff8123099e>] ? 
ioc_lookup_icq+0x4e/0x80 [ 5547.002334] [<ffffffff81073d20>] ? wake_up_atomic_t+0x30/0x30 [ 5547.002336] [<ffffffff8122db58>] blk_queue_bio+0x78/0x3e0 [ 5547.002337] [<ffffffff8122c5c2>] generic_make_request+0xc2/0x110 [ 5547.002338] [<ffffffff8122c683>] submit_bio+0x73/0x160 [ 5547.002344] [<ffffffffa0c9bae5>] ext4_io_submit+0x25/0x50 [ext4] [ 5547.002348] [<ffffffffa0c981d3>] ext4_writepages+0x823/0xe00 [ext4] [ 5547.002350] [<ffffffff8112632e>] do_writepages+0x1e/0x40 [ 5547.002352] [<ffffffff811a6340>] __writeback_single_inode+0x40/0x330 [ 5547.002353] [<ffffffff811a7392>] writeback_sb_inodes+0x262/0x450 [ 5547.002355] [<ffffffff811a761f>] __writeback_inodes_wb+0x9f/0xd0 [ 5547.002357] [<ffffffff811a797b>] wb_writeback+0x32b/0x360 [ 5547.002358] [<ffffffff811a8111>] bdi_writeback_workfn+0x221/0x510 [ 5547.002361] [<ffffffff8106b917>] process_one_work+0x167/0x450 [ 5547.002362] [<ffffffff8106c6a1>] worker_thread+0x121/0x3a0 [ 5547.002364] [<ffffffff81560ed9>] ? sub_preempt_count+0x49/0x50 [ 5547.002366] [<ffffffff8106c580>] ? manage_workers.isra.25+0x2a0/0x2a0 [ 5547.002367] [<ffffffff81072e70>] kthread+0xc0/0xd0 [ 5547.002369] [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120 [ 5547.002371] [<ffffffff81564bac>] ret_from_fork+0x7c/0xb0 [ 5547.002372] [<ffffffff81072db0>] ? kthread_create_on_node+0x120/0x120 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 18:26 ` Artem S. Tashkinov 2013-10-25 19:40 ` Diego Calleja @ 2013-10-25 20:43 ` NeilBrown 2013-10-25 21:03 ` Artem S. Tashkinov 2013-10-29 20:49 ` Jan Kara 2 siblings, 1 reply; 21+ messages in thread From: NeilBrown @ 2013-10-25 20:43 UTC (permalink / raw) To: Artem S. Tashkinov Cc: david, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm [-- Attachment #1: Type: text/plain, Size: 1836 bytes --] On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" <t.artem@lycos.com> wrote: > Oct 25, 2013 05:26:45 PM, david wrote: > On Fri, 25 Oct 2013, NeilBrown wrote: > > > >> > >> What exactly is bothering you about this? The amount of memory used or the > >> time until data is flushed? > > > >actually, I think the problem is more the impact of the huge write later on. > > Exactly. And not being able to use applications which show you IO performance > like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine > my life without being able to see the progress of a copying operation. With the current > dirty cache there's no way to understand how you storage media actually behaves. So fix Midnight Commander. If you want the copy to be actually finished when it says it is finished, then it needs to call 'fsync()' at the end. > > Hopefully this issue won't dissolve into obscurity and someone will actually make > up a plan (and a patch) how to make dirty write cache behave in a sane manner > considering the fact that there are devices with very different write speeds and > requirements. It'd be ever better, if I could specify dirty cache as a mount option > (though sane defaults or semi-automatic values based on runtime estimates > won't hurt). > > Per device dirty cache seems like a nice idea, I, for one, would like to disable it > altogether or make it an absolute minimum for things like USB flash drives - because > I don't care about multithreaded performance or delayed allocation on such devices - > I'm interested in my data reaching my USB stick ASAP - because it's how most people > use them. > As has already been said, you can substantially disable the cache by tuning down various values in /proc/sys/vm/. Have you tried? NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
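For illustration only, here is a minimal sketch of what "call 'fsync()' at the end" means for a copy tool: completion is reported only after fsync() on the destination returns, so "finished" means the data has actually left the page cache. The function name and 64 KB buffer are arbitrary choices for the example, not anything from mc or this thread.

```c
/* Sketch: a copy that only claims success after fsync().
 * Illustrative only; error handling is abbreviated and short
 * writes are treated as failures. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int copy_and_sync(const char *src, const char *dst)
{
	char buf[64 * 1024];
	ssize_t n;
	int in = open(src, O_RDONLY);
	int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (in < 0 || out < 0)
		return -1;

	while ((n = read(in, buf, sizeof(buf))) > 0)
		if (write(out, buf, n) != n)
			return -1;

	/* Without this, "done" only means "sitting in the page cache". */
	if (fsync(out) != 0)
		return -1;

	close(in);
	close(out);
	return 0;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
		return 1;
	}
	return copy_and_sync(argv[1], argv[2]) ? 1 : 0;
}
```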
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 20:43 ` NeilBrown @ 2013-10-25 21:03 ` Artem S. Tashkinov 2013-10-25 22:11 ` NeilBrown 0 siblings, 1 reply; 21+ messages in thread From: Artem S. Tashkinov @ 2013-10-25 21:03 UTC (permalink / raw) To: neilb; +Cc: david, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm Oct 26, 2013 02:44:07 AM, neil wrote: On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" >> >> Exactly. And not being able to use applications which show you IO performance >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine >> my life without being able to see the progress of a copying operation. With the current >> dirty cache there's no way to understand how you storage media actually behaves. > >So fix Midnight Commander. If you want the copy to be actually finished when >it says it is finished, then it needs to call 'fsync()' at the end. This sounds like a very bad joke. How are applications supposed to show and calculate an _average_ write speed if there are no kernel calls/ioctls to actually make the kernel flush dirty buffers _during_ copying? Admittedly, that would be a good way to solve this problem in user space - alas, even if such calls are implemented, user space will only start using them in 2018, if not later. >> >> Per device dirty cache seems like a nice idea, I, for one, would like to disable it >> altogether or make it an absolute minimum for things like USB flash drives - because >> I don't care about multithreaded performance or delayed allocation on such devices - >> I'm interested in my data reaching my USB stick ASAP - because it's how most people >> use them. >> > >As has already been said, you can substantially disable the cache by tuning >down various values in /proc/sys/vm/. >Have you tried? I don't understand who you are replying to. I asked about per-device settings, yet you are again referring me to system-wide settings - they don't look that good if we're talking about a 3MB/sec flash drive and a 500MB/sec SSD. Besides, it makes no sense to allocate 20% of physical RAM for things which don't belong in it in the first place. I don't know of any other OS which behaves this way. And like people (including me) have already mentioned, such a huge dirty cache can stall their PCs/servers for a considerable amount of time. Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also, not everyone in this world has a UPS - which means such a huge buffer can lead to serious data loss in case of a power blackout. Regards, Artem -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 21:03 ` Artem S. Tashkinov @ 2013-10-25 22:11 ` NeilBrown 2013-11-05 1:40 ` Figo.zhang 0 siblings, 1 reply; 21+ messages in thread From: NeilBrown @ 2013-10-25 22:11 UTC (permalink / raw) To: Artem S. Tashkinov Cc: david, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm [-- Attachment #1: Type: text/plain, Size: 4860 bytes --] On Fri, 25 Oct 2013 21:03:44 +0000 (UTC) "Artem S. Tashkinov" <t.artem@lycos.com> wrote: > Oct 26, 2013 02:44:07 AM, neil wrote: > On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" > >> > >> Exactly. And not being able to use applications which show you IO performance > >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine > >> my life without being able to see the progress of a copying operation. With the current > >> dirty cache there's no way to understand how you storage media actually behaves. > > > >So fix Midnight Commander. If you want the copy to be actually finished when > >it says it is finished, then it needs to call 'fsync()' at the end. > > This sounds like a very bad joke. How applications are supposed to show and > calculate an _average_ write speed if there are no kernel calls/ioctls to actually > make the kernel flush dirty buffers _during_ copying? Actually it's a good way to > solve this problem in user space - alas, even if such calls are implemented, user > space will start using them only in 2018 if not further from that. But there is a way to flush dirty buffers *during* copies. man 2 sync_file_range If giving precise feedback is of paramount importance to you, then this would be the interface to use. > > >> > >> Per device dirty cache seems like a nice idea, I, for one, would like to disable it > >> altogether or make it an absolute minimum for things like USB flash drives - because > >> I don't care about multithreaded performance or delayed allocation on such devices - > >> I'm interested in my data reaching my USB stick ASAP - because it's how most people > >> use them. > >> > > > >As has already been said, you can substantially disable the cache by tuning > >down various values in /proc/sys/vm/. > >Have you tried? > > I don't understand who you are replying to. I asked about per device settings, you are > again referring me to system wide settings - they don't look that good if we're talking > about a 3MB/sec flash drive and 500MB/sec SSD drive. Besides it makes no sense > to allocate 20% of physical RAM for things which don't belong to it in the first place. Sorry, missed the per-device bit. You could try playing with /sys/class/bdi/XX:YY/max_ratio where XX:YY is the major/minor number of the device, so 8:0 for /dev/sda. Wind it right down for slow devices and you might get something like what you want. > > I don't know any other OS which has a similar behaviour. I don't know about the internal details of any other OS, so I cannot really comment. > > And like people (including me) have already mentioned, such a huge dirty cache can > stall their PCs/servers for a considerable amount of time. Yes. But this is a different issue. There are two very different issues that should be kept separate. One is that when "cp" or similar completes, the data hasn't all been written out yet. It typically takes another 30 seconds before the flush will complete. You seemed to primarily complain about this, so that is what I originally addressed. That is where the "dirty_*_centisecs" values apply.
The other, quite separate, issue is that Linux will cache more dirty data than it can write out in a reasonable time. All the tuning parameters refer to the amount of data (whether as a percentage of RAM or as a number of bytes), but what people really care about is a number of seconds. As you might imagine, estimating how long it will take to write out a certain amount of data is highly non-trivial. The relationship between megabytes and seconds can be non-linear and can change over time. Caching nothing at all can hurt a lot of workloads. Caching too much can obviously hurt too. Caching "5 seconds" worth of data would be ideal, but would be incredibly difficult to implement. Keeping a sliding estimate of throughput for each device, and using that to automatically adjust the "max_ratio" value (or some related internal thing), might be a 70% solution. Certainly it would be an interesting project for someone. > > Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also > not everyone in this world has an UPS - which means such a huge buffer can lead to a > serious data loss in case of a power blackout. I don't have a desk (just a lap), but I use Linux on all my computers and I've never really noticed the problem. Maybe I'm just very patient, or maybe I don't work with large data sets and slow devices. However, I don't think data loss is really a related issue. Any process that cares about data safety *must* use fsync at appropriate places. This has always been true. NeilBrown > > Regards, > > Artem [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
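A sketch of the sync_file_range() approach described above, so the displayed progress and MB/s track the device rather than the page cache. This is illustrative only: the 8 MB window is an arbitrary choice, and a real tool would likely overlap writeback of one chunk with reading the next instead of waiting synchronously.

```c
/* Sketch: copy with a bounded dirty window and honest progress.
 * Each chunk is pushed to the device and waited for before it is
 * counted, so the dirty cache never hides the true write speed.
 * Assumes Linux/glibc (sync_file_range is Linux-specific). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)	/* flush window: 8 MB, arbitrary */

int main(int argc, char **argv)
{
	static char buf[CHUNK];
	off_t done = 0;
	ssize_t n;
	int in, out;

	if (argc != 3) {
		fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
		return 1;
	}
	in = open(argv[1], O_RDONLY);
	out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (in < 0 || out < 0) {
		perror("open");
		return 1;
	}

	while ((n = read(in, buf, sizeof(buf))) > 0) {
		if (write(out, buf, n) != n) {
			perror("write");
			return 1;
		}
		/* Start writeback of this chunk and wait for it to reach
		 * the device before reporting it as progress. */
		if (sync_file_range(out, done, n,
				    SYNC_FILE_RANGE_WAIT_BEFORE |
				    SYNC_FILE_RANGE_WRITE |
				    SYNC_FILE_RANGE_WAIT_AFTER) != 0)
			perror("sync_file_range");
		done += n;
		fprintf(stderr, "\r%lld MB on device", (long long)(done >> 20));
	}
	fprintf(stderr, "\n");

	fsync(out);	/* sync_file_range does not cover metadata */
	close(in);
	close(out);
	return 0;
}
```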
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 22:11 ` NeilBrown @ 2013-11-05 1:40 ` Figo.zhang 2013-11-05 1:47 ` David Lang 2013-11-05 2:08 ` NeilBrown 0 siblings, 2 replies; 21+ messages in thread From: Figo.zhang @ 2013-11-05 1:40 UTC (permalink / raw) To: NeilBrown Cc: Artem S. Tashkinov, david, lkml, Linus Torvalds, linux-fsdevel, axboe, Linux-MM [-- Attachment #1: Type: text/plain, Size: 882 bytes --] > > > > Of course, if you don't use Linux on the desktop you don't really care - > well, I do. Also > > not everyone in this world has an UPS - which means such a huge buffer > can lead to a > > serious data loss in case of a power blackout. > > I don't have a desk (just a lap), but I use Linux on all my computers and > I've never really noticed the problem. Maybe I'm just very patient, or > maybe > I don't work with large data sets and slow devices. > > However I don't think data-loss is really a related issue. Any process > that > cares about data safety *must* use fsync at appropriate places. This has > always been true. > > => May I ask a question: on a filesystem like ext4, when an application modifies files it creates dirty data. If a power blackout hits while the metadata is being written to the journal, will data be lost and the file end up damaged? [-- Attachment #2: Type: text/html, Size: 1204 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-11-05 1:40 ` Figo.zhang @ 2013-11-05 1:47 ` David Lang 2013-11-05 2:08 ` NeilBrown 1 sibling, 0 replies; 21+ messages in thread From: David Lang @ 2013-11-05 1:47 UTC (permalink / raw) To: Figo.zhang Cc: NeilBrown, Artem S. Tashkinov, lkml, Linus Torvalds, linux-fsdevel, axboe, Linux-MM On Tue, 5 Nov 2013, Figo.zhang wrote: >>> >>> Of course, if you don't use Linux on the desktop you don't really care - >> well, I do. Also >>> not everyone in this world has an UPS - which means such a huge buffer >> can lead to a >>> serious data loss in case of a power blackout. >> >> I don't have a desk (just a lap), but I use Linux on all my computers and >> I've never really noticed the problem. Maybe I'm just very patient, or >> maybe >> I don't work with large data sets and slow devices. >> >> However I don't think data-loss is really a related issue. Any process >> that >> cares about data safety *must* use fsync at appropriate places. This has >> always been true. >> >> =>May i ask question that, some like ext4 filesystem, if some app motify > the files, it create some dirty data. if some meta-data writing to the > journal disk when a power backout, > it will be lose some serious data and the the file will damage? > With any filesystem and any OS, if you create dirty data but do not f*sync() the data, there is a possibility that the system can go down between the time the application creates the dirty data and the time the OS actually gets it on disk. If the system goes down in this timeframe, the data will be lost, and the file may be corrupted if only some of the data got written. David Lang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-11-05 1:40 ` Figo.zhang 2013-11-05 1:47 ` David Lang @ 2013-11-05 2:08 ` NeilBrown 1 sibling, 0 replies; 21+ messages in thread From: NeilBrown @ 2013-11-05 2:08 UTC (permalink / raw) To: Figo.zhang Cc: Artem S. Tashkinov, david, lkml, Linus Torvalds, linux-fsdevel, axboe, Linux-MM [-- Attachment #1: Type: text/plain, Size: 1632 bytes --] On Tue, 5 Nov 2013 09:40:55 +0800 "Figo.zhang" <figo1802@gmail.com> wrote: > > > > > > Of course, if you don't use Linux on the desktop you don't really care - > > well, I do. Also > > > not everyone in this world has an UPS - which means such a huge buffer > > can lead to a > > > serious data loss in case of a power blackout. > > > > I don't have a desk (just a lap), but I use Linux on all my computers and > > I've never really noticed the problem. Maybe I'm just very patient, or > > maybe > > I don't work with large data sets and slow devices. > > > > However I don't think data-loss is really a related issue. Any process > > that > > cares about data safety *must* use fsync at appropriate places. This has > > always been true. > > > > =>May i ask question that, some like ext4 filesystem, if some app motify > the files, it create some dirty data. if some meta-data writing to the > journal disk when a power backout, > it will be lose some serious data and the the file will damage? If you modify a file, then you must take care that you can recover from a crash at any point in the process. If the file is small, the usual approach is to create a copy of the file with the appropriate changes made, then 'fsync' the file and rename the new file over the old file. If the file is large you might need some sort of update log (in a small file) so you can replay recent updates after a crash. The journalling that the filesystem provides only protects the filesystem metadata. It does not protect the consistency of the data in your file. I hope that helps. NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 21+ messages in thread
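A sketch of the small-file pattern described above: write a new copy, fsync it, rename it over the old name, then fsync the directory so the rename itself is durable. The helper name, the ".tmp" suffix, and the assumption that the file lives in the current directory are illustrative choices, not anything from the thread.

```c
/* Sketch: crash-safe replacement of a small file.
 * Assumes "path" is relative to the current directory; a real
 * implementation would fsync the file's actual parent directory
 * and handle errors more carefully. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int replace_file(const char *path, const char *data, size_t len)
{
	char tmp[4096];
	int fd, dirfd;

	snprintf(tmp, sizeof(tmp), "%s.tmp", path);

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	close(fd);

	/* rename() is atomic: a crash leaves either the old file or the
	 * complete new one, never a half-written mix. */
	if (rename(tmp, path) != 0)
		return -1;

	/* Make the directory entry change durable too. */
	dirfd = open(".", O_RDONLY);
	if (dirfd >= 0) {
		fsync(dirfd);
		close(dirfd);
	}
	return 0;
}

int main(void)
{
	/* Hypothetical usage with made-up file name and contents. */
	return replace_file("example.conf", "key=value\n", 10) ? 1 : 0;
}
```

For large files, the same durability rules apply to whatever update log is used to replay recent changes after a crash.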
* Re: Disabling in-memory write cache for x86-64 in Linux II 2013-10-25 18:26 ` Artem S. Tashkinov 2013-10-25 19:40 ` Diego Calleja 2013-10-25 20:43 ` NeilBrown @ 2013-10-29 20:49 ` Jan Kara 2 siblings, 0 replies; 21+ messages in thread From: Jan Kara @ 2013-10-29 20:49 UTC (permalink / raw) To: Artem S. Tashkinov Cc: david, neilb, linux-kernel, torvalds, linux-fsdevel, axboe, linux-mm On Fri 25-10-13 18:26:23, Artem S. Tashkinov wrote: > Oct 25, 2013 05:26:45 PM, david wrote: > On Fri, 25 Oct 2013, NeilBrown wrote: > > > >> > >> What exactly is bothering you about this? The amount of memory used or the > >> time until data is flushed? > > > >actually, I think the problem is more the impact of the huge write later on. > > Exactly. And not being able to use applications which show you IO > performance like Midnight Commander. You might prefer to use "cp -a" but > I cannot imagine my life without being able to see the progress of a > copying operation. With the current dirty cache there's no way to > understand how you storage media actually behaves. Large writes shouldn't stall your desktop, that's certain and we must fix that. I don't find the problem with copy progress indicators that pressing... > Hopefully this issue won't dissolve into obscurity and someone will > actually make up a plan (and a patch) how to make dirty write cache > behave in a sane manner considering the fact that there are devices with > very different write speeds and requirements. It'd be ever better, if I > could specify dirty cache as a mount option (though sane defaults or > semi-automatic values based on runtime estimates won't hurt). > > Per device dirty cache seems like a nice idea, I, for one, would like to > disable it altogether or make it an absolute minimum for things like USB > flash drives - because I don't care about multithreaded performance or > delayed allocation on such devices - I'm interested in my data reaching > my USB stick ASAP - because it's how most people use them. See my other emails in this thread. There are ways to tune the amount of dirty data allowed per device. Currently the result isn't very satisfactory but we should have something usable after the next merge window. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 21+ messages in thread
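To make the per-device knob mentioned above concrete, here is a small sketch that writes /sys/class/bdi/<major>:<minor>/max_ratio for a given block device. Assumptions worth flagging: you pass the whole disk node (e.g. /dev/sdb, not a partition), the value is a percentage of the global dirty limit, and it must run as root; this is only one illustration of the tuning being discussed, not a recommended tool.

```c
/* Sketch: cap the dirty-cache share of one slow device via its BDI
 * max_ratio knob. Illustrative only; pass a whole-disk device node. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(int argc, char **argv)
{
	struct stat st;
	char path[128];
	FILE *f;

	if (argc != 3) {
		fprintf(stderr, "usage: %s /dev/sdX percent\n", argv[0]);
		return 1;
	}
	if (stat(argv[1], &st) != 0 || !S_ISBLK(st.st_mode)) {
		fprintf(stderr, "%s: not a block device\n", argv[1]);
		return 1;
	}

	/* The BDI directory is named after the disk's major:minor. */
	snprintf(path, sizeof(path), "/sys/class/bdi/%u:%u/max_ratio",
		 major(st.st_rdev), minor(st.st_rdev));

	f = fopen(path, "w");
	if (!f || fprintf(f, "%s\n", argv[2]) < 0) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("%s = %s%%\n", path, argv[2]);
	return 0;
}
```

Setting it to 1 on a USB stick roughly limits that device to 1% of the global dirty threshold, which is close to the "absolute minimum" behaviour asked for earlier in the thread.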
end of thread, other threads:[~2013-11-15 15:48 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-25  7:25 Disabling in-memory write cache for x86-64 in Linux II Artem S. Tashkinov
2013-10-25  8:18 ` Linus Torvalds
2013-11-05  0:50   ` Andreas Dilger
2013-11-05  4:12     ` Dave Chinner
2013-11-07 13:48       ` Jan Kara
2013-11-11  3:22         ` Dave Chinner
2013-11-11 19:31           ` Jan Kara
2013-11-05  6:32   ` Figo.zhang
2013-10-25 10:49 ` NeilBrown
2013-10-25 11:26   ` David Lang
2013-10-25 18:26     ` Artem S. Tashkinov
2013-10-25 19:40       ` Diego Calleja
2013-10-25 23:32         ` Fengguang Wu
2013-11-15 15:48           ` Diego Calleja
2013-10-25 20:43       ` NeilBrown
2013-10-25 21:03         ` Artem S. Tashkinov
2013-10-25 22:11           ` NeilBrown
2013-11-05  1:40             ` Figo.zhang
2013-11-05  1:47               ` David Lang
2013-11-05  2:08               ` NeilBrown
2013-10-29 20:49       ` Jan Kara