* ext4 performance falloff
From: Daniel J Blueman @ 2014-04-04 17:00 UTC
To: linux-ext4, LKML; +Cc: Steffen Persvold, Theodore Ts'o, Andreas Dilger

On a larger system (1728 cores, 4.5 TB memory) running 3.13.9, I'm seeing
very low 600 KB/s cached write performance to a local ext4 filesystem:

# mkfs.ext4 /dev/sda5
# mount /dev/sda5 /mnt
# dd if=/dev/zero of=/mnt/test bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 17.4307 s, 602 kB/s

whereas on XFS, for example, performance is far more reasonable:

# mkfs.xfs /dev/sda5
# mount /dev/sda5 /mnt
# dd if=/dev/zero of=/mnt/test bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 2.39329 s, 43.8 MB/s

Perf shows the time is spent in bitmask iteration:

98.77%  dd  [kernel.kallsyms]  [k] find_next_bit
        |
        --- find_next_bit
           |
           |--99.92%-- __percpu_counter_sum
           |          ext4_has_free_clusters
           |          ext4_claim_free_clusters
           |          ext4_mb_new_blocks
           |          ext4_ext_map_blocks
           |          ext4_map_blocks
           |          _ext4_get_block
           |          ext4_get_block
           |          __block_write_begin
           |          ext4_write_begin
           |          ext4_da_write_begin
           |          generic_file_buffered_write
           |          __generic_file_aio_write
           |          generic_file_aio_write
           |          ext4_file_write
           |          do_sync_write
           |          vfs_write
           |          sys_write
           |          system_call_fastpath
           |          __write_nocancel
           |          0x0
            --0.08%-- [...]

Analysis shows that ext4 is reading every core's CPU-local counter data
(and thus taking expensive off-NUMA-node accesses) for each block written:

	if (free_clusters - (nclusters + rsv + dirty_clusters) <
				EXT4_FREECLUSTERS_WATERMARK) {
		free_clusters = percpu_counter_sum_positive(fcc);
		dirty_clusters = percpu_counter_sum_positive(dcc);
	}

This threshold is defined as:

#define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))

I can see why this may get overlooked on systems whose local storage is
commensurate with their core count, but some filesystems reasonably don't
need to scale with the number of cores; the filesystem I'm testing on and
the rootfs (which holds /tmp) are 50 GB.

There must be a good rationale for this threshold depending on the number
of cores rather than just the proportion of used space, right?

Thanks,
  Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale
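For readers unfamiliar with why __percpu_counter_sum() dominates the
profile, the following userspace sketch (a simplified model, not the
kernel's percpu_counter implementation) contrasts the cheap approximate
read with the exact sum that ext4 falls back to below the watermark.
NCPUS and all names here are illustrative stand-ins.

	/*
	 * Fast path: read one shared value, possibly stale by up to
	 * NCPUS * batch.  Slow path: walk every CPU's delta, which on a
	 * 1728-core NUMA machine means touching remote memory per call.
	 */
	#include <stdint.h>
	#include <stdio.h>

	#define NCPUS 1728			/* stand-in for nr_cpu_ids */

	struct pcpu_counter {
		int64_t global;			/* folded-in totals */
		int64_t delta[NCPUS];		/* per-CPU unfolded deltas */
	};

	static int64_t counter_read(const struct pcpu_counter *c)
	{
		return c->global;		/* one cache line */
	}

	static int64_t counter_sum(const struct pcpu_counter *c)
	{
		int64_t sum = c->global;

		for (int cpu = 0; cpu < NCPUS; cpu++)	/* NCPUS cache lines */
			sum += c->delta[cpu];
		return sum;
	}

	int main(void)
	{
		static struct pcpu_counter free_clusters = { .global = 16464621 };

		printf("approx=%lld exact=%lld\n",
		       (long long)counter_read(&free_clusters),
		       (long long)counter_sum(&free_clusters));
		return 0;
	}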
* Re: ext4 performance falloff
From: Theodore Ts'o @ 2014-04-04 20:56 UTC
To: Daniel J Blueman; +Cc: linux-ext4, LKML, Steffen Persvold, Andreas Dilger

On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
> On a larger system (1728 cores, 4.5 TB memory) running 3.13.9, I'm seeing
> very low 600 KB/s cached write performance to a local ext4 filesystem:

Hi Daniel,

Thanks for the heads up.  Most (all?) of the ext4 developers don't have
systems with thousands of cores, so these issues generally don't come up
for us, and we're not likely (hell, very unlikely!) to notice potential
problems caused by these sorts of uber-large systems.

> Analysis shows that ext4 is reading every core's CPU-local counter data
> (and thus taking expensive off-NUMA-node accesses) for each block written:
>
> 	if (free_clusters - (nclusters + rsv + dirty_clusters) <
> 				EXT4_FREECLUSTERS_WATERMARK) {
> 		free_clusters = percpu_counter_sum_positive(fcc);
> 		dirty_clusters = percpu_counter_sum_positive(dcc);
> 	}
>
> This threshold is defined as:
>
> #define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))
>
> I can see why this may get overlooked on systems whose local storage is
> commensurate with their core count, but some filesystems reasonably don't
> need to scale with the number of cores; the filesystem I'm testing on and
> the rootfs (which holds /tmp) are 50 GB.

The problem we are trying to solve here is that when we do delayed
allocation, we're making an implicit promise that there will be space
available, even though we haven't allocated the space yet.  The reason why
we are using percpu counters is precisely so that we don't have to take a
global lock to protect the free space counter for the file system.

The problem is that when we start getting close to full, there is the
possibility that all of the CPUs might simultaneously try to allocate
space at exactly the same time (and while that might sound unlikely,
Murphy's law dictates that if the downside is that the user will lose data
and curse the day the file system developers were born, it *will* happen
:-).  So when the free space, minus the space we have already promised,
drops below EXT4_FREECLUSTERS_WATERMARK, we start being super careful.

I've done the calculations, and 4 * 32 * 1728 cores = 221184 blocks, or
864 megabytes.  That would mean that the file system is over 98% full, so
that's actually pretty reasonable; most of the time there's more free
space than that.

It looks like the real problem is that we're using nr_cpu_ids, which is
the maximum possible number of CPUs that the system can support, which is
different from the number of CPUs that you currently have.  For normal
kernels nr_cpu_ids is small, so that has never been a problem, but I bet
you have nr_cpu_ids set to something really large, right?

If you change nr_cpu_ids to total_cpus in the definition of
EXT4_FREECLUSTERS_WATERMARK, does that make things better for your system?

Thanks,

					- Ted
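A sketch of the experiment Ted suggests, written against the 3.13-era
macro (assuming it still lives in fs/ext4/ext4.h; check your tree).
total_cpus may not be visible or exported where ext4 (which can be built
as a module) needs it, so num_online_cpus() could be a more practical
stand-in; treat the exact identifier as an assumption, not a tested patch.

	/* before: scales with the number of *possible* CPUs */
	#define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))

	/* after: scales with the CPUs actually brought up at boot */
	#define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * total_cpus))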
* Re: ext4 performance falloff
From: Daniel J Blueman @ 2014-04-05 3:28 UTC
To: Theodore Ts'o; +Cc: linux-ext4, LKML, Steffen Persvold, Andreas Dilger

On 04/05/2014 04:56 AM, Theodore Ts'o wrote:
> On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
>> On a larger system (1728 cores, 4.5 TB memory) running 3.13.9, I'm seeing
>> very low 600 KB/s cached write performance to a local ext4 filesystem:
>
> Thanks for the heads up.  Most (all?) of the ext4 developers don't have
> systems with thousands of cores, so these issues generally don't come up
> for us, and we're not likely (hell, very unlikely!) to notice potential
> problems caused by these sorts of uber-large systems.

Hehe. It's not every day we get access to these systems, either.

>> Analysis shows that ext4 is reading every core's CPU-local counter data
>> (and thus taking expensive off-NUMA-node accesses) for each block written:
>>
>> 	if (free_clusters - (nclusters + rsv + dirty_clusters) <
>> 				EXT4_FREECLUSTERS_WATERMARK) {
>> 		free_clusters = percpu_counter_sum_positive(fcc);
>> 		dirty_clusters = percpu_counter_sum_positive(dcc);
>> 	}
>>
>> This threshold is defined as:
>>
>> #define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))
...
> The problem we are trying to solve here is that when we do delayed
> allocation, we're making an implicit promise that there will be space
> available
>
> I've done the calculations, and 4 * 32 * 1728 cores = 221184 blocks, or
> 864 megabytes.  That would mean that the file system is over 98% full, so
> that's actually pretty reasonable; most of the time there's more free
> space than that.

The filesystem is empty straight after the mkfs; this approach may make
sense if we want to allow all cores to write to the filesystem, but here
only one core is writing.

Instrumenting shows that free_clusters=16464621 nclusters=1 rsv=842790
dirty_clusters=0 percpu_counter_batch=3456 nr_cpu_ids=1728; with less than
91 GB of free space, we'd hit this issue. It feels more sensible to start
this behaviour when the filesystem is, say, 98% full, irrespective of the
number of cores, but that isn't the rationale behind the current threshold.

Since these block devices are attached to a single NUMA node's IO link,
there is a scaling limitation there anyway, so perhaps there is a case
for capping this at min(256, nr_cpu_ids)?

> It looks like the real problem is that we're using nr_cpu_ids, which is
> the maximum possible number of CPUs that the system can support, which is
> different from the number of CPUs that you currently have.  For normal
> kernels nr_cpu_ids is small, so that has never been a problem, but I bet
> you have nr_cpu_ids set to something really large, right?
>
> If you change nr_cpu_ids to total_cpus in the definition of
> EXT4_FREECLUSTERS_WATERMARK, does that make things better for your
> system?

I have reproduced this with CPU hotplug disabled, so nr_cpu_ids is nicely
at 1728.

Thanks,
  Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale
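For reference, a back-of-the-envelope check of Daniel's figures, assuming
the default 4 KiB cluster size (not stated in the thread):

    EXT4_FREECLUSTERS_WATERMARK = 4 * percpu_counter_batch * nr_cpu_ids
                                = 4 * 3456 * 1728
                                = 23,887,872 clusters
                                ~ 23,887,872 * 4 KiB ~ 91 GiB

so on this machine any ext4 filesystem with less than roughly 91 GiB free
sits permanently in the percpu_counter_sum_positive() slow path.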
* Re: ext4 performance falloff
From: Jan Kara @ 2014-04-07 14:19 UTC
To: Daniel J Blueman
Cc: Theodore Ts'o, linux-ext4, LKML, Steffen Persvold, Andreas Dilger

On Sat 05-04-14 11:28:17, Daniel J Blueman wrote:
> On 04/05/2014 04:56 AM, Theodore Ts'o wrote:
> > On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
> >> On a larger system (1728 cores, 4.5 TB memory) running 3.13.9, I'm
> >> seeing very low 600 KB/s cached write performance to a local ext4
> >> filesystem:
> >
> > Thanks for the heads up.  Most (all?) of the ext4 developers don't have
> > systems with thousands of cores, so these issues generally don't come up
> > for us, and we're not likely (hell, very unlikely!) to notice potential
> > problems caused by these sorts of uber-large systems.
>
> Hehe. It's not every day we get access to these systems, either.
>
> >> Analysis shows that ext4 is reading every core's CPU-local counter data
> >> (and thus taking expensive off-NUMA-node accesses) for each block
> >> written:
> >>
> >> 	if (free_clusters - (nclusters + rsv + dirty_clusters) <
> >> 				EXT4_FREECLUSTERS_WATERMARK) {
> >> 		free_clusters = percpu_counter_sum_positive(fcc);
> >> 		dirty_clusters = percpu_counter_sum_positive(dcc);
> >> 	}
> >>
> >> This threshold is defined as:
> >>
> >> #define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch * nr_cpu_ids))
> ...
> > The problem we are trying to solve here is that when we do delayed
> > allocation, we're making an implicit promise that there will be space
> > available
> >
> > I've done the calculations, and 4 * 32 * 1728 cores = 221184 blocks, or
> > 864 megabytes.  That would mean that the file system is over 98% full,
> > so that's actually pretty reasonable; most of the time there's more free
> > space than that.
>
> The filesystem is empty straight after the mkfs; this approach may make
> sense if we want to allow all cores to write to the filesystem, but here
> only one core is writing.
>
> Instrumenting shows that free_clusters=16464621 nclusters=1 rsv=842790
> dirty_clusters=0 percpu_counter_batch=3456 nr_cpu_ids=1728; with less than
> 91 GB of free space, we'd hit this issue. It feels more sensible to start
> this behaviour when the filesystem is, say, 98% full, irrespective of the
> number of cores, but that isn't the rationale behind the current threshold.

Yeah, percpu_counter_batch = max(32, nr_cpus*2), so the value you observe
is correct and EXT4_FREECLUSTERS_WATERMARK is then 23887872 clusters
~= 91 GiB. Clearly we have to try to be more clever on these large systems.

> Since these block devices are attached to a single NUMA node's IO link,
> there is a scaling limitation there anyway, so perhaps there is a case
> for capping this at min(256, nr_cpu_ids)?

Well, but when you get something "allocated" from the counter, we rely on
the space being really available in the filesystem (so that delayed
allocated blocks can be allocated and written out). With the limit set to
256, if more than 256 * percpu_counter_batch were accumulated in the percpu
part of the counter, we could promise to allocate something we don't really
have space for. And I understand this is unlikely, but when we speak about
"your data is lost", even unlikely doesn't sound good to people. They want
"this can never happen" promises :)

What we really need is a counter where we can better estimate the counts
accumulated in its percpu part. As the counter approaches zero, its CPU
overhead will have to become that of a single locked variable, but when the
value of the counter is relatively high, we want it to be as fast as the
percpu one. Possibly, each CPU could "reserve" part of the value in the
counter (by just decrementing the total value; how large that part should
be really needs to depend on the total value of the counter and the number
of CPUs - in this regard we really differ from classical percpu counters)
and allocate/free using that part. If a CPU cannot reserve what it is asked
for anymore, it would go and steal from the parts other CPUs have
accumulated, returning them to the global pool until it can satisfy the
allocation. But someone would need to try whether this really works out
reasonably fast :).

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
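One way to read Jan's proposal as code is sketched below. This is a
single-threaded userspace model under invented policy choices (chunk size,
stealing order), not a patch: because blocks are debited from the global
count before being promised, it can never over-commit, but a real
implementation would still need the stealing path to synchronise with each
owner's lock-free fast path, which is exactly the open question Jan raises.

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define NCPUS 8			/* stand-in for nr_cpu_ids */

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static int64_t global_free = 1000;	/* blocks not reserved by any CPU */
	static int64_t reserved[NCPUS];		/* per-CPU private stashes */

	/* Refill size shrinks as free space shrinks, as Jan suggests. */
	static int64_t chunk_size(void)
	{
		int64_t c = global_free / (4 * NCPUS);

		return c > 0 ? c : 1;
	}

	static bool refill(int cpu, int64_t need)
	{
		pthread_mutex_lock(&lock);
		int64_t want = need > chunk_size() ? need : chunk_size();

		if (global_free < want) {
			/* Near ENOSPC: pull other CPUs' reservations back. */
			for (int i = 0; i < NCPUS && global_free < want; i++) {
				if (i == cpu)
					continue;
				global_free += reserved[i];
				reserved[i] = 0;
			}
			if (want > global_free)
				want = global_free;
		}
		global_free -= want;
		reserved[cpu] += want;
		pthread_mutex_unlock(&lock);
		return reserved[cpu] >= need;
	}

	/* Claim @nblocks for delayed allocation on behalf of @cpu. */
	static bool claim(int cpu, int64_t nblocks)
	{
		if (reserved[cpu] < nblocks && !refill(cpu, nblocks))
			return false;		/* genuine ENOSPC */
		reserved[cpu] -= nblocks;
		return true;
	}

	int main(void)
	{
		int64_t done = 0;

		while (claim(done % NCPUS, 3))
			done += 3;
		printf("claimed %lld of 1000 blocks before ENOSPC\n",
		       (long long)done);
		return 0;
	}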
* Re: ext4 performance falloff
From: Andi Kleen @ 2014-04-07 16:40 UTC
To: Jan Kara
Cc: Daniel J Blueman, Theodore Ts'o, linux-ext4, LKML, Steffen Persvold, Andreas Dilger

Jan Kara <jack@suse.cz> writes:
>
> What we really need is a counter where we can better estimate the counts
> accumulated in its percpu part. As the counter approaches zero, its CPU
> overhead will have to become that of a single locked variable, but when the
> value of the counter is relatively high, we want it to be as fast as the
> percpu one. Possibly, each CPU could "reserve" part of the value in the
> counter (by just decrementing the total value; how large that part should
> be really needs to depend on the total value of the counter and the number
> of CPUs - in this regard we really differ from classical percpu counters)
> and allocate/free using that part. If a CPU cannot reserve what it is asked
> for anymore, it would go and steal from the parts other CPUs have
> accumulated, returning them to the global pool until it can satisfy the
> allocation.

That's a percpu_counter(), isn't it? (or a cookie jar)

The MM uses similar techniques.

-Andi

--
ak@linux.intel.com -- Speaking for myself only
* Re: ext4 performance falloff
From: Jan Kara @ 2014-04-07 20:08 UTC
To: Andi Kleen
Cc: Jan Kara, Daniel J Blueman, Theodore Ts'o, linux-ext4, LKML, Steffen Persvold, Andreas Dilger

On Mon 07-04-14 09:40:28, Andi Kleen wrote:
> Jan Kara <jack@suse.cz> writes:
> >
> > What we really need is a counter where we can better estimate the counts
> > accumulated in its percpu part. As the counter approaches zero, its CPU
> > overhead will have to become that of a single locked variable, but when
> > the value of the counter is relatively high, we want it to be as fast as
> > the percpu one. Possibly, each CPU could "reserve" part of the value in
> > the counter (by just decrementing the total value; how large that part
> > should be really needs to depend on the total value of the counter and
> > the number of CPUs - in this regard we really differ from classical
> > percpu counters) and allocate/free using that part. If a CPU cannot
> > reserve what it is asked for anymore, it would go and steal from the
> > parts other CPUs have accumulated, returning them to the global pool
> > until it can satisfy the allocation.
>
> That's a percpu_counter(), isn't it? (or a cookie jar)

Not quite. We could use __percpu_counter_add() to set the batch size for
each operation depending on the current counter value. But we still don't
want any cpu-local count to go negative (as then we cannot rely on the
global counter to give us a lower bound on the number of free blocks).
Also, stealing from a different CPU would need to be implemented...

> The MM uses similar techniques.

Where exactly? I'd be happy to be inspired :).

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
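The half of Jan's reply that plain percpu counters already support, picking
the batch per operation from the current approximate value so that less
error can hide in the per-CPU deltas near zero, can be illustrated with the
userspace toy below (the policy numbers are invented; Jan's point is that
this alone is not enough, since the deltas must also never promise space
that is not there).

	#include <stdint.h>
	#include <stdio.h>

	#define NCPUS 4

	static int64_t global = 1000;
	static int64_t delta[NCPUS];

	static int64_t pick_batch(void)
	{
		/* Invented policy: 1/8th of the approximate value per CPU, min 1. */
		int64_t b = global / (8 * NCPUS);

		return b > 0 ? b : 1;
	}

	/* Rough analogue of __percpu_counter_add(fbc, amount, batch). */
	static void counter_add(int cpu, int64_t amount)
	{
		int64_t batch = pick_batch();

		delta[cpu] += amount;
		if (delta[cpu] >= batch || delta[cpu] <= -batch) {
			global += delta[cpu];	/* fold into the shared value */
			delta[cpu] = 0;
		}
	}

	int main(void)
	{
		for (int i = 0; i < 400; i++)
			counter_add(i % NCPUS, -2);	/* consume 800 blocks */
		printf("approx global=%lld (true value 200)\n", (long long)global);
		return 0;
	}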
* Re: ext4 performance falloff
From: Dave Chinner @ 2014-04-08 10:30 UTC
To: Andi Kleen
Cc: Jan Kara, Daniel J Blueman, Theodore Ts'o, linux-ext4, LKML, Steffen Persvold, Andreas Dilger

On Mon, Apr 07, 2014 at 09:40:28AM -0700, Andi Kleen wrote:
> Jan Kara <jack@suse.cz> writes:
> >
> > What we really need is a counter where we can better estimate the counts
> > accumulated in its percpu part. As the counter approaches zero, its CPU
> > overhead will have to become that of a single locked variable, but when
> > the value of the counter is relatively high, we want it to be as fast as
> > the percpu one. Possibly, each CPU could "reserve" part of the value in
> > the counter (by just decrementing the total value; how large that part
> > should be really needs to depend on the total value of the counter and
> > the number of CPUs - in this regard we really differ from classical
> > percpu counters) and allocate/free using that part. If a CPU cannot
> > reserve what it is asked for anymore, it would go and steal from the
> > parts other CPUs have accumulated, returning them to the global pool
> > until it can satisfy the allocation.

Yup, that's pretty much what the slow path/fast path breakdown of the
xfs_icsb_* (XFS In-Core Super Block) code in fs/xfs/xfs_mount.c does. :)

It distributes free space across all the CPUs and rebalances them when a
per-CPU counter runs out. And to avoid lots of rebalances as ENOSPC
approaches (below 512 blocks per CPU, IIRC), it disables the per-CPU
counters completely and falls back to a global counter protected by a
mutex, to avoid wasting hundreds of CPUs spinning on a contended global
lock. When the free space goes back above that threshold, it returns to
per-cpu mode (the fast path code).

> That's a percpu_counter(), isn't it? (or a cookie jar)

No. percpu_counters do not guarantee accuracy, nor can the counters be
externally serialised for things like concurrent ENOSPC detection that
require a guarantee that the counter never, ever goes below zero.

> The MM uses similar techniques.

I haven't seen anything else that uses techniques similar to the XFS code
- I wrote it back in 2005, before there was generic per-cpu counter
infrastructure, and I've been keeping an eye out ever since as to whether
it could be replaced with generic code....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
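A compressed model of the mode switch Dave describes, greatly simplified
for illustration (single-threaded, no locking shown; the real code is the
xfs_icsb_* family in fs/xfs/xfs_mount.c, and the 512-blocks-per-CPU figure
is taken from his description):

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define NCPUS			16
	#define PER_CPU_THRESHOLD	512	/* blocks per CPU, per Dave's mail */

	struct icsb_model {
		bool    percpu_mode;
		int64_t global_free;		/* used when percpu_mode == false */
		int64_t per_cpu[NCPUS];		/* used when percpu_mode == true  */
	};

	static int64_t total_free(const struct icsb_model *m)
	{
		if (!m->percpu_mode)
			return m->global_free;

		int64_t sum = 0;
		for (int i = 0; i < NCPUS; i++)
			sum += m->per_cpu[i];
		return sum;
	}

	/* Rebalance: fold everything together, pick a mode, redistribute. */
	static void rebalance(struct icsb_model *m)
	{
		int64_t total = total_free(m);

		m->percpu_mode = total >= (int64_t)PER_CPU_THRESHOLD * NCPUS;
		if (!m->percpu_mode) {
			m->global_free = total;	/* serialise on one counter near ENOSPC */
			for (int i = 0; i < NCPUS; i++)
				m->per_cpu[i] = 0;
		} else {
			for (int i = 0; i < NCPUS; i++)
				m->per_cpu[i] = total / NCPUS + (i < total % NCPUS);
			m->global_free = 0;
		}
	}

	int main(void)
	{
		struct icsb_model m = { .percpu_mode = false, .global_free = 1 << 20 };

		rebalance(&m);
		printf("1M blocks free   -> percpu_mode=%d\n", m.percpu_mode);

		m.per_cpu[0] = 0;		/* simulate space draining away */
		for (int i = 1; i < NCPUS; i++)
			m.per_cpu[i] = 100;
		rebalance(&m);
		printf("1500 blocks free -> percpu_mode=%d\n", m.percpu_mode);
		return 0;
	}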