From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>, Theodore Ts'o <tytso@mit.edu>,
linux-ext4@vger.kernel.org, Xiong Zhou <xzhou@redhat.com>
Cc: linux-nvdimm@lists.01.org
Subject: question about ext4 block allocation
Date: Mon, 6 Feb 2017 16:14:09 -0700
Message-ID: <20170206231409.GA16676@linux.intel.com>
I recently hit an issue in my DAX testing where I was unable to get ext4 to
give me 2 MiB sized and aligned block allocations in a situation where I
thought I should be able to. I'm using a PMEM ramdisk of size 16 GiB, created
using the memmap kernel command line parameter.
# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
The very simple test program I used to reproduce this can be found at the
bottom of this mail. Here is the quick function that I used to recreate my
filesystem each run:
# type go_ext4
go_ext4 is a function
go_ext4 ()
{
    umount /dev/pmem0 2> /dev/null;
    mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0;
    mount -o dax /dev/pmem0 ~/dax;
    cd ~/fsync
}
To be able to easily see whether DAX is able to use PMDs instead of PTEs, you
can run with the mmots tree (git://git.cmpxchg.org/linux-mmots.git), tag
v4.10-rc4-mmots-2017-01-17-16-32.
Okay, so here's the interesting part. If I create a filesystem and run the
test so it creates a file of size 32 MiB or 128 MiB, I get a PMD fault.
Here's the corresponding tracepoint output:
test-1429 [008] .... 10573.026699: dax_pmd_fault: dev 259:0 ino 0xc shared
WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000 vm_end
0x40400000 pgoff 0x280 max_pgoff 0x7fff
test-1429 [008] .... 10573.026912: dax_pmd_insert_mapping: dev 259:0 ino 0xc
shared write address 0x40280000 length 0x200000 pfn 0x108a00 DEV|MAP
radix_entry 0x114000e
test-1429 [008] .... 10573.026917: dax_pmd_fault_done: dev 259:0 ino 0xc
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
vm_end 0x40400000 pgoff 0x280 max_pgoff 0x7fff NOPAGE
Great. That's what I want. But, if I create the filesystem and use the test
to create a file that is 64 MiB in size, the PMD fault fails because the PFN I
get from the filesystem isn't 2MiB aligned:
test-1475 [006] .... 11809.982188: dax_pmd_fault: dev 259:0 ino 0xc shared
WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000 vm_end
0x40400000 pgoff 0x280 max_pgoff 0x3fff
test-1475 [006] .... 11809.982398: dax_pmd_insert_mapping_fallback: dev 259:0
ino 0xc shared write address 0x40280000 length 0x200000 pfn 0x108601 DEV|MAP
radix_entry 0x0
test-1475 [006] .... 11809.982399: dax_pmd_fault_done: dev 259:0 ino 0xc
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
vm_end 0x40400000 pgoff 0x280 max_pgoff 0x3fff FALLBACK
The PFN for the block allocation I get from ext4 is 0x108601, which isn't
aligned, so we fail the PG_PMD_COLOUR alignment check in
dax_iomap_pmd_fault(), and use PTEs instead.
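
To make the alignment arithmetic concrete, here is a small user-space sketch of
the colour check. It is not the kernel code: the helper names are made up for
illustration, and the shift values assume x86-64 with 4 KiB pages and 2 MiB
PMDs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 values: 4 KiB base pages, 2 MiB PMD pages. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21

/* Number of 4 KiB pages per PMD, minus one: 511 (0x1ff). */
#define PG_PMD_COLOUR	((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)

/* Hypothetical helper: index of this PFN within its 2 MiB region. */
static inline uint64_t pfn_pmd_offset(uint64_t pfn)
{
	return pfn & PG_PMD_COLOUR;
}

/* Hypothetical helper: true if the PFN starts a 2 MiB region. */
static inline bool pfn_is_pmd_aligned(uint64_t pfn)
{
	return pfn_pmd_offset(pfn) == 0;
}
```

For the two PFNs in the traces above, pfn_pmd_offset(0x108a00) is 0 and
pfn_pmd_offset(0x108601) is 1, so the 64 MiB case is exactly one 4 KiB page
past a 2 MiB boundary.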
I initially saw this in a test from Xiong:
https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02615.html
and created the attached test to have a simpler reproducer. With Xiong's
test, a run on a 128 MiB sized file will use all PMDs, and on a 64 MiB file
we'll use all PTEs.
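
One way to look at the allocation from user space, without tracepoints, is to
walk the file's extents with the FIEMAP ioctl and check whether each extent's
logical and physical byte offsets are congruent modulo 2 MiB. This is only a
sketch: the helper names and the 256-extent cap are arbitrary, and it assumes
the pmem device itself starts on a 2 MiB aligned physical address, so that an
aligned offset on the device means an aligned PFN.

```c
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

#define PMD_SIZE	(2ULL << 20)	/* 2 MiB, assumed x86-64 PMD size */

/* True if the logical and physical offsets agree modulo 2 MiB. */
static int extent_pmd_congruent(uint64_t logical, uint64_t physical)
{
	return ((physical - logical) & (PMD_SIZE - 1)) == 0;
}

/*
 * Hypothetical helper: print each extent of fd and return how many are
 * not congruent modulo 2 MiB, or -1 on error. Only the first
 * MAX_EXTENTS extents are examined.
 */
static int count_unaligned_extents(int fd)
{
	enum { MAX_EXTENTS = 256 };
	struct fiemap *fm;
	unsigned int i;
	int bad = 0;

	fm = calloc(1, sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent));
	if (!fm)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush before mapping */
	fm->fm_extent_count = MAX_EXTENTS;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		free(fm);
		return -1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *e = &fm->fm_extents[i];
		int ok = extent_pmd_congruent(e->fe_logical, e->fe_physical);

		printf("logical 0x%llx physical 0x%llx length 0x%llx %s\n",
		       (unsigned long long)e->fe_logical,
		       (unsigned long long)e->fe_physical,
		       (unsigned long long)e->fe_length,
		       ok ? "ok" : "unaligned");
		if (!ok)
			bad++;
	}

	free(fm);
	return bad;
}
```

Calling count_unaligned_extents() on the descriptor for /root/dax/data after
the fallocate() would list the extents ext4 handed back. Congruence is a
necessary condition for a PMD mapping, not a sufficient one: the extent also
has to cover the whole 2 MiB range being faulted.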
This question is important because eventually we'd like to say to customers
"do X and you should get PMDs when you use DAX", but right now I'm not sure
what X is. :)
Thanks,
- Ross
--- >8 ---
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

#define GiB(a) ((a)*1024ULL*1024*1024)
#define MiB(a) ((a)*1024ULL*1024)
#define PAGE(a) ((a)*0x1000)

void usage(char *prog)
{
	fprintf(stderr, "usage: %s <size in MiB>\n", prog);
	exit(1);
}

void err_exit(char *op, unsigned long len)
{
	fprintf(stderr, "%s(%s) len %lu\n", op, strerror(errno), len);
	exit(1);
}

int main(int argc, char *argv[])
{
	char *data_array = (char*) GiB(1); /* request a 2MiB aligned address with mmap() */
	unsigned long len;
	int fd;

	if (argc < 2)
		usage(basename(argv[0]));

	errno = 0;	/* strtoul() only sets errno on failure */
	len = strtoul(argv[1], NULL, 10);
	if (errno == ERANGE)
		err_exit("strtoul", 0);

	fd = open("/root/dax/data", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* drop any old blocks, then preallocate the requested size */
	if (ftruncate(fd, 0) < 0)
		err_exit("ftruncate", 0);
	if (fallocate(fd, 0, 0, MiB(len)) < 0)
		err_exit("fallocate", len);

	data_array = mmap(data_array, PAGE(0x400), PROT_READ|PROT_WRITE,
			MAP_SHARED, fd, PAGE(0));
	if (data_array == MAP_FAILED)
		err_exit("mmap", PAGE(0x400));

	/* touch page 0x280, inside the second 2 MiB of the mapping */
	data_array[PAGE(0x280)] = 142;

	fsync(fd);
	close(fd);
	return 0;
}