* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-18 21:01 ` [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages Joanne Koong
@ 2024-12-19 14:29 ` Brian Foster
2024-12-19 22:34 ` Joanne Koong
2024-12-19 17:51 ` Darrick J. Wong
2024-12-20 7:47 ` Nirjhar Roy
2 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2024-12-19 14:29 UTC (permalink / raw)
To: Joanne Koong; +Cc: fstests, linux-fsdevel, kernel-team
On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote:
> Add support for reads/writes from buffers backed by hugepages.
> This can be enabled through the '-h' flag. This flag should only be used
> on systems where THP capabilities are enabled.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
Firstly, thanks for taking the time to add this. This seems like a nice
idea. It might be nice to have an extra sentence or two in the commit
log on the purpose/motivation. For example, has this been used to detect
a certain class of problem?
A few other quick comments below...
> ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 92 insertions(+), 8 deletions(-)
>
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index 41933354..3656fd9f 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -190,6 +190,7 @@ int o_direct; /* -Z */
> int aio = 0;
> int uring = 0;
> int mark_nr = 0;
> +int hugepages = 0; /* -h flag */
>
> int page_size;
> int page_mask;
> @@ -2471,7 +2472,7 @@ void
> usage(void)
> {
> fprintf(stdout, "usage: %s",
> - "fsx [-dfknqxyzBEFHIJKLORWXZ0]\n\
> + "fsx [-dfhknqxyzBEFHIJKLORWXZ0]\n\
> [-b opnum] [-c Prob] [-g filldata] [-i logdev] [-j logid]\n\
> [-l flen] [-m start:end] [-o oplen] [-p progressinterval]\n\
> [-r readbdy] [-s style] [-t truncbdy] [-w writebdy]\n\
> @@ -2484,6 +2485,7 @@ usage(void)
> -e: pollute post-eof on size changes (default 0)\n\
> -f: flush and invalidate cache after I/O\n\
> -g X: write character X instead of random generated data\n\
> + -h hugepages: use buffers backed by hugepages for reads/writes\n\
> -i logdev: do integrity testing, logdev is the dm log writes device\n\
> -j logid: prefix debug log messsages with this id\n\
> -k: do not truncate existing file and use its size as upper bound on file size\n\
> @@ -2856,6 +2858,56 @@ keep_running(void)
> return numops-- != 0;
> }
>
> +static long
> +get_hugepage_size(void)
> +{
> + const char *str = "Hugepagesize:";
> + long hugepage_size = -1;
> + char buffer[64];
> + FILE *file;
> +
> + file = fopen("/proc/meminfo", "r");
> + if (!file) {
> + prterr("get_hugepage_size: fopen /proc/meminfo");
> + return -1;
> + }
> + while (fgets(buffer, sizeof(buffer), file)) {
> + if (strncmp(buffer, str, strlen(str)) == 0) {
> + sscanf(buffer + strlen(str), "%ld", &hugepage_size);
> + break;
> + }
> + }
> + fclose(file);
> + if (hugepage_size == -1) {
> + prterr("get_hugepage_size: failed to find "
> + "hugepage size in /proc/meminfo\n");
> + return -1;
> + }
> +
> + /* convert from KiB to bytes */
> + return hugepage_size * 1024;
> +}
> +
> +static void *
> +init_hugepages_buf(unsigned len, long hugepage_size)
> +{
> + void *buf;
> + long buf_size = roundup(len, hugepage_size);
> +
> + if (posix_memalign(&buf, hugepage_size, buf_size)) {
> + prterr("posix_memalign for buf");
> + return NULL;
> + }
> + memset(buf, '\0', len);
I'm assuming it doesn't matter, but did you want to use buf_size here to
clear the whole buffer?
> + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> + prterr("madvise collapse for buf");
> + free(buf);
> + return NULL;
> + }
> +
> + return buf;
> +}
> +
> static struct option longopts[] = {
> {"replay-ops", required_argument, 0, 256},
> {"record-ops", optional_argument, 0, 255},
> @@ -2883,7 +2935,7 @@ main(int argc, char **argv)
> setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
>
> while ((ch = getopt_long(argc, argv,
> - "0b:c:de:fg:i:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> + "0b:c:de:fg:hi:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> longopts, NULL)) != EOF)
> switch (ch) {
> case 'b':
> @@ -2916,6 +2968,9 @@ main(int argc, char **argv)
> case 'g':
> filldata = *optarg;
> break;
> + case 'h':
> + hugepages = 1;
> + break;
> case 'i':
> integrity = 1;
> logdev = strdup(optarg);
> @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> original_buf = (char *) malloc(maxfilelen);
> for (i = 0; i < maxfilelen; i++)
> original_buf[i] = random() % 256;
> - good_buf = (char *) malloc(maxfilelen + writebdy);
> - good_buf = round_ptr_up(good_buf, writebdy, 0);
> - memset(good_buf, '\0', maxfilelen);
> - temp_buf = (char *) malloc(maxoplen + readbdy);
> - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> - memset(temp_buf, '\0', maxoplen);
> + if (hugepages) {
> + long hugepage_size;
> +
> + hugepage_size = get_hugepage_size();
> + if (hugepage_size == -1) {
> + prterr("get_hugepage_size()");
> + exit(99);
> + }
> +
> + if (writebdy != 1 && writebdy != hugepage_size)
> + prt("ignoring write alignment (since -h is enabled)");
> +
> + if (readbdy != 1 && readbdy != hugepage_size)
> + prt("ignoring read alignment (since -h is enabled)");
I'm a little unclear on what these warnings mean. The alignments are
still used in the read/write paths afaics. The non-huge mode seems to
only really care about the max size of the buffers in this code.
If your test doesn't actually use read/write alignments and the goal is
just to keep things simple, perhaps it would be cleaner to add something
like an if (hugepages && (writebdy != 1 || readbdy != 1)) check after
option processing and exit out as an unsupported combination..?
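[Something like the following untested sketch — the variable names mirror fsx's option globals, and the helper name and return convention are made up here:]

```c
#include <stdio.h>

/*
 * Hypothetical post-option-processing check: reject -h combined with
 * explicit read/write alignments. Returns 1 if the combination is ok,
 * 0 if unsupported.
 */
static int check_hugepage_opts(int hugepages, int writebdy, int readbdy)
{
	if (hugepages && (writebdy != 1 || readbdy != 1)) {
		fprintf(stderr,
			"-h (hugepages) cannot be combined with -r/-w alignment\n");
		return 0;
	}
	return 1;
}
```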
BTW, it might also be nice to factor out this whole section of buffer
initialization code (including original_buf) into an init_buffers() or
some such. That could be done as a prep patch, but just a suggestion
either way.
Brian
> +
> + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> + if (!good_buf) {
> + prterr("init_hugepages_buf failed for good_buf");
> + exit(100);
> + }
> +
> + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> + if (!temp_buf) {
> + prterr("init_hugepages_buf failed for temp_buf");
> + exit(101);
> + }
> + } else {
> + good_buf = (char *) malloc(maxfilelen + writebdy);
> + good_buf = round_ptr_up(good_buf, writebdy, 0);
> + memset(good_buf, '\0', maxfilelen);
> +
> + temp_buf = (char *) malloc(maxoplen + readbdy);
> + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> + memset(temp_buf, '\0', maxoplen);
> + }
> if (lite) { /* zero entire existing file */
> ssize_t written;
>
> --
> 2.47.1
>
>
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-19 14:29 ` Brian Foster
@ 2024-12-19 22:34 ` Joanne Koong
2024-12-20 11:52 ` Brian Foster
0 siblings, 1 reply; 15+ messages in thread
From: Joanne Koong @ 2024-12-19 22:34 UTC (permalink / raw)
To: Brian Foster; +Cc: fstests, linux-fsdevel, kernel-team
On Thu, Dec 19, 2024 at 6:27 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote:
> > Add support for reads/writes from buffers backed by hugepages.
> > This can be enabled through the '-h' flag. This flag should only be used
> > on systems where THP capabilities are enabled.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
>
> Firstly, thanks for taking the time to add this. This seems like a nice
> idea. It might be nice to have an extra sentence or two in the commit
> log on the purpose/motivation. For example, has this been used to detect
> a certain class of problem?
Hi Brian,
Thanks for reviewing this. That's a good idea - I'll include the
sentence from the cover letter to this commit message as well: "This
is motivated by a recent bug that was due to faulty handling for
userspace buffers backed by hugepages."
>
> A few other quick comments below...
>
> > ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 92 insertions(+), 8 deletions(-)
> >
> > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > index 41933354..3656fd9f 100644
> > --- a/ltp/fsx.c
> > +++ b/ltp/fsx.c
> > @@ -190,6 +190,7 @@ int o_direct; /* -Z */
> > int aio = 0;
> > +}
> > +
> > +static void *
> > +init_hugepages_buf(unsigned len, long hugepage_size)
> > +{
> > + void *buf;
> > + long buf_size = roundup(len, hugepage_size);
> > +
> > + if (posix_memalign(&buf, hugepage_size, buf_size)) {
> > + prterr("posix_memalign for buf");
> > + return NULL;
> > + }
> > + memset(buf, '\0', len);
>
> I'm assuming it doesn't matter, but did you want to use buf_size here to
> clear the whole buffer?
I only saw buf being used up to len in the rest of the code so I
didn't think it was necessary, but I also don't feel strongly about
this and am happy to change this to clear the entire buffer if
preferred.
>
> > + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> > + prterr("madvise collapse for buf");
> > + free(buf);
> > + return NULL;
> > + }
> > +
> > + return buf;
> > +}
> > @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> > original_buf = (char *) malloc(maxfilelen);
> > for (i = 0; i < maxfilelen; i++)
> > original_buf[i] = random() % 256;
> > - good_buf = (char *) malloc(maxfilelen + writebdy);
> > - good_buf = round_ptr_up(good_buf, writebdy, 0);
> > - memset(good_buf, '\0', maxfilelen);
> > - temp_buf = (char *) malloc(maxoplen + readbdy);
> > - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > - memset(temp_buf, '\0', maxoplen);
> > + if (hugepages) {
> > + long hugepage_size;
> > +
> > + hugepage_size = get_hugepage_size();
> > + if (hugepage_size == -1) {
> > + prterr("get_hugepage_size()");
> > + exit(99);
> > + }
> > +
> > + if (writebdy != 1 && writebdy != hugepage_size)
> > + prt("ignoring write alignment (since -h is enabled)");
> > +
> > + if (readbdy != 1 && readbdy != hugepage_size)
> > + prt("ignoring read alignment (since -h is enabled)");
>
> I'm a little unclear on what these warnings mean. The alignments are
> still used in the read/write paths afaics. The non-huge mode seems to
> only really care about the max size of the buffers in this code.
>
> If your test doesn't actually use read/write alignments and the goal is
> just to keep things simple, perhaps it would be cleaner to add something
> like an if (hugepages && (writebdy != 1 || readbdy != 1)) check after
> option processing and exit out as an unsupported combination..?
My understanding of the 'writebdy' and 'readbdy' options is that
they're for making reads/writes aligned to the passed-in value, which
depends on the starting address of the buffer being aligned to that
value as well. However for hugepages buffers, they must be aligned to
the system hugepage size (eg 2 MiB) or the madvise(... MADV_COLLAPSE)
call will fail. As such, it is not guaranteed that the requested
alignment will actually be abided by. For that reason, I thought it'd
be useful to print this out to the user so they know requested
alignments will be ignored, but it didn't seem severe enough of an
issue to error out and exit altogether. But maybe it'd be less
confusing for the user if this instead does just error out if the
alignment isn't a multiple of the hugepage size.
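[For what it's worth, a buffer start aligned to the hugepage size is automatically aligned to every smaller power-of-two boundary, so only alignments that don't divide the hugepage size are really lost. A quick sketch, assuming 2 MiB hugepages — both helper names are made up:]

```c
#include <stdint.h>
#include <stdlib.h>

/* Check whether a pointer is aligned to a given boundary. */
static int is_aligned(const void *p, long bdy)
{
	return ((uintptr_t)p % (uintptr_t)bdy) == 0;
}

/*
 * Allocate a buffer aligned to the hugepage size; any power-of-two
 * divisor of that size (e.g. 4 KiB, 512) is then satisfied for free.
 */
static void *alloc_hugepage_aligned(size_t len, size_t hugepage_size)
{
	void *buf = NULL;

	if (posix_memalign(&buf, hugepage_size, len))
		return NULL;
	return buf;
}
```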
>
> BTW, it might also be nice to factor out this whole section of buffer
> initialization code (including original_buf) into an init_buffers() or
> some such. That could be done as a prep patch, but just a suggestion
> either way.
Good idea - I'll do this refactoring for v2.
Thanks,
Joanne
>
> Brian
>
> > +
> > + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> > + if (!good_buf) {
> > + prterr("init_hugepages_buf failed for good_buf");
> > + exit(100);
> > + }
> > +
> > + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> > + if (!temp_buf) {
> > + prterr("init_hugepages_buf failed for temp_buf");
> > + exit(101);
> > + }
> > + } else {
> > + good_buf = (char *) malloc(maxfilelen + writebdy);
> > + good_buf = round_ptr_up(good_buf, writebdy, 0);
> > + memset(good_buf, '\0', maxfilelen);
> > +
> > + temp_buf = (char *) malloc(maxoplen + readbdy);
> > + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > + memset(temp_buf, '\0', maxoplen);
> > + }
> > if (lite) { /* zero entire existing file */
> > ssize_t written;
> >
> > --
> > 2.47.1
> >
> >
>
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-19 22:34 ` Joanne Koong
@ 2024-12-20 11:52 ` Brian Foster
2024-12-23 20:59 ` Joanne Koong
2024-12-27 19:22 ` Joanne Koong
0 siblings, 2 replies; 15+ messages in thread
From: Brian Foster @ 2024-12-20 11:52 UTC (permalink / raw)
To: Joanne Koong; +Cc: fstests, linux-fsdevel, kernel-team
On Thu, Dec 19, 2024 at 02:34:01PM -0800, Joanne Koong wrote:
> On Thu, Dec 19, 2024 at 6:27 AM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote:
> > > Add support for reads/writes from buffers backed by hugepages.
> > > This can be enabled through the '-h' flag. This flag should only be used
> > > on systems where THP capabilities are enabled.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> >
> > Firstly, thanks for taking the time to add this. This seems like a nice
> > idea. It might be nice to have an extra sentence or two in the commit
> > log on the purpose/motivation. For example, has this been used to detect
> > a certain class of problem?
>
> Hi Brian,
>
> Thanks for reviewing this. That's a good idea - I'll include the
> sentence from the cover letter to this commit message as well: "This
> is motivated by a recent bug that was due to faulty handling for
> userspace buffers backed by hugepages."
>
Thanks. Got a link or anything, for my own curiosity?
Also, I presume the followup fstest is a reproducer?
> >
> > A few other quick comments below...
> >
> > > ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> > > 1 file changed, 92 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > > index 41933354..3656fd9f 100644
> > > --- a/ltp/fsx.c
> > > +++ b/ltp/fsx.c
> > > @@ -190,6 +190,7 @@ int o_direct; /* -Z */
> > > int aio = 0;
> > > +}
> > > +
> > > +static void *
> > > +init_hugepages_buf(unsigned len, long hugepage_size)
> > > +{
> > > + void *buf;
> > > + long buf_size = roundup(len, hugepage_size);
> > > +
> > > + if (posix_memalign(&buf, hugepage_size, buf_size)) {
> > > + prterr("posix_memalign for buf");
> > > + return NULL;
> > > + }
> > > + memset(buf, '\0', len);
> >
> > I'm assuming it doesn't matter, but did you want to use buf_size here to
> > clear the whole buffer?
>
> I only saw buf being used up to len in the rest of the code so I
> didn't think it was necessary, but I also don't feel strongly about
> this and am happy to change this to clear the entire buffer if
> preferred.
>
Yeah.. at first it looked like a bug to me, then I realized the same
thing later. I suspect it might be wise to just clear it entirely to
avoid any future landmines, but that could just be my internal bias
talking too. No big deal either way.
> >
> > > + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> > > + prterr("madvise collapse for buf");
> > > + free(buf);
> > > + return NULL;
> > > + }
> > > +
> > > + return buf;
> > > +}
> > > @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> > > original_buf = (char *) malloc(maxfilelen);
> > > for (i = 0; i < maxfilelen; i++)
> > > original_buf[i] = random() % 256;
> > > - good_buf = (char *) malloc(maxfilelen + writebdy);
> > > - good_buf = round_ptr_up(good_buf, writebdy, 0);
> > > - memset(good_buf, '\0', maxfilelen);
> > > - temp_buf = (char *) malloc(maxoplen + readbdy);
> > > - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > > - memset(temp_buf, '\0', maxoplen);
> > > + if (hugepages) {
> > > + long hugepage_size;
> > > +
> > > + hugepage_size = get_hugepage_size();
> > > + if (hugepage_size == -1) {
> > > + prterr("get_hugepage_size()");
> > > + exit(99);
> > > + }
> > > +
> > > + if (writebdy != 1 && writebdy != hugepage_size)
> > > + prt("ignoring write alignment (since -h is enabled)");
> > > +
> > > + if (readbdy != 1 && readbdy != hugepage_size)
> > > + prt("ignoring read alignment (since -h is enabled)");
> >
> > I'm a little unclear on what these warnings mean. The alignments are
> > still used in the read/write paths afaics. The non-huge mode seems to
> > only really care about the max size of the buffers in this code.
> >
> > If your test doesn't actually use read/write alignments and the goal is
> > just to keep things simple, perhaps it would be cleaner to add something
> > like an if (hugepages && (writebdy != 1 || readbdy != 1)) check after
> > option processing and exit out as an unsupported combination..?
>
> My understanding of the 'writebdy' and 'readbdy' options are that
> they're for making reads/writes aligned to the passed-in value, which
> depends on the starting address of the buffer being aligned to that
> value as well. However for hugepages buffers, they must be aligned to
> the system hugepage size (eg 2 MiB) or the madvise(... MADV_COLLAPSE)
> call will fail. As such, it is not guaranteed that the requested
> alignment will actually be abided by. For that reason, I thought it'd
> be useful to print this out to the user so they know requested
> alignments will be ignored, but it didn't seem severe enough of an
> issue to error out and exit altogether. But maybe it'd be less
> confusing for the user if this instead does just error out if the
> alignment isn't a multiple of the hugepage size.
>
Ahh, I see. I missed the round_ptr_up() adjustments. That makes more
sense now.
IMO it would be a little cleaner to just bail out earlier as such. But
either way, I suppose if you could add a small comment with this
alignment context you've explained above with the error checks then that
is good enough for me. Thanks!
Brian
> >
> > BTW, it might also be nice to factor out this whole section of buffer
> > initialization code (including original_buf) into an init_buffers() or
> > some such. That could be done as a prep patch, but just a suggestion
> > either way.
>
> Good idea - i'll do this refactoring for v2.
>
>
> Thanks,
> Joanne
> >
> > Brian
> >
> > > +
> > > + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> > > + if (!good_buf) {
> > > + prterr("init_hugepages_buf failed for good_buf");
> > > + exit(100);
> > > + }
> > > +
> > > + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> > > + if (!temp_buf) {
> > > + prterr("init_hugepages_buf failed for temp_buf");
> > > + exit(101);
> > > + }
> > > + } else {
> > > + good_buf = (char *) malloc(maxfilelen + writebdy);
> > > + good_buf = round_ptr_up(good_buf, writebdy, 0);
> > > + memset(good_buf, '\0', maxfilelen);
> > > +
> > > + temp_buf = (char *) malloc(maxoplen + readbdy);
> > > + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > > + memset(temp_buf, '\0', maxoplen);
> > > + }
> > > if (lite) { /* zero entire existing file */
> > > ssize_t written;
> > >
> > > --
> > > 2.47.1
> > >
> > >
> >
>
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-20 11:52 ` Brian Foster
@ 2024-12-23 20:59 ` Joanne Koong
2024-12-27 19:22 ` Joanne Koong
1 sibling, 0 replies; 15+ messages in thread
From: Joanne Koong @ 2024-12-23 20:59 UTC (permalink / raw)
To: Brian Foster; +Cc: fstests, linux-fsdevel, kernel-team
On Fri, Dec 20, 2024 at 3:50 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Dec 19, 2024 at 02:34:01PM -0800, Joanne Koong wrote:
> > On Thu, Dec 19, 2024 at 6:27 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote:
> > > > Add support for reads/writes from buffers backed by hugepages.
> > > > This can be enabled through the '-h' flag. This flag should only be used
> > > > on systems where THP capabilities are enabled.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > >
> > > Firstly, thanks for taking the time to add this. This seems like a nice
> > > idea. It might be nice to have an extra sentence or two in the commit
> > > log on the purpose/motivation. For example, has this been used to detect
> > > a certain class of problem?
> >
> > Hi Brian,
> >
> > Thanks for reviewing this. That's a good idea - I'll include the
> > sentence from the cover letter to this commit message as well: "This
> > is motivated by a recent bug that was due to faulty handling for
> > userspace buffers backed by hugepages."
> >
>
> Thanks. Got a link or anything, for my own curiosity?
>
> Also, I presume the followup fstest is a reproducer?
This is the link to the bug and the ensuing discussion:
https://lore.kernel.org/linux-fsdevel/p3iss6hssbvtdutnwmuddvdadubrhfkdoosgmbewvo674f7f3y@cwnwffjqltzw/
The tl;dr is that even if a filesystem does not support hugepages, it
could still encounter hugepages in the direct io case for large folios
backing userspace buffers. I missed this in commit 3b97c3652d91, which
resulted in incorrect data being forwarded in fuse. There's currently
no fstest checking against hugepages-backed userspace buffers, so this
patchset is a followup for this case and would have caught the bug.
>
> > >
> > > A few other quick comments below...
> > >
> > > > ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> > > > 1 file changed, 92 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > > > index 41933354..3656fd9f 100644
> > > > --- a/ltp/fsx.c
> > > > +++ b/ltp/fsx.c
> > > > @@ -190,6 +190,7 @@ int o_direct; /* -Z */
> > > > int aio = 0;
> > > > +}
> > > > +
> > > > +static void *
> > > > +init_hugepages_buf(unsigned len, long hugepage_size)
> > > > +{
> > > > + void *buf;
> > > > + long buf_size = roundup(len, hugepage_size);
> > > > +
> > > > + if (posix_memalign(&buf, hugepage_size, buf_size)) {
> > > > + prterr("posix_memalign for buf");
> > > > + return NULL;
> > > > + }
> > > > + memset(buf, '\0', len);
> > >
> > > I'm assuming it doesn't matter, but did you want to use buf_size here to
> > > clear the whole buffer?
> >
> > I only saw buf being used up to len in the rest of the code so I
> > didn't think it was necessary, but I also don't feel strongly about
> > this and am happy to change this to clear the entire buffer if
> > preferred.
> >
>
> Yeah.. at first it looked like a bug to me, then I realized the same
> thing later. I suspect it might be wise to just clear it entirely to
> avoid any future landmines, but that could just be my internal bias
> talking too. No big deal either way.
>
Sounds great, I'll clear it entirely in v2.
> > >
> > > > + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> > > > + prterr("madvise collapse for buf");
> > > > + free(buf);
> > > > + return NULL;
> > > > + }
> > > > +
> > > > + return buf;
> > > > +}
> > > > @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> > > > original_buf = (char *) malloc(maxfilelen);
> > > > for (i = 0; i < maxfilelen; i++)
> > > > original_buf[i] = random() % 256;
> > > > - good_buf = (char *) malloc(maxfilelen + writebdy);
> > > > - good_buf = round_ptr_up(good_buf, writebdy, 0);
> > > > - memset(good_buf, '\0', maxfilelen);
> > > > - temp_buf = (char *) malloc(maxoplen + readbdy);
> > > > - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > > > - memset(temp_buf, '\0', maxoplen);
> > > > + if (hugepages) {
> > > > + long hugepage_size;
> > > > +
> > > > + hugepage_size = get_hugepage_size();
> > > > + if (hugepage_size == -1) {
> > > > + prterr("get_hugepage_size()");
> > > > + exit(99);
> > > > + }
> > > > +
> > > > + if (writebdy != 1 && writebdy != hugepage_size)
> > > > + prt("ignoring write alignment (since -h is enabled)");
> > > > +
> > > > + if (readbdy != 1 && readbdy != hugepage_size)
> > > > + prt("ignoring read alignment (since -h is enabled)");
> > >
> > > I'm a little unclear on what these warnings mean. The alignments are
> > > still used in the read/write paths afaics. The non-huge mode seems to
> > > only really care about the max size of the buffers in this code.
> > >
> > > If your test doesn't actually use read/write alignments and the goal is
> > > just to keep things simple, perhaps it would be cleaner to add something
> > > like an if (hugepages && (writebdy != 1 || readbdy != 1)) check after
> > > option processing and exit out as an unsupported combination..?
> >
> > My understanding of the 'writebdy' and 'readbdy' options is that
> > they're for making reads/writes aligned to the passed-in value, which
> > depends on the starting address of the buffer being aligned to that
> > value as well. However for hugepages buffers, they must be aligned to
> > the system hugepage size (eg 2 MiB) or the madvise(... MADV_COLLAPSE)
> > call will fail. As such, it is not guaranteed that the requested
> > alignment will actually be abided by. For that reason, I thought it'd
> > be useful to print this out to the user so they know requested
> > alignments will be ignored, but it didn't seem severe enough of an
> > issue to error out and exit altogether. But maybe it'd be less
> > confusing for the user if this instead does just error out if the
> > alignment isn't a multiple of the hugepage size.
> >
>
> Ahh, I see. I missed the round_ptr_up() adjustments. That makes more
> sense now.
>
> IMO it would be a little cleaner to just bail out earlier as such. But
> either way, I suppose if you could add a small comment with this
> alignment context you've explained above with the error checks then that
> is good enough for me. Thanks!
>
Sounds great, will do for v2.
Thanks,
Joanne
> Brian
>
> > >
> > > BTW, it might also be nice to factor out this whole section of buffer
> > > initialization code (including original_buf) into an init_buffers() or
> > > some such. That could be done as a prep patch, but just a suggestion
> > > either way.
> >
> > Good idea - I'll do this refactoring for v2.
> >
> >
> > Thanks,
> > Joanne
> > >
> > > Brian
> > >
> > > > +
> > > > + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> > > > + if (!good_buf) {
> > > > + prterr("init_hugepages_buf failed for good_buf");
> > > > + exit(100);
> > > > + }
> > > > +
> > > > + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> > > > + if (!temp_buf) {
> > > > + prterr("init_hugepages_buf failed for temp_buf");
> > > > + exit(101);
> > > > + }
> > > > + } else {
> > > > + good_buf = (char *) malloc(maxfilelen + writebdy);
> > > > + good_buf = round_ptr_up(good_buf, writebdy, 0);
> > > > + memset(good_buf, '\0', maxfilelen);
> > > > +
> > > > + temp_buf = (char *) malloc(maxoplen + readbdy);
> > > > + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > > > + memset(temp_buf, '\0', maxoplen);
> > > > + }
> > > > if (lite) { /* zero entire existing file */
> > > > ssize_t written;
> > > >
> > > > --
> > > > 2.47.1
> > > >
> > > >
> > >
> >
>
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-20 11:52 ` Brian Foster
2024-12-23 20:59 ` Joanne Koong
@ 2024-12-27 19:22 ` Joanne Koong
1 sibling, 0 replies; 15+ messages in thread
From: Joanne Koong @ 2024-12-27 19:22 UTC (permalink / raw)
To: Brian Foster; +Cc: fstests, linux-fsdevel, kernel-team
On Fri, Dec 20, 2024 at 3:50 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Dec 19, 2024 at 02:34:01PM -0800, Joanne Koong wrote:
> > On Thu, Dec 19, 2024 at 6:27 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote:
> > > > Add support for reads/writes from buffers backed by hugepages.
> > > > This can be enabled through the '-h' flag. This flag should only be used
> > > > on systems where THP capabilities are enabled.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > >
> > > Firstly, thanks for taking the time to add this. This seems like a nice
> > > idea. It might be nice to have an extra sentence or two in the commit
> > > log on the purpose/motivation. For example, has this been used to detect
> > > a certain class of problem?
> >
> > Hi Brian,
> >
> > Thanks for reviewing this. That's a good idea - I'll include the
> > sentence from the cover letter to this commit message as well: "This
> > is motivated by a recent bug that was due to faulty handling for
> > userspace buffers backed by hugepages."
> >
>
> Thanks. Got a link or anything, for my own curiosity?
>
> Also, I presume the followup fstest is a reproducer?
>
> > >
> > > A few other quick comments below...
> > >
> > > > ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> > > > 1 file changed, 92 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > > > index 41933354..3656fd9f 100644
> > > > --- a/ltp/fsx.c
> > > > +++ b/ltp/fsx.c
> > > > @@ -190,6 +190,7 @@ int o_direct; /* -Z */
> > > > int aio = 0;
> > > > +}
> > > > +
> > > > +static void *
> > > > +init_hugepages_buf(unsigned len, long hugepage_size)
> > > > +{
> > > > + void *buf;
> > > > + long buf_size = roundup(len, hugepage_size);
> > > > +
> > > > + if (posix_memalign(&buf, hugepage_size, buf_size)) {
> > > > + prterr("posix_memalign for buf");
> > > > + return NULL;
> > > > + }
> > > > + memset(buf, '\0', len);
> > >
> > > I'm assuming it doesn't matter, but did you want to use buf_size here to
> > > clear the whole buffer?
> >
> > I only saw buf being used up to len in the rest of the code so I
> > didn't think it was necessary, but I also don't feel strongly about
> > this and am happy to change this to clear the entire buffer if
> > preferred.
> >
>
> Yeah.. at first it looked like a bug to me, then I realized the same
> thing later. I suspect it might be wise to just clear it entirely to
> avoid any future landmines, but that could just be my internal bias
> talking too. No big deal either way.
>
> > >
> > > > + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> > > > + prterr("madvise collapse for buf");
> > > > + free(buf);
> > > > + return NULL;
> > > > + }
> > > > +
> > > > + return buf;
> > > > +}
> > > > @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> > > > original_buf = (char *) malloc(maxfilelen);
> > > > for (i = 0; i < maxfilelen; i++)
> > > > original_buf[i] = random() % 256;
> > > > - good_buf = (char *) malloc(maxfilelen + writebdy);
> > > > - good_buf = round_ptr_up(good_buf, writebdy, 0);
> > > > - memset(good_buf, '\0', maxfilelen);
> > > > - temp_buf = (char *) malloc(maxoplen + readbdy);
> > > > - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > > > - memset(temp_buf, '\0', maxoplen);
> > > > + if (hugepages) {
> > > > + long hugepage_size;
> > > > +
> > > > + hugepage_size = get_hugepage_size();
> > > > + if (hugepage_size == -1) {
> > > > + prterr("get_hugepage_size()");
> > > > + exit(99);
> > > > + }
> > > > +
> > > > + if (writebdy != 1 && writebdy != hugepage_size)
> > > > + prt("ignoring write alignment (since -h is enabled)");
> > > > +
> > > > + if (readbdy != 1 && readbdy != hugepage_size)
> > > > + prt("ignoring read alignment (since -h is enabled)");
> > >
> > > I'm a little unclear on what these warnings mean. The alignments are
> > > still used in the read/write paths afaics. The non-huge mode seems to
> > > only really care about the max size of the buffers in this code.
> > >
> > > If your test doesn't actually use read/write alignments and the goal is
> > > just to keep things simple, perhaps it would be cleaner to add something
> > > like an if (hugepages && (writebdy != 1 || readbdy != 1)) check after
> > > option processing and exit out as an unsupported combination..?
> >
> > My understanding of the 'writebdy' and 'readbdy' options is that
> > they're for making reads/writes aligned to the passed-in value, which
> > depends on the starting address of the buffer being aligned to that
> > value as well. However for hugepages buffers, they must be aligned to
> > the system hugepage size (eg 2 MiB) or the madvise(... MADV_COLLAPSE)
> > call will fail. As such, it is not guaranteed that the requested
> > alignment will actually be abided by. For that reason, I thought it'd
> > be useful to print this out to the user so they know requested
> > alignments will be ignored, but it didn't seem severe enough of an
> > issue to error out and exit altogether. But maybe it'd be less
> > confusing for the user if this instead does just error out if the
> > alignment isn't a multiple of the hugepage size.
> >
>
> Ahh, I see. I missed the round_ptr_up() adjustments. That makes more
> sense now.
For v2, I'll integrate writebdy and readbdy into the hugepage
buffers. I mistakenly thought that madvise(..., MADV_COLLAPSE)
requires a buffer size that is aligned to the hugepage size, but
AFAICT, it only requires that the buffer size is at least as large as
the hugepage size.
>
> IMO it would be a little cleaner to just bail out earlier as such. But
> either way, I suppose if you could add a small comment with this
> alignment context you've explained above with the error checks then that
> is good enough for me. Thanks!
>
> Brian
>
> > >
> > > BTW, it might also be nice to factor out this whole section of buffer
> > > initialization code (including original_buf) into an init_buffers() or
> > > some such. That could be done as a prep patch, but just a suggestion
> > > either way.
> >
> > Good idea - i'll do this refactoring for v2.
> >
> >
> > Thanks,
> > Joanne
> > >
> > > Brian
> > >
> > > > +
> > > > --
> > > > 2.47.1
> > > >
> > > >
> > >
> >
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-18 21:01 ` [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages Joanne Koong
2024-12-19 14:29 ` Brian Foster
@ 2024-12-19 17:51 ` Darrick J. Wong
2024-12-19 22:51 ` Joanne Koong
2024-12-20 7:47 ` Nirjhar Roy
2 siblings, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2024-12-19 17:51 UTC (permalink / raw)
To: Joanne Koong; +Cc: fstests, linux-fsdevel, kernel-team
On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote:
> Add support for reads/writes from buffers backed by hugepages.
> This can be enabled through the '-h' flag. This flag should only be used
> on systems where THP capabilities are enabled.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 92 insertions(+), 8 deletions(-)
>
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index 41933354..3656fd9f 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -190,6 +190,7 @@ int o_direct; /* -Z */
> int aio = 0;
> int uring = 0;
> int mark_nr = 0;
> +int hugepages = 0; /* -h flag */
>
> int page_size;
> int page_mask;
> @@ -2471,7 +2472,7 @@ void
> usage(void)
> {
> fprintf(stdout, "usage: %s",
> - "fsx [-dfknqxyzBEFHIJKLORWXZ0]\n\
> + "fsx [-dfhknqxyzBEFHIJKLORWXZ0]\n\
> [-b opnum] [-c Prob] [-g filldata] [-i logdev] [-j logid]\n\
> [-l flen] [-m start:end] [-o oplen] [-p progressinterval]\n\
> [-r readbdy] [-s style] [-t truncbdy] [-w writebdy]\n\
> @@ -2484,6 +2485,7 @@ usage(void)
> -e: pollute post-eof on size changes (default 0)\n\
> -f: flush and invalidate cache after I/O\n\
> -g X: write character X instead of random generated data\n\
> + -h hugepages: use buffers backed by hugepages for reads/writes\n\
> -i logdev: do integrity testing, logdev is the dm log writes device\n\
> -j logid: prefix debug log messages with this id\n\
> -k: do not truncate existing file and use its size as upper bound on file size\n\
> @@ -2856,6 +2858,56 @@ keep_running(void)
> return numops-- != 0;
> }
>
> +static long
> +get_hugepage_size(void)
> +{
> + const char *str = "Hugepagesize:";
> + long hugepage_size = -1;
> + char buffer[64];
> + FILE *file;
> +
> + file = fopen("/proc/meminfo", "r");
> + if (!file) {
> + prterr("get_hugepage_size: fopen /proc/meminfo");
> + return -1;
> + }
> + while (fgets(buffer, sizeof(buffer), file)) {
> + if (strncmp(buffer, str, strlen(str)) == 0) {
> + sscanf(buffer + strlen(str), "%ld", &hugepage_size);
> + break;
> + }
> + }
> + fclose(file);
> + if (hugepage_size == -1) {
> + prterr("get_hugepage_size: failed to find "
> + "hugepage size in /proc/meminfo\n");
> + return -1;
> + }
> +
> + /* convert from KiB to bytes */
> + return hugepage_size * 1024;
> +}
> +
> +static void *
> +init_hugepages_buf(unsigned len, long hugepage_size)
> +{
> + void *buf;
> + long buf_size = roundup(len, hugepage_size);
> +
> + if (posix_memalign(&buf, hugepage_size, buf_size)) {
> + prterr("posix_memalign for buf");
> + return NULL;
> + }
> + memset(buf, '\0', len);
> + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> + prterr("madvise collapse for buf");
> + free(buf);
> + return NULL;
> + }
> +
> + return buf;
> +}
> +
> static struct option longopts[] = {
> {"replay-ops", required_argument, 0, 256},
> {"record-ops", optional_argument, 0, 255},
> @@ -2883,7 +2935,7 @@ main(int argc, char **argv)
> setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
>
> while ((ch = getopt_long(argc, argv,
> - "0b:c:de:fg:i:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> + "0b:c:de:fg:hi:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> longopts, NULL)) != EOF)
> switch (ch) {
> case 'b':
> @@ -2916,6 +2968,9 @@ main(int argc, char **argv)
> case 'g':
> filldata = *optarg;
> break;
> + case 'h':
> + hugepages = 1;
> + break;
> case 'i':
> integrity = 1;
> logdev = strdup(optarg);
> @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> original_buf = (char *) malloc(maxfilelen);
> for (i = 0; i < maxfilelen; i++)
> original_buf[i] = random() % 256;
> - good_buf = (char *) malloc(maxfilelen + writebdy);
> - good_buf = round_ptr_up(good_buf, writebdy, 0);
> - memset(good_buf, '\0', maxfilelen);
> - temp_buf = (char *) malloc(maxoplen + readbdy);
> - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> - memset(temp_buf, '\0', maxoplen);
> + if (hugepages) {
> + long hugepage_size;
> +
> + hugepage_size = get_hugepage_size();
> + if (hugepage_size == -1) {
> + prterr("get_hugepage_size()");
> + exit(99);
> + }
> +
> + if (writebdy != 1 && writebdy != hugepage_size)
> + prt("ignoring write alignment (since -h is enabled)");
> +
> + if (readbdy != 1 && readbdy != hugepage_size)
> + prt("ignoring read alignment (since -h is enabled)");
What if readbdy is a multiple of the hugepage size?
> + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> + if (!good_buf) {
> + prterr("init_hugepages_buf failed for good_buf");
> + exit(100);
> + }
Why is it necessary for the good_buf to be backed by a hugepage?
I thought good_buf was only used to compare file contents?
--D
> +
> + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> + if (!temp_buf) {
> + prterr("init_hugepages_buf failed for temp_buf");
> + exit(101);
> + }
> + } else {
> + good_buf = (char *) malloc(maxfilelen + writebdy);
> + good_buf = round_ptr_up(good_buf, writebdy, 0);
> + memset(good_buf, '\0', maxfilelen);
> +
> + temp_buf = (char *) malloc(maxoplen + readbdy);
> + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> + memset(temp_buf, '\0', maxoplen);
> + }
> if (lite) { /* zero entire existing file */
> ssize_t written;
>
> --
> 2.47.1
>
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-19 17:51 ` Darrick J. Wong
@ 2024-12-19 22:51 ` Joanne Koong
0 siblings, 0 replies; 15+ messages in thread
From: Joanne Koong @ 2024-12-19 22:51 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: fstests, linux-fsdevel, kernel-team
On Thu, Dec 19, 2024 at 9:51 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote:
> > Add support for reads/writes from buffers backed by hugepages.
> > This can be enabled through the '-h' flag. This flag should only be used
> > on systems where THP capabilities are enabled.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 92 insertions(+), 8 deletions(-)
> >
> > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > index 41933354..3656fd9f 100644
> > --- a/ltp/fsx.c
> > +++ b/ltp/fsx.c
> > +
> > static struct option longopts[] = {
> > {"replay-ops", required_argument, 0, 256},
> > {"record-ops", optional_argument, 0, 255},
> > @@ -2883,7 +2935,7 @@ main(int argc, char **argv)
> > setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
> >
> > while ((ch = getopt_long(argc, argv,
> > - "0b:c:de:fg:i:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> > + "0b:c:de:fg:hi:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> > longopts, NULL)) != EOF)
> > switch (ch) {
> > case 'b':
> > @@ -2916,6 +2968,9 @@ main(int argc, char **argv)
> > case 'g':
> > filldata = *optarg;
> > break;
> > + case 'h':
> > + hugepages = 1;
> > + break;
> > case 'i':
> > integrity = 1;
> > logdev = strdup(optarg);
> > @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> > original_buf = (char *) malloc(maxfilelen);
> > for (i = 0; i < maxfilelen; i++)
> > original_buf[i] = random() % 256;
> > - good_buf = (char *) malloc(maxfilelen + writebdy);
> > - good_buf = round_ptr_up(good_buf, writebdy, 0);
> > - memset(good_buf, '\0', maxfilelen);
> > - temp_buf = (char *) malloc(maxoplen + readbdy);
> > - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > - memset(temp_buf, '\0', maxoplen);
> > + if (hugepages) {
> > + long hugepage_size;
> > +
> > + hugepage_size = get_hugepage_size();
> > + if (hugepage_size == -1) {
> > + prterr("get_hugepage_size()");
> > + exit(99);
> > + }
> > +
> > + if (writebdy != 1 && writebdy != hugepage_size)
> > + prt("ignoring write alignment (since -h is enabled)");
> > +
> > + if (readbdy != 1 && readbdy != hugepage_size)
> > + prt("ignoring read alignment (since -h is enabled)");
>
> What if readbdy is a multiple of the hugepage size?
Good point, the user could potentially request an alignment that's a
multiple. I'll account for this in v2.
>
> > + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> > + if (!good_buf) {
> > + prterr("init_hugepages_buf failed for good_buf");
> > + exit(100);
> > + }
>
> Why is it necessary for the good_buf to be backed by a hugepage?
> I thought good_buf was only used to compare file contents?
good_buf is also used as the source buffer for writes in dowrite().
Thanks,
Joanne
>
> --D
>
> > +
> > + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> > + if (!temp_buf) {
> > + prterr("init_hugepages_buf failed for temp_buf");
> > + exit(101);
> > + }
> > + } else {
> > + good_buf = (char *) malloc(maxfilelen + writebdy);
> > + good_buf = round_ptr_up(good_buf, writebdy, 0);
> > + memset(good_buf, '\0', maxfilelen);
> > +
> > + temp_buf = (char *) malloc(maxoplen + readbdy);
> > + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > + memset(temp_buf, '\0', maxoplen);
> > + }
> > if (lite) { /* zero entire existing file */
> > ssize_t written;
> >
> > --
> > 2.47.1
> >
> >
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-18 21:01 ` [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages Joanne Koong
2024-12-19 14:29 ` Brian Foster
2024-12-19 17:51 ` Darrick J. Wong
@ 2024-12-20 7:47 ` Nirjhar Roy
2024-12-23 21:07 ` Joanne Koong
2 siblings, 1 reply; 15+ messages in thread
From: Nirjhar Roy @ 2024-12-20 7:47 UTC (permalink / raw)
To: Joanne Koong, fstests; +Cc: linux-fsdevel, kernel-team
On Wed, 2024-12-18 at 13:01 -0800, Joanne Koong wrote:
> Add support for reads/writes from buffers backed by hugepages.
> This can be enabled through the '-h' flag. This flag should only be used
> on systems where THP capabilities are enabled.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
> ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 92 insertions(+), 8 deletions(-)
>
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index 41933354..3656fd9f 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -190,6 +190,7 @@ int o_direct; /* -Z */
> int aio = 0;
> int uring = 0;
> int mark_nr = 0;
> +int hugepages = 0; /* -h flag */
>
> int page_size;
> int page_mask;
> @@ -2471,7 +2472,7 @@ void
> usage(void)
> {
> fprintf(stdout, "usage: %s",
> - "fsx [-dfknqxyzBEFHIJKLORWXZ0]\n\
> + "fsx [-dfhknqxyzBEFHIJKLORWXZ0]\n\
> [-b opnum] [-c Prob] [-g filldata] [-i logdev] [-j logid]\n\
> [-l flen] [-m start:end] [-o oplen] [-p progressinterval]\n\
> [-r readbdy] [-s style] [-t truncbdy] [-w writebdy]\n\
> @@ -2484,6 +2485,7 @@ usage(void)
> -e: pollute post-eof on size changes (default 0)\n\
> -f: flush and invalidate cache after I/O\n\
> -g X: write character X instead of random generated data\n\
> + -h hugepages: use buffers backed by hugepages for reads/writes\n\
> -i logdev: do integrity testing, logdev is the dm log writes device\n\
> -j logid: prefix debug log messages with this id\n\
> -k: do not truncate existing file and use its size as upper bound on file size\n\
> @@ -2856,6 +2858,56 @@ keep_running(void)
> return numops-- != 0;
> }
>
> +static long
> +get_hugepage_size(void)
> +{
> + const char *str = "Hugepagesize:";
> + long hugepage_size = -1;
> + char buffer[64];
> + FILE *file;
> +
> + file = fopen("/proc/meminfo", "r");
> + if (!file) {
> + prterr("get_hugepage_size: fopen /proc/meminfo");
> + return -1;
> + }
> + while (fgets(buffer, sizeof(buffer), file)) {
> + if (strncmp(buffer, str, strlen(str)) == 0) {
Extremely minor: Since str is a fixed string, why not calculate the
length outside the loop and not re-use strlen(str) multiple times?
> + sscanf(buffer + strlen(str), "%ld", &hugepage_size);
> + break;
> + }
> + }
> + fclose(file);
> + if (hugepage_size == -1) {
> + prterr("get_hugepage_size: failed to find "
> + "hugepage size in /proc/meminfo\n");
> + return -1;
> + }
> +
> + /* convert from KiB to bytes */
> + return hugepage_size * 1024;
Minor: << 10 might be faster instead of '*' ?
> +}
> +
> +static void *
> +init_hugepages_buf(unsigned len, long hugepage_size)
> +{
> + void *buf;
> + long buf_size = roundup(len, hugepage_size);
> +
> + if (posix_memalign(&buf, hugepage_size, buf_size)) {
> + prterr("posix_memalign for buf");
> + return NULL;
> + }
> + memset(buf, '\0', len);
> + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> + prterr("madvise collapse for buf");
> + free(buf);
> + return NULL;
> + }
> +
> + return buf;
> +}
> +
> static struct option longopts[] = {
> {"replay-ops", required_argument, 0, 256},
> {"record-ops", optional_argument, 0, 255},
> @@ -2883,7 +2935,7 @@ main(int argc, char **argv)
> setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
>
> while ((ch = getopt_long(argc, argv,
> - "0b:c:de:fg:i:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> + "0b:c:de:fg:hi:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> longopts, NULL)) != EOF)
> switch (ch) {
> case 'b':
> @@ -2916,6 +2968,9 @@ main(int argc, char **argv)
> case 'g':
> filldata = *optarg;
> break;
> + case 'h':
> + hugepages = 1;
> + break;
> case 'i':
> integrity = 1;
> logdev = strdup(optarg);
> @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> original_buf = (char *) malloc(maxfilelen);
> for (i = 0; i < maxfilelen; i++)
> original_buf[i] = random() % 256;
> - good_buf = (char *) malloc(maxfilelen + writebdy);
> - good_buf = round_ptr_up(good_buf, writebdy, 0);
> - memset(good_buf, '\0', maxfilelen);
> - temp_buf = (char *) malloc(maxoplen + readbdy);
> - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> - memset(temp_buf, '\0', maxoplen);
> + if (hugepages) {
> + long hugepage_size;
> +
> + hugepage_size = get_hugepage_size();
> + if (hugepage_size == -1) {
> + prterr("get_hugepage_size()");
> + exit(99);
> + }
> +
> + if (writebdy != 1 && writebdy != hugepage_size)
> + prt("ignoring write alignment (since -h is enabled)");
> +
> + if (readbdy != 1 && readbdy != hugepage_size)
> + prt("ignoring read alignment (since -h is enabled)");
> +
> + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> + if (!good_buf) {
> + prterr("init_hugepages_buf failed for good_buf");
> + exit(100);
> + }
> +
> + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> + if (!temp_buf) {
> + prterr("init_hugepages_buf failed for temp_buf");
> + exit(101);
> + }
> + } else {
> + good_buf = (char *) malloc(maxfilelen + writebdy);
> + good_buf = round_ptr_up(good_buf, writebdy, 0);
Not sure if it matters, but aren't we introducing a small memory leak
here, since good_buf's original pointer is lost after rounding up?
> + memset(good_buf, '\0', maxfilelen);
> +
> + temp_buf = (char *) malloc(maxoplen + readbdy);
> + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> + memset(temp_buf, '\0', maxoplen);
> + }
> if (lite) { /* zero entire existing file */
> ssize_t written;
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-20 7:47 ` Nirjhar Roy
@ 2024-12-23 21:07 ` Joanne Koong
2024-12-24 5:09 ` Nirjhar Roy
0 siblings, 1 reply; 15+ messages in thread
From: Joanne Koong @ 2024-12-23 21:07 UTC (permalink / raw)
To: Nirjhar Roy; +Cc: fstests, linux-fsdevel, kernel-team
On Thu, Dec 19, 2024 at 11:47 PM Nirjhar Roy <nirjhar@linux.ibm.com> wrote:
>
> On Wed, 2024-12-18 at 13:01 -0800, Joanne Koong wrote:
> > Add support for reads/writes from buffers backed by hugepages.
> > This can be enabled through the '-h' flag. This flag should only be
> > used
> > on systems where THP capabilities are enabled.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> > ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 92 insertions(+), 8 deletions(-)
> >
> > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > index 41933354..3656fd9f 100644
> > --- a/ltp/fsx.c
> > +++ b/ltp/fsx.c
> > @@ -190,6 +190,7 @@ int o_direct; /* -Z */
> > +static long
> > +get_hugepage_size(void)
> > +{
> > + const char *str = "Hugepagesize:";
> > + long hugepage_size = -1;
> > + char buffer[64];
> > + FILE *file;
> > +
> > + file = fopen("/proc/meminfo", "r");
> > + if (!file) {
> > + prterr("get_hugepage_size: fopen /proc/meminfo");
> > + return -1;
> > + }
> > + while (fgets(buffer, sizeof(buffer), file)) {
> > + if (strncmp(buffer, str, strlen(str)) == 0) {
> Extremely minor: Since str is a fixed string, why not calculate the
> length outside the loop and not re-use strlen(str) multiple times?
Thinking about this some more, maybe it'd be best to define it as an
array of chars, const char str[] = "Hugepagesize:", and use sizeof,
which is evaluated at compile time instead of at runtime.
I'll do this for v2.
> > + sscanf(buffer + strlen(str), "%ld", &hugepage_size);
> > + break;
> > + }
> > + }
> > + fclose(file);
> > + if (hugepage_size == -1) {
> > + prterr("get_hugepage_size: failed to find "
> > + "hugepage size in /proc/meminfo\n");
> > + return -1;
> > + }
> > +
> > + /* convert from KiB to bytes */
> > + return hugepage_size * 1024;
> Minor: << 10 might be faster instead of '*' ?
Will do for v2.
> > +}
> > +
> > +static void *
> > +init_hugepages_buf(unsigned len, long hugepage_size)
> > +{
> > + void *buf;
> > + long buf_size = roundup(len, hugepage_size);
> > +
> > + if (posix_memalign(&buf, hugepage_size, buf_size)) {
> > + prterr("posix_memalign for buf");
> > + return NULL;
> > + }
> > + memset(buf, '\0', len);
> > + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> > + prterr("madvise collapse for buf");
> > + free(buf);
> > + return NULL;
> > + }
> > +
> > + return buf;
> > +}
> > +
> > static struct option longopts[] = {
> > {"replay-ops", required_argument, 0, 256},
> > {"record-ops", optional_argument, 0, 255},
> > @@ -2883,7 +2935,7 @@ main(int argc, char **argv)
> > setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
> >
> > while ((ch = getopt_long(argc, argv,
> > - "0b:c:de:fg:i:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> > + "0b:c:de:fg:hi:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
> > longopts, NULL)) != EOF)
> > switch (ch) {
> > case 'b':
> > @@ -2916,6 +2968,9 @@ main(int argc, char **argv)
> > case 'g':
> > filldata = *optarg;
> > break;
> > + case 'h':
> > + hugepages = 1;
> > + break;
> > case 'i':
> > integrity = 1;
> > logdev = strdup(optarg);
> > @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> > original_buf = (char *) malloc(maxfilelen);
> > for (i = 0; i < maxfilelen; i++)
> > original_buf[i] = random() % 256;
> > - good_buf = (char *) malloc(maxfilelen + writebdy);
> > - good_buf = round_ptr_up(good_buf, writebdy, 0);
> > - memset(good_buf, '\0', maxfilelen);
> > - temp_buf = (char *) malloc(maxoplen + readbdy);
> > - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > - memset(temp_buf, '\0', maxoplen);
> > + if (hugepages) {
> > + long hugepage_size;
> > +
> > + hugepage_size = get_hugepage_size();
> > + if (hugepage_size == -1) {
> > + prterr("get_hugepage_size()");
> > + exit(99);
> > + }
> > +
> > + if (writebdy != 1 && writebdy != hugepage_size)
> > + prt("ignoring write alignment (since -h is enabled)");
> > +
> > + if (readbdy != 1 && readbdy != hugepage_size)
> > + prt("ignoring read alignment (since -h is enabled)");
> > +
> > + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> > + if (!good_buf) {
> > + prterr("init_hugepages_buf failed for good_buf");
> > + exit(100);
> > + }
> > +
> > + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> > + if (!temp_buf) {
> > + prterr("init_hugepages_buf failed for temp_buf");
> > + exit(101);
> > + }
> > + } else {
> > + good_buf = (char *) malloc(maxfilelen + writebdy);
> > + good_buf = round_ptr_up(good_buf, writebdy, 0);
>> Not sure if it matters, but aren't we introducing a small memory leak
>> here, since good_buf's original pointer is lost after rounding up?
This is inherited from the original code but AFAICT, it relies on the
memory being cleaned up at exit time (eg free() is never called on
good_buf and temp_buf either).
Thanks,
Joanne
> > + memset(good_buf, '\0', maxfilelen);
> > +
> > + temp_buf = (char *) malloc(maxoplen + readbdy);
> > + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > + memset(temp_buf, '\0', maxoplen);
> > + }
> > if (lite) { /* zero entire existing file */
> > ssize_t written;
> >
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages
2024-12-23 21:07 ` Joanne Koong
@ 2024-12-24 5:09 ` Nirjhar Roy
0 siblings, 0 replies; 15+ messages in thread
From: Nirjhar Roy @ 2024-12-24 5:09 UTC (permalink / raw)
To: Joanne Koong; +Cc: fstests, linux-fsdevel, kernel-team
On 12/24/24 02:37, Joanne Koong wrote:
> On Thu, Dec 19, 2024 at 11:47 PM Nirjhar Roy <nirjhar@linux.ibm.com> wrote:
>> On Wed, 2024-12-18 at 13:01 -0800, Joanne Koong wrote:
>>> Add support for reads/writes from buffers backed by hugepages.
>>> This can be enabled through the '-h' flag. This flag should only be used
>>> on systems where THP capabilities are enabled.
>>>
>>> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
>>> ---
>>> ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
>>> 1 file changed, 92 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/ltp/fsx.c b/ltp/fsx.c
>>> index 41933354..3656fd9f 100644
>>> --- a/ltp/fsx.c
>>> +++ b/ltp/fsx.c
>>> @@ -190,6 +190,7 @@ int o_direct; /* -Z */
>>> +static long
>>> +get_hugepage_size(void)
>>> +{
>>> + const char *str = "Hugepagesize:";
>>> + long hugepage_size = -1;
>>> + char buffer[64];
>>> + FILE *file;
>>> +
>>> + file = fopen("/proc/meminfo", "r");
>>> + if (!file) {
>>> + prterr("get_hugepage_size: fopen /proc/meminfo");
>>> + return -1;
>>> + }
>>> + while (fgets(buffer, sizeof(buffer), file)) {
>>> + if (strncmp(buffer, str, strlen(str)) == 0) {
>> Extremely minor: Since str is a fixed string, why not calculate the
>> length outside the loop and not re-use strlen(str) multiple times?
> Thinking about this some more, maybe it'd be best to define it as an
> array of chars, const char str[] = "Hugepagesize:", and use sizeof,
> which is evaluated at compile time instead of at runtime.
> I'll do this for v2.
Yes, that is a good idea too. Thanks.
>
>>> + sscanf(buffer + strlen(str), "%ld", &hugepage_size);
>>> + break;
>>> + }
>>> + }
>>> + fclose(file);
>>> + if (hugepage_size == -1) {
>>> + prterr("get_hugepage_size: failed to find "
>>> + "hugepage size in /proc/meminfo\n");
>>> + return -1;
>>> + }
>>> +
>>> + /* convert from KiB to bytes */
>>> + return hugepage_size * 1024;
>> Minor: << 10 might be faster instead of '*' ?
> Will do for v2.
Thanks.
>
>>> +}
>>> +
>>> +static void *
>>> +init_hugepages_buf(unsigned len, long hugepage_size)
>>> +{
>>> + void *buf;
>>> + long buf_size = roundup(len, hugepage_size);
>>> +
>>> + if (posix_memalign(&buf, hugepage_size, buf_size)) {
>>> + prterr("posix_memalign for buf");
>>> + return NULL;
>>> + }
>>> + memset(buf, '\0', len);
>>> + if (madvise(buf, buf_size, MADV_COLLAPSE)) {
>>> + prterr("madvise collapse for buf");
>>> + free(buf);
>>> + return NULL;
>>> + }
>>> +
>>> + return buf;
>>> +}
>>> +
>>> static struct option longopts[] = {
>>> {"replay-ops", required_argument, 0, 256},
>>> {"record-ops", optional_argument, 0, 255},
>>> @@ -2883,7 +2935,7 @@ main(int argc, char **argv)
>>> setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */
>>>
>>> while ((ch = getopt_long(argc, argv,
>>> - "0b:c:de:fg:i:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
>>> + "0b:c:de:fg:hi:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ",
>>> longopts, NULL)) != EOF)
>>> switch (ch) {
>>> case 'b':
>>> @@ -2916,6 +2968,9 @@ main(int argc, char **argv)
>>> case 'g':
>>> filldata = *optarg;
>>> break;
>>> + case 'h':
>>> + hugepages = 1;
>>> + break;
>>> case 'i':
>>> integrity = 1;
>>> logdev = strdup(optarg);
>>> @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
>>> original_buf = (char *) malloc(maxfilelen);
>>> for (i = 0; i < maxfilelen; i++)
>>> original_buf[i] = random() % 256;
>>> - good_buf = (char *) malloc(maxfilelen + writebdy);
>>> - good_buf = round_ptr_up(good_buf, writebdy, 0);
>>> - memset(good_buf, '\0', maxfilelen);
>>> - temp_buf = (char *) malloc(maxoplen + readbdy);
>>> - temp_buf = round_ptr_up(temp_buf, readbdy, 0);
>>> - memset(temp_buf, '\0', maxoplen);
>>> + if (hugepages) {
>>> + long hugepage_size;
>>> +
>>> + hugepage_size = get_hugepage_size();
>>> + if (hugepage_size == -1) {
>>> + prterr("get_hugepage_size()");
>>> + exit(99);
>>> + }
>>> +
>>> + if (writebdy != 1 && writebdy != hugepage_size)
>>> + prt("ignoring write alignment (since -h is enabled)");
>>> +
>>> + if (readbdy != 1 && readbdy != hugepage_size)
>>> + prt("ignoring read alignment (since -h is enabled)");
>>> +
>>> + good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
>>> + if (!good_buf) {
>>> + prterr("init_hugepages_buf failed for good_buf");
>>> + exit(100);
>>> + }
>>> +
>>> + temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
>>> + if (!temp_buf) {
>>> + prterr("init_hugepages_buf failed for temp_buf");
>>> + exit(101);
>>> + }
>>> + } else {
>>> + good_buf = (char *) malloc(maxfilelen + writebdy);
>>> + good_buf = round_ptr_up(good_buf, writebdy, 0);
>> Not sure if it matters, but aren't we introducing a small memory leak
>> here, since good_buf's original pointer is lost after rounding up?
> This is inherited from the original code but AFAICT, it relies on the
> memory being cleaned up at exit time (eg free() is never called on
> good_buf and temp_buf either).
Okay, makes sense.
--
NR
>
>
> Thanks,
> Joanne
>
>>> + memset(good_buf, '\0', maxfilelen);
>>> +
>>> + temp_buf = (char *) malloc(maxoplen + readbdy);
>>> + temp_buf = round_ptr_up(temp_buf, readbdy, 0);
>>> + memset(temp_buf, '\0', maxoplen);
>>> + }
>>> if (lite) { /* zero entire existing file */
>>> ssize_t written;
>>>
--
---
Nirjhar Roy
Linux Kernel Developer
IBM, Bangalore
^ permalink raw reply [flat|nested] 15+ messages in thread