* Re: Raid 5 - not clean and then a failure.
From: Jon Hardcastle @ 2009-08-26 11:02 UTC (permalink / raw)
To: linux-raid, Robin Hill
In-Reply-To: <20090825081617.GA8885@cthulhu.home.robinhill.me.uk>
--- On Tue, 25/8/09, Robin Hill <robin@robinhill.me.uk> wrote:
> From: Robin Hill <robin@robinhill.me.uk>
> Subject: Re: Raid 5 - not clean and then a failure.
> To: linux-raid@vger.kernel.org
> Date: Tuesday, 25 August, 2009, 9:16 AM
> On Tue Aug 25, 2009 at 12:54:49AM -0700, Jon Hardcastle wrote:
>
> > Guys,
> >
> > I have been having some problems with my arrays that I think i have
> > nailed down to a pci controller (well I say that - it is always the
> > drives connected to *a* controller but I have tried 2!) anyway the
> > latest saga is i was trying some new kernel options last night - which
> > didn't work.
> >
> Did they have the same chipset? I had problems with PCI controllers on
> one of my systems, which turned out to be some sort of conflict between
> the onboard chipset and the chipset on the controllers. I found a PCI
> card with a different chipset and have had no issues since.
>
> > But when i booted up again this morning it said one of the drives was
> > in an inconsistent state (not sure of the *exact* error message). I
> > then kicked off an add of the drive and it started syncing. It got
> > about 5% in and then the second drive in on that controller complained
> > and the array failed.
> >
> > Is there any hope for my data? If i get a good controller in there
> > will the resync continue? can I try and tell it to assume the drives
> > are good (which they ought to be)?
> >
> There's definitely hope. You can assemble the array (using the good
> drives and the last drive to fail) using the --force option, then re-add
> (and sync) the other drive (I'd recommend doing a fsck on the filesystem
> as well). I've just had to do a similar thing myself after two drives
> failed (overheated after a fan failure).
>
> Cheers,
> Robin
It worked! I had to force the array to assemble, but it did. I had some more problems with the controller, which I think were ultimately caused by the two VIA controllers conflicting. I think removing them *both* and booting up helped the computer to work out what was going on (I don't know how). I also took the 'minimum guaranteed' speed of the rebuild down to 50MB, as I think the 2 drives on the PCI/150 card were struggling. I am not sure about this, though, as the drive does a 'check' once a week and has only ever failed last weekend. So basically I am not really 100% sure what caused this problem, but I do know I need a more stable way of controlling these additional drives!
On a side note, if a 'repair' does everything a 'check' does but also repairs what it finds, is there any merit in just doing repairs?
Finally, anyone here got a port multiplier working?
-----------------------
N: Jon Hardcastle
E: Jon@eHardcastle.com
'Do not worry about tomorrow, for tomorrow will bring worries of its own.'
-----------------------
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH 1/2] Add an alternative cs89x0 driver
From: Kurt Van Dijck @ 2009-08-26 10:46 UTC (permalink / raw)
To: Sascha Hauer; +Cc: netdev
In-Reply-To: <1240387172-21818-2-git-send-email-s.hauer@pengutronix.de>
Hi Sascha,
I'm using a 2.6.25 kernel.
When converting to your platform_device based driver,
I needed to configure the irq (see patch, irq flags).
Looking at the old cs89x0.c, this is done in the driver. Should I have
configured the irq level elsewhere? Or is this patch valid?
Kurt
Signed-off-by: Kurt Van Dijck <kurt.van.dijck@eia.be>
---
Index: drivers/net/cirrus-cs89x0.c
===================================================================
--- drivers/net/cirrus-cs89x0.c (revision 7107)
+++ drivers/net/cirrus-cs89x0.c (working copy)
@@ -487,7 +487,8 @@
}
/* install interrupt handler */
- result = request_irq(ndev->irq, &cirrus_interrupt, 0, ndev->name, ndev);
+ result = request_irq(ndev->irq, &cirrus_interrupt,
+ IRQF_TRIGGER_HIGH, ndev->name, ndev);
if (result < 0) {
printk(KERN_ERR "%s: could not register interrupt %d\n",
ndev->name, ndev->irq);
* [B.A.T.M.A.N.] Configuration interface...
From: Andrew Lunn @ 2009-08-26 11:01 UTC (permalink / raw)
To: The list for a Better Approach To Mobile Ad-hoc Networking
In-Reply-To: <200908261651.16980.lindner_marek@yahoo.de>
> > > Should /proc/net/batman-adv/interface be replaced with an IOCTL interface
> > > similar to brctl?
> >
> > How to design the kernel<->userspace interface that it doesn't end like
> > wireless-tools?
>
> I'm not so familiar with the iwconfig situation you seem to refer to. Could you
> outline the issues ?
It is complex and horrible. It is also badly implemented by many
wireless devices, so it is inconsistent.
However, batman has a much simpler interface. There is a lot less to
configure:
originator interval
aggregation to enable/disable
vis server/client mode
interfaces to use
So it should be possible to implement a reasonably simple interface. I
think only the list of interfaces is somewhat tricky and needs thinking
about. The rest can be individual files in /sys.
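
As a rough illustration of the "individual files" idea, here is a minimal
Python sketch of how a userspace tool might set and read such values, one
value per file. The file names below are hypothetical (derived from the list
above); no names or location under /sys were agreed in this thread, so a
temporary directory stands in for the real one.

```python
import os
import tempfile

# Hypothetical setting names, taken from the list above; the real file
# names and their location under /sys were not decided in this thread.
SETTINGS = ("originator_interval", "aggregation", "vis_mode")


def write_setting(base_dir, name, value):
    """Write one value to one file, the way a sysfs attribute is set."""
    if name not in SETTINGS:
        raise ValueError("unknown setting: %s" % name)
    with open(os.path.join(base_dir, name), "w") as f:
        f.write("%s\n" % value)


def read_setting(base_dir, name):
    """Read the value back, stripping the trailing newline."""
    with open(os.path.join(base_dir, name)) as f:
        return f.read().strip()


# Demo against a temporary directory standing in for the /sys directory.
demo_dir = tempfile.mkdtemp()
write_setting(demo_dir, "originator_interval", 1000)
write_setting(demo_dir, "aggregation", 1)
print(read_setting(demo_dir, "originator_interval"))
```

The list of interfaces does not fit this one-value-per-file pattern as
neatly, which is presumably why it needs more thought.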
Andrew
* [PATCH] tracing, documentation: Clarifications and corrections to tracepoint-analysis.txt (resend)
From: Mel Gorman @ 2009-08-24 14:32 UTC (permalink / raw)
To: Andrew Morton
Cc: riel, Peter Zijlstra, Li Ming Chun, Jonathan Corbet,
Fernando Carrijo, LKML, linux-mm
In-Reply-To: <20090812131517.GD19269@csn.ul.ie>
I think this patch might have got lost in the noise so am resending just
in case.
This patch makes a number of corrections and clarifications as pointed
out by Jonathan Corbet.
o Listed some required kernel config options
o Spelled out that perf has to be installed from tools/perf
o Mention that tracing_enabled must be set
o Fix numerous minor typos
o Fix tense issues
o Expand on what -c means
Fernando Carrijo also spotted that a library was misnamed libpixmap
instead of libpixman. This patch should be considered a fix to
tracing-documentation-add-a-document-describing-how-to-do-some-performance-analysis-with-tracepoints.patch.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/trace/tracepoint-analysis.txt | 49 ++++++++++++++++++++--------
1 file changed, 36 insertions(+), 13 deletions(-)
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
index e7a7d3e..282e8d9 100644
--- a/Documentation/trace/tracepoint-analysis.txt
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -17,8 +17,18 @@ gathering and interpreting these events. Lacking any current Best Practises,
this document describes some of the methods that can be used.
This document assumes that debugfs is mounted on /sys/kernel/debug and that
-the appropriate tracing options have been configured into the kernel. It is
-assumed that the PCL tool tools/perf has been installed and is in your path.
+at least the following tracing options have been configured into the kernel
+
+ CONFIG_EVENT_PROFILE=y
+ CONFIG_FTRACE=y
+ CONFIG_DYNAMIC_FTRACE=y
+ CONFIG_TRACEPOINTS=y
+
+It is also assumed that the PCL tool available from tools/perf has been
+installed and is in your path. This can be trivially installed as
+
+ $ make prefix=/usr/local
+ $ make prefix=/usr/local install
2. Listing Available Events
===========================
@@ -61,6 +71,11 @@ to page allocation would look something like
$ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
+To start monitoring the events then, one would do something like
+
+ $ echo 1 > tracing_enabled
+ $ cat trace_pipe | some-monitor-script
+
2.2 System-Wide Event Enabling with SystemTap
---------------------------------------------
@@ -116,7 +131,7 @@ basis using set_ftrace_pid.
2.5 Local Event Enablement with PCL
-----------------------------------
-Events can be activate and tracked for the duration of a process on a local
+Events can be activated and tracked for the duration of a process on a local
basis using PCL such as follows.
$ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
@@ -142,8 +157,8 @@ as any script reading trace_pipe.
=====================================
Any workload can exhibit variances between runs and it can be important
-to know what the standard deviation in. By and large, this is left to the
-performance analyst to do it by hand. In the event that the discrete event
+to know what the standard deviation is. By and large, this is left to the
+performance analyst to do by hand. In the event that the discrete event
occurrences are useful to the performance analyst, then perf can be used.
$ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
@@ -190,11 +205,11 @@ be gathered on-line as appropriate. Examples of post-processing might include
o Reading information from /proc for the PID that triggered the event
o Deriving a higher-level event from a series of lower-level events.
- o Calculate latencies between two events
+ o Calculating latencies between two events
Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
script that can read trace_pipe from STDIN or a copy of a trace. When used
-on-line, it can be interrupted once to generate a report without existing
+on-line, it can be interrupted once to generate a report without exiting
and twice to exit.
Simplistically, the script just reads STDIN and counts up events but it
@@ -204,7 +219,7 @@ also can do more such as
are freed to the main allocator from the per-CPU lists, it recognises
that as one per-CPU drain even though there is no specific tracepoint
for that event
- o It can aggregate based on PID or individual process number
+ o It can aggregate based on PID or process name
o In the event memory is getting externally fragmented, it reports
on whether the fragmentation event was severe or moderate.
o When receiving an event about a PID, it can record who the parent was so
@@ -217,7 +232,7 @@ also can do more such as
There may also be a requirement to identify what functions with a program
were generating events within the kernel. To begin this sort of analysis, the
-data must be recorded. At the time of writing, this required root
+data must be recorded.
$ perf record -c 1 \
-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
@@ -226,10 +241,17 @@ data must be recorded. At the time of writing, this required root
Time: 0.894
[ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
-Note the use of '-c 1' to set the event period to sample. The default sample
-period is quite high to minimise overhead but the information collected can be
+Note the use of '-c 1' to set the sample period. The default sample period
+is quite high to minimise overhead but the information collected can be
very coarse as a result.
+The sample period is in units of "events occurred". For a hardware counter,
+this would usually mean the PMU is programmed to "raise an interrupt after
+this many events occured" and the event is recorded on interrupt receipt. For
+software-events such as tracepoints, one event will be recorded every
+"sample period" number of times the tracepoint triggered. In this case,
+-c 1 means "record a sample every time this tracepoint is triggered".
+
This record outputted a file called perf.data which can be analysed using
perf report.
@@ -297,7 +319,8 @@ symbol.
0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
0.00% Xorg [kernel] [k] ftrace_trace_userstack
-To see where within the function pixmanFillsse2 things are going wrong
+Note here that kernel symbols are marked [k]. To see where within the
+function pixmanFillsse2 things are going wrong
$ perf annotate pixmanFillsse2
[ ... ]
@@ -323,5 +346,5 @@ To see where within the function pixmanFillsse2 things are going wrong
At a glance, it looks like the time is being spent copying pixmaps to
the card. Further investigation would be needed to determine why pixmaps
are being copied around so much but a starting point would be to take an
-ancient build of libpixmap out of the library path where it was totally
+ancient build of libpixman out of the library path where it was totally
forgotten about from months ago!
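
The sample-period semantics added by this patch for software events can be
illustrated with a small Python toy model. This is only a sketch of the
counting rule described in the text ("one event will be recorded every
'sample period' number of times the tracepoint triggered"); it is not how
perf is implemented, and the function name is made up for illustration.

```python
def sampled_hits(total_hits, sample_period):
    """Return the 1-based hit numbers at which a sample would be recorded.

    Toy model only: one sample is recorded every `sample_period` times
    the tracepoint triggers, as described for software events.
    """
    if sample_period < 1:
        raise ValueError("sample_period must be >= 1")
    return [n for n in range(1, total_hits + 1) if n % sample_period == 0]


# With a period of 1 (as in 'perf record -c 1') every hit is sampled.
print(sampled_hits(10, 1))
# With a period of 4 only every fourth hit is sampled.
print(sampled_hits(10, 4))
```

A larger period lowers the overhead but makes the collected information
coarser, which is the trade-off the documentation text describes.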
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
From: Richard Kennedy @ 2009-08-23 13:46 UTC (permalink / raw)
To: Wu Fengguang
Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
a.p.zijlstra@chello.nl, chris.mason@oracle.com,
jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
In-Reply-To: <20090823130056.GA10596@localhost>
On Sun, 2009-08-23 at 21:00 +0800, Wu Fengguang wrote:
> On Sun, Aug 23, 2009 at 05:33:33PM +0800, Richard Kennedy wrote:
> > On Sat, 2009-08-22 at 10:51 +0800, Wu Fengguang wrote:
> >
> > > >
> > > > mm/page-writeback.c | 116 +++++++++++++++---------------------------
> > > > 1 file changed, 43 insertions(+), 73 deletions(-)
> > > >
> > > > diff -puN mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references mm/page-writeback.c
> > > > --- a/mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references
> > > > +++ a/mm/page-writeback.c
> > > > @@ -249,32 +249,6 @@ static void bdi_writeout_fraction(struct
> > > > }
> > > > }
> > > >
> > > > -/*
> > > > - * Clip the earned share of dirty pages to that which is actually available.
> > > > - * This avoids exceeding the total dirty_limit when the floating averages
> > > > - * fluctuate too quickly.
> > > > - */
> > > > -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> > > > - unsigned long dirty, unsigned long *pbdi_dirty)
> > > > -{
> > > > - unsigned long avail_dirty;
> > > > -
> > > > - avail_dirty = global_page_state(NR_FILE_DIRTY) +
> > > > - global_page_state(NR_WRITEBACK) +
> > > > - global_page_state(NR_UNSTABLE_NFS) +
> > > > - global_page_state(NR_WRITEBACK_TEMP);
> > > > -
> > > > - if (avail_dirty < dirty)
> > > > - avail_dirty = dirty - avail_dirty;
> > > > - else
> > > > - avail_dirty = 0;
> > > > -
> > > > - avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> > > > - bdi_stat(bdi, BDI_WRITEBACK);
> > > > -
> > > > - *pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> > > > -}
> > > > -
> > > > static inline void task_dirties_fraction(struct task_struct *tsk,
> > > > long *numerator, long *denominator)
> > > > {
> > > > @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
> > > > bdi_dirty = dirty * bdi->max_ratio / 100;
> > > >
> > > > *pbdi_dirty = bdi_dirty;
> > > > - clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
> > > > task_dirty_limit(current, pbdi_dirty);
> > > > }
> > > > }
> > > > @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
> > > > };
> > > >
> > > > get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > - &bdi_thresh, bdi);
> > > > + &bdi_thresh, bdi);
> > > >
> > > > nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > > > - global_page_state(NR_UNSTABLE_NFS);
> > > > - nr_writeback = global_page_state(NR_WRITEBACK);
> > > > -
> > > > - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > > > - bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > > -
> > > > - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > - break;
> > > > -
> > > > - /*
> > > > - * Throttle it only when the background writeback cannot
> > > > - * catch-up. This avoids (excessively) small writeouts
> > > > - * when the bdi limits are ramping up.
> > > > - */
> > > > - if (nr_reclaimable + nr_writeback <
> > > > - (background_thresh + dirty_thresh) / 2)
> > > > - break;
> > > > -
> > > > - if (!bdi->dirty_exceeded)
> > > > - bdi->dirty_exceeded = 1;
> > > > -
> > > > - /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > > - * Unstable writes are a feature of certain networked
> > > > - * filesystems (i.e. NFS) in which data may have been
> > > > - * written to the server's write cache, but has not yet
> > > > - * been flushed to permanent storage.
> > > > - * Only move pages to writeback if this bdi is over its
> > > > - * threshold otherwise wait until the disk writes catch
> > > > - * up.
> > > > - */
> > > > - if (bdi_nr_reclaimable > bdi_thresh) {
> > > > - generic_sync_bdi_inodes(NULL, &wbc);
> > > > - pages_written += write_chunk - wbc.nr_to_write;
> > > > - get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > - &bdi_thresh, bdi);
> > > > - }
> > > > + global_page_state(NR_UNSTABLE_NFS);
> > > > + nr_writeback = global_page_state(NR_WRITEBACK) +
> > > > + global_page_state(NR_WRITEBACK_TEMP);
> > > >
> > > > /*
> > > > * In order to avoid the stacked BDI deadlock we need
> > > > @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
> > > > bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > > }
> > > >
> > > > - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > - break;
> > > > - if (pages_written >= write_chunk)
> > > > - break; /* We've done our duty */
> > >
> > > > + /* always throttle if over threshold */
> > > > + if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > >
> > > That 'if' is a big behavior change. It effectively blocks every one
> > > and canceled Peter's proportional throttling work: the less a process
> > > dirtied, the less it should be throttled.
> > >
> > I don't think it does. the code ends up looking like
> >
> > FOR
> > IF less than dirty_thresh THEN
> > check bdi limits etc
> > ENDIF
> >
> > throttle
> > ENDFOR
> >
> > Therefore we always throttle when over the threshold otherwise we apply
> > the per bdi limits to decide if we throttle.
> >
> > In the existing code clip_bdi_dirty_limit modified the bdi_thresh so
> > that it would not let a bdi dirty enough pages to go over the
> > dirty_threshold. All I've done is to bring the check of dirty_thresh up
> > into balance_dirty_pages.
> >
> > So isn't this effectively the same ?
>
> Yes and no. For the bdi_thresh part it somehow makes the
> clip_bdi_dirty_limit() logic more simple and obvious. Which I tend to
> agree with you and Peter on doing something like this:
>
> if (nr_reclaimable + nr_writeback < dirty_thresh) {
> /* compute bdi_* */
> if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> break;
> }
>
> For other two 'if's..
>
> > > I'd propose to remove the above 'if' and liberate the following three 'if's.
> > >
> > > > +
> > > > + if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > + break;
> > > > +
> > > > + /*
> > > > + * Throttle it only when the background writeback cannot
> > > > + * catch-up. This avoids (excessively) small writeouts
> > > > + * when the bdi limits are ramping up.
> > > > + */
> > > > + if (nr_reclaimable + nr_writeback <
> > > > + (background_thresh + dirty_thresh) / 2)
> > > > + break;
>
> That 'if' can be trivially moved out.
OK,
> > > > +
> > > > + /* done enough? */
> > > > + if (pages_written >= write_chunk)
> > > > + break;
>
> That 'if' must be moved out, otherwise it can block a light writer
> for ever, as long as there is another heavy dirtier keeps the dirty
> numbers high.
Yes, I see. But I was worried about a failing device that gets stuck.
Doesn't this let the application keep dirtying pages forever if the
pages aren't getting written to the device?
Maybe something like this ?
if ( nr_writeback < background_thresh && pages_written >= write_chunk)
break;
or bdi_nr_writeback < bdi_thresh/2 ?
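
To make the disagreement concrete, here is a small Python model of the
'break' checks as they appear in the patch as posted (all three nested under
the dirty_thresh test). It is a sketch of the control flow only, not kernel
code; the function name and the numbers in the demo are invented for
illustration.

```python
def loop_may_exit(nr_reclaimable, nr_writeback,
                  background_thresh, dirty_thresh,
                  bdi_nr_reclaimable, bdi_nr_writeback, bdi_thresh,
                  pages_written, write_chunk):
    """Return True if the throttle loop would 'break' on this pass."""
    if nr_reclaimable + nr_writeback < dirty_thresh:
        # Below the global limit: apply the per-bdi and other checks.
        if bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh:
            return True
        if (nr_reclaimable + nr_writeback <
                (background_thresh + dirty_thresh) // 2):
            return True
        if pages_written >= write_chunk:
            return True
    # At or over the global limit: always throttle.
    return False


# A writer that has already written its chunk, while other dirtiers
# keep the global counts at or above dirty_thresh, never gets to exit.
print(loop_may_exit(900, 200, 500, 1000, 50, 10, 20, 1536, 1536))
# The same writer exits once the global counts drop below dirty_thresh.
print(loop_may_exit(700, 200, 500, 1000, 50, 10, 20, 1536, 1536))
```

The first call models the case raised above: the pages_written check is
never reached while the global counts stay over the limit.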
> > > > + }
> > > > + if (!bdi->dirty_exceeded)
> > > > + bdi->dirty_exceeded = 1;
> > > >
> > > > + /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > > + * Unstable writes are a feature of certain networked
> > > > + * filesystems (i.e. NFS) in which data may have been
> > > > + * written to the server's write cache, but has not yet
> > > > + * been flushed to permanent storage.
>
> > > > + * Only move pages to writeback if this bdi is over its
> > > > + * threshold otherwise wait until the disk writes catch
> > > > + * up.
> > > > + */
> > > > + if (bdi_nr_reclaimable > bdi_thresh) {
>
> I'd much prefer its original form
>
> if (bdi_nr_reclaimable) {
>
> Let's push dirty pages to disk ASAP :)
That change comes from my previous patch, and it's to stop this code
overreacting and pushing all the available dirty pages to the writeback
list.
> > > > + writeback_inodes(&wbc);
> > > > + pages_written += write_chunk - wbc.nr_to_write;
> > >
> > > > + if (wbc.nr_to_write == 0)
> > > > + continue;
> > >
> > > What's the purpose of the above 2 lines?
> >
> > This is to try to replicate the existing code as closely as possible.
> >
> > If writeback_inodes wrote write_chunk pages in one pass then skip to the
> > top of the loop to recheck the limits and decide if we can let the
> > application continue. Otherwise it's not making enough forward progress
> > due to congestion so do the congestion_wait & loop.
>
> It makes sense. We have wbc.encountered_congestion for that purpose.
> However it may not able to write enough pages for other reasons like
> lock contention. So I'd suggest to test (wbc.nr_to_write <= 0).
> Thanks,
> Fengguang
I didn't test the congestion flag directly because we don't care about
it if writeback_inodes did enough. If write_chunk pages get moved to
writeback then we don't need to do the congestion_wait.
Can writeback_inodes do more work than it was asked to do?
But OK, I can make that change if you think it worthwhile.
regards
Richard
* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Neil Horman @ 2009-08-26 11:00 UTC (permalink / raw)
To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin
In-Reply-To: <20090826031057.375303c9.billfink@mindspring.com>
On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> On Fri, 21 Aug 2009, Neil Horman wrote:
>
> > On Fri, Aug 21, 2009 at 12:14:21AM -0400, Bill Fink wrote:
> > > On Thu, 20 Aug 2009, Neil Horman wrote:
> > >
> > > > On Thu, Aug 20, 2009 at 03:50:44AM -0400, Bill Fink wrote:
> > > >
> > > > > When I tried an actual nuttcp performance test, even when rate limiting
> > > > > to just 1 Mbps, I immediately got a kernel oops. I tried to get a
> > > > > crashdump via kexec/kdump, but the kexec kernel, instead of just
> > > > > generating a crashdump, fully booted the new kernel, which was
> > > > > extremely sluggish until I rebooted it through a BIOS re-init,
> > > > > and never produced a crashdump. I tried this several times and
> > > > > an immediate kernel oops was always the result (with either a TCP
> > > > > or UDP test). A ping test of 1000 9000-byte packets with an interval
> > > > > of 0.001 seconds (which is 72 Mbps for 1 second) on the other hand
> > > > > worked just fine.
> > > >
> > > > The sluggishness is expected, since the kdump kernel operates out of such
> > > > limited memory. don't know why you booted to a full system rather than did a
> > > > crash recovery. Don't suppose you got a backtrace did you?
> > >
> > > There was a backtrace on the screen but I didn't have a chance to
> > > record it. BTW did anyone ever think to print the backtrace in
> > > reverse (first to some reserved memory and then output to the display)
> > > so the more interesting parts wouldn't have scrolled off the top of
> > > the screen?
> > >
> > The real solution is to use a console to which the output doesn't scroll off the
> > screen. Normally people use a serial console they can log, or a RAC card that
> > they can record. Even on a regular vga monitor in text mode, you can set up the
> > vt iirc to allow for scrolling.
>
> None of our Asus P6T6 systems have serial consoles. I don't know of
> any RAC cards for them either, nor are there spare PCI slots available
> in many cases. I wouldn't think the Shift-PageUp trick would work
> with a crashed kernel, but I admit I didn't try it. I haven't checked
> out netconsole yet either, but I'm not sure it would help either in a
> case like this that was a network related kernel crash.
>
Any USB ports that you can attach a serial dongle to? That would work as well,
or, as previously mentioned, netconsole also does the trick.
> In any case, a simple kernel command line that would provide a reversed
> backtrace would be a simple thing to facilitate Linux users providing
> useful info to Linux kernel developers in helping to debug kernel
> problems. The most useful info would still be on the screen, so it
> could be transcribed or a photo image of the screen could be taken.
>
I understand what you're saying, I'm just saying there are currently several
options for you that have already solved this problem in different ways.
> Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> does have a serial console, and after a fair amount of effort I was
> able to get it to work as desired, and was able to finally capture
> a backtrace of the kernel oops. BTW I believe the reason the
> kexec/kdump didn't work was probably because it couldn't find
> a /proc/vmcore file, although I don't know why that would be,
> and the Fedora 10 /etc/init.d/kdump script will then just boot
> up normally if it fails to find the /proc/vmcore file (or it's
> zero size).
>
I take care of kdump for fedora and RHEL. If you file a bug on this, I'd be
happy to look into it further.
> The following shows a simple ping test usage of the skb_sources
> tracing feature:
>
> [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
>
> --- 192.168.1.10 ping statistics ---
> 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
>
> [root@xeontest1 tracing]# cat trace
> # tracer: skb_sources
> #
> # PID ANID CNID IFC RXQ CCPU LEN
> # | | | | | | |
> 4217 1 1 eth2 0 4 1500
> 4217 1 1 eth2 0 4 1500
> 4217 1 1 eth2 0 4 1500
> 4217 1 1 eth2 0 4 1500
> 4217 1 1 eth2 0 4 1500
>
> All is as was expected.
>
> But if I try an actual nuttcp performance test (even rate limited
> to 1 Mbps), I get the following kernel oops:
>
Thank you, I think I see the problem. I'll have a patch for you in just a bit.
Thanks
Neil
> [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> PGD 337d12067 PUD 337d11067 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> CPU 4
> Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> RIP: 0010:[<ffffffff810b01ab>] [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> RSP: 0018:ffff8801a5811a88 EFLAGS: 00010213
> RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> FS: 00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> Stack:
> ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> Call Trace:
> [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> [<ffffffff810f785c>] vfs_read+0xc0/0x107
> [<ffffffff810f7971>] sys_read+0x4c/0x75
> [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> RIP [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> RSP <ffff8801a5811a88>
> CR2: 0000000000000038
>
> -Thanks
>
> -Bill
>
* Re: SIP conntrack defeating Asterisk canreinvite
From: John A. Sullivan III @ 2009-08-26 10:59 UTC (permalink / raw)
To: Joerg Dorchain; +Cc: netfilter
In-Reply-To: <20090826071510.GZ6724@Redstar.dorchain.net>
On Wed, 2009-08-26 at 09:15 +0200, Joerg Dorchain wrote:
> On Tue, Aug 25, 2009 at 09:04:28PM -0400, John A. Sullivan III wrote:
> > The reinvite works by the Asterisk server sending a SIP invite after the
> > call has been set up. The new invite contains the address of the phone
> > in the SDP portion of the packet rather than the address of the PBX.
> > This should redirect the media stream to flow directly between the
> > phones. However, it appears conntrack is rewriting the SDP so that the
> > address is reverted to the PBX address.
>
> Rewriting sounds like nat. I am using conntrack_sip to be able
> to have the rtp connections accepted as related to a sip
> connection. Are you sure that you aren't using the sip nat helper
> by chance?
>
> To have reinvites working, I needed sip_direct_media=0 as option
> to nf_conntrack_sip
>
> Bye,
>
> Joerg
Yes, as I was thinking after I wrote this, it is probably ip_nat_sip
since it is doing packet rewriting. So it sounds like it is a problem
without sip_direct_media, which sounds like it implies upgrading my
kernel :-( Thanks - John
--
John A. Sullivan III
Open Source Development Corporation
+1 207-985-7880
jsullivan@opensourcedevel.com
http://www.spiritualoutreach.com
Making Christianity intelligible to secular society
* Re: [PATCH 1/2 v3] sh_mobile_ceu: add soft reset function
From: Guennadi Liakhovetski @ 2009-08-26 10:58 UTC (permalink / raw)
To: Kuninori Morimoto; +Cc: V4L-Linux
In-Reply-To: <uhbvuhn8f.wl%morimoto.kuninori@renesas.com>
On Wed, 26 Aug 2009, Kuninori Morimoto wrote:
>
> Dear Guennadi
>
> > I've updated both your patches on the top of my current tree and slightly
> > cleaned them up - mainly multi-line comments. Also fixed one error in
> > sh_mobile_ceu_add_device() (see below). Please, check if the stack at
> > http://download.open-technology.de/soc-camera/20090826/ looks ok and still
> > works for you. As usual, you find instructions on which tree and branch to
> > use in 0000-base.
>
> Thank you for your hard work.
> It looks OK to me.
> But I wonder whether your git tree and Paul's can be merged correctly?
> I can see Magnus's patch to sh_mobile_ceu_camera.c in Paul's latest git tree.
That will be handled when the time comes:-)
Thanks
Guennadi
---
Guennadi Liakhovetski
--
video4linux-list mailing list
Unsubscribe mailto:video4linux-list-request@redhat.com?subject=unsubscribe
https://www.redhat.com/mailman/listinfo/video4linux-list
^ permalink raw reply
* RE: nVidia Geforce 8400 GS PCI Express x16 VGA Pass Through to Windows XP Home 32-bit HVM Virtual Machine with Intel Desktop Board DQ45CB
From: Han, Weidong @ 2009-08-26 10:56 UTC (permalink / raw)
To: 'enming.teo@asiasoftsea.net',
'djmagee@mageenet.net'
Cc: 'xen-devel@lists.xensource.com'
In-Reply-To: <2DABCD389F9741218B632A562A667FD3@ASOITIS16>
Teo En Ming (Zhang Enming) wrote:
> Hi Weidong,
>
> Could you share with us the hack codes for making Geforce 8400 GS
> work and also how to let the Windows HVM guest boot up using the real
> BIOS of Geforce 8400 GS instead of an emulated VGA BIOS?
>
What patch are you using now? Using the real VGA BIOS of the gfx card to
replace the emulated VGA BIOS is the prerequisite for gfx passthrough. You
can find it in the posted gfx passthrough patches or in XCI. As for the hack
to make the Geforce 8400 work, we reserve the physical MMIO BARs in dsdt.asl
and map the card's physical MMIO BARs 1:1 to its virtual MMIO BARs. Currently
our code is experimental; we will send it to the mailing list after cleanup
and more tests.
Regards,
Weidong
> Thank you.
>
> Regards,
>
> Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics Engineering)
> BEng(Hons)(Mechanical Engineering)
> Technical Support Engineer
> Information Technology Department
> Asiasoft Online Pte Ltd
> Tampines Central 1 #04-01 Tampines Plaza
> Singapore 529541
> Republic of Singapore
> Mobile: +65-9648-9798
> MSN: teoenming@hotmail.com
> Alma Maters: Singapore Polytechnic, National University of Singapore
>
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Han,
> Weidong Sent: Wednesday, August 26, 2009 4:27 PM
> To: 'djmagee@mageenet.net'; 'enming.teo@asiasoftsea.net'
> Cc: 'xen-devel@lists.xensource.com'
> Subject: RE: [Xen-devel] nVidia Geforce 8400 GS PCI Express x16
> VGAPassThroughto Windows XP Home 32-bit HVM Virtual Machine with
> IntelDesktop BoardDQ45CB
>
> I suppose you just use the patch posted in mailing list before. nVidia
> Geforce 8400 passthrough needs extra hacks. We can make it work in our
> experiments with 1:1 map of its MMIO BARs.
>
> We are working on gfx passthrough on latest xen-unstable. Firstly, we
> want to cook a simple patch including generic changes to support
> passthrough of virtualization friendly gfx cards, such as Nvidia
> FX3800. This patch is basically done. Then, we will add some hacks
> for more gfx cards passthrough, such as iGFX and some Nvidia and ATI
> cards.
>
> Regards,
> Weidong
>
> djmagee@mageenet.net wrote:
>> As I've pointed out on this list before, there are not enough PCIe
>> lanes on the DQ45CB to drive both the internal graphics adapter and
>> the add-on adapter at the same time. I believe the onboard one may
>> be able to operate in some sort of VGA only mode when there is a card
>> installed and used as the primary adapter.
>>
>> I have a similar setup, using the same motherboard, and an ATI 4770.
>> I used the VGA passthrough patches from an earlier posting (I saw you
>> followed the same set of instructions), with limited success. I've
>> been using 3.4-testing, and 'xenified' 2.6.29.6 kernel. I made
>> modifications to the dom0 portion of the patches so they would apply
>> to my xenified 2.6.29 kernel. These patches include code that will
>> copy the VGA bios to the guest bios instead of using the emulated vga
>> bios. I've had very little success, however. I have only tried
>> passing through the ATI adapter. In all instances, the guest bios
>> messages appear on my monitor, so this much works. In some cases,
>> the guest essentially stops there; xm list shows about 2sec CPU
>> usage and nothing ever happens after that point. In other cases, the
>> guest (win xp/vista/7, as well as KNOPPIX 5.3.1 DVD) will boot all
>> the way, but in very low res/color mode, and cannot properly
>> initialize the video device. Once or twice, it actually did
>> recognize the device and had a reasonable default color/resolution
>> combination. In all cases where the guest actually boots, the system
>> eventually freezes. In some cases, I get endless streams of iommu
>> page faults.
>>
>> I have 8GB ram installed. In all cases I've limited dom0 memory to
>> 2GB. In all cases, my guest has been assigned 2GB of memory.
>>
>> I have a dual core e6600. I've tried allowing dom0 to use both
>> cores, offlining one core (using xend dom0_vcpu setting) after boot,
>> and restricting dom0 to only one core using the dom0_max_vcpus xen
>> hypervisor parameter. In all of these cases I've tried both one and
>> two vcpus for the guest. My success with VGA passthrough seems
>> somewhat random and no combination of cpu assignment seems to have
>> any effect.
>>
>> I have not tried with the 2.6.18-xen kernel as I haven't gotten it to
>> boot on my hardware; it can never find my volume group, even if I
>> create a initrd with all of the required modules, or build those
>> drivers into the kernel. I have not spent more than maybe a half an
>> hour on this problem; I suspect it may have something to do with the
>> version of mkinitrd I'm using (from Fedora 9 x64).
>>
>> If anyone else has any insight or similar experience I'd also love
>> to hear it.
>>
>> Doug Magee
>> djmagee@mageenet.com
>>
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xensource.com
>> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Mr. Teo
>> En Ming (Zhang Enming) Sent: Tuesday, August 25, 2009 11:57 AM
>> To: enming.teo@asiasoftsea.net
>> Cc: xen-devel@lists.xensource.com
>> Subject: Re: [Xen-devel] nVidia Geforce 8400 GS PCI Express x16 VGA
>> PassThroughto Windows XP Home 32-bit HVM Virtual Machine with Intel
>> Desktop BoardDQ45CB
>>
>> I have uninstalled Xen 3.4.1 and installed Xen 3.5-unstable as
>> suggested by Weidong.
>>
>>
>> On 08/25/2009 11:47 PM, Mr. Teo En Ming (Zhang Enming) wrote:
>>> Dear All,
>>>
>>> I have managed to do PCI-e VGA passthrough with the open source Xen
>>> but the work is still in progress because although Windows XP guest
>>> can see the REAL PCI-e x16 graphics card instead of an emulated
>>> graphics driver, it cannot be initialized yet.
>>>
>>> Thanks to Intel Engineer Han Weidong, Pasi Kärkkäinen, Boris
>>> Derzhavets, Marc, Caz Yokoyama, and others who have helped me and
>>> shared their knowledge with me along the way.
>>>
>>> System Configuration:
>>>
>>> Intel Desktop Board DQ45CB with BIOS upgraded to 0093
>>> Onboard Intel GMA 4500 Graphics (IGD)
>>> nVidia Geforce 8400 GS PCI Express x16 Graphics Card
>>>
>>> Fedora 11 Linux 64-bit Xen paravirt operations Domain 0 Host
>>> Operating System Xen 3.5 Unstable/Development Type 1 Hypervisor
>>> Jeremy Fitzhardinge's Xen paravirt-ops domain 0 Kernel 2.6.31-rc6
>>> Primary Video Adapter in BIOS: IGD
>>>
>>> Please see the screenshots and my blog at the link here:
>>>
>>>
>>> http://teo-en-ming-aka-zhang-enming.blogspot.com/2009/08/nvidia-geforce-8400-gs-pci-express-x16.html
>>>
>>>
>>
>>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel
^ permalink raw reply
* Re: [PATCH 4/5] hugetlb: add per node hstate attributes
From: Mel Gorman @ 2009-08-26 10:11 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
Adam Litke, Andy Whitcroft, eric.whitney
In-Reply-To: <1251233369.16229.1.camel@useless.americas.hpqcorp.net>
On Tue, Aug 25, 2009 at 04:49:29PM -0400, Lee Schermerhorn wrote:
> > >
> > > +static nodemask_t *nodes_allowed_from_node(int nid)
> > > +{
> >
> > This name is a bit weird. It's creating a nodemask with just a single
> > node allowed.
> >
> > Is there something wrong with using the existing function
> > nodemask_of_node()? If stack is the problem, perhaps there is some macro
> > magic that would allow a nodemask to be either declared on the stack or
> > kmalloc'd.
>
> Yeah. nodemask_of_node() creates an on-stack mask, invisibly, in a
> block nested inside the context where it's invoked. I would be
> declaring the nodemask in the compound else clause and don't want to
> access it [via the nodes_allowed pointer] from outside of there.
>
So, the existence of the mask on the stack is the problem. I can
understand that, they are potentially quite large.
Would it be possible to add a helper alongside it like
init_nodemask_of_node() that does the same work as nodemask_of_node()
but takes a nodemask parameter? nodemask_of_node() would reuse
init_nodemask_of_node() except it declares the nodemask on the stack.
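To make the idea concrete, here is a rough userspace mock-up of the split. It
is only a sketch: the nodemask_t below is a stand-in rather than the kernel
type, MAX_NUMNODES is fixed at an arbitrary 64, and init_nodemask_of_node() is
the hypothetical helper name from the paragraph above, not an existing
function.

```c
#include <assert.h>
#include <string.h>

#define MAX_NUMNODES	64	/* assumption: arbitrary small node count */
#define BITS_PER_LONG	(8 * sizeof(unsigned long))
#define NODEMASK_LONGS	((MAX_NUMNODES + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Stand-in for the kernel's nodemask_t: one bit per node. */
typedef struct {
	unsigned long bits[NODEMASK_LONGS];
} nodemask_t;

/* Hypothetical helper: fill a caller-supplied mask with only 'nid' set. */
static void init_nodemask_of_node(nodemask_t *mask, int nid)
{
	assert(nid >= 0 && nid < MAX_NUMNODES);
	memset(mask, 0, sizeof(*mask));
	mask->bits[nid / BITS_PER_LONG] |= 1UL << (nid % BITS_PER_LONG);
}

/* Test whether 'nid' is set in the mask (mock of the kernel accessor). */
static int node_isset(int nid, const nodemask_t *mask)
{
	return (int)((mask->bits[nid / BITS_PER_LONG] >>
		      (nid % BITS_PER_LONG)) & 1UL);
}

/* On-stack flavour, built on top of the helper. */
static nodemask_t nodemask_of_node(int nid)
{
	nodemask_t mask;

	init_nodemask_of_node(&mask, nid);
	return mask;
}
```

A caller that cannot afford the on-stack copy would kmalloc the mask and pass
the pointer to the helper; everyone else keeps using nodemask_of_node().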
> >
> > > + nodemask_t *nodes_allowed;
> > > + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> > > + if (!nodes_allowed) {
> > > + printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> > > + "for huge page allocation.\nFalling back to default.\n",
> > > + current->comm);
> > > + } else {
> > > + nodes_clear(*nodes_allowed);
> > > + node_set(nid, *nodes_allowed);
> > > + }
> > > + return nodes_allowed;
> > > +}
> > > +
> > > #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> > > -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> > > +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> > > + int nid)
> > > {
> > > unsigned long min_count, ret;
> > > nodemask_t *nodes_allowed;
> > > @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages(
> > > if (h->order >= MAX_ORDER)
> > > return h->max_huge_pages;
> > >
> > > - nodes_allowed = huge_mpol_nodes_allowed();
> > > + if (nid < 0)
> > > + nodes_allowed = huge_mpol_nodes_allowed();
> >
> > hugetlb is a bit littered with magic numbers being passed into functions.
> > Attempts have been made to clear them up as patches change
> > that area. Would it be possible to define something like
> >
> > #define HUGETLB_OBEY_MEMPOLICY -1
> >
> > for the nid here as opposed to passing in -1? I know -1 is used in the page
> > allocator functions but there it means "current node" and here it means
> > "obey mempolicies".
>
> Well, here it means, NO_NODE_ID_SPECIFIED or, "we didn't get here via a
> per node attribute". It means "derive nodes allowed from memory policy,
> if non-default, else use nodes_online_map" [which is not exactly the
> same as obeying memory policy].
>
> But, I can see defining a symbolic constant such as
> NO_NODE[_ID_SPECIFIED]. I'll try next spin.
>
That NO_NODE_ID_SPECIFIED was the underlying definition I was looking
for. It makes sense at both sites.
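Something along these lines is what I have in mind, as a userspace sketch
only. The struct is a cut-down mock of struct hstate, MAX_NODES is an
arbitrary assumption, and the helper name nr_huge_pages_for() is made up for
illustration; only the constant name comes from the discussion above.

```c
#define NO_NODE_ID_SPECIFIED	(-1)	/* "we did not get here via a per node attribute" */
#define MAX_NODES		4	/* assumption: arbitrary node count for the mock */

/* Cut-down mock of struct hstate, just the counters needed here. */
struct hstate_mock {
	unsigned long nr_huge_pages;
	unsigned long nr_huge_pages_node[MAX_NODES];
};

/*
 * Hypothetical helper: return the global count when no node was
 * specified, otherwise the count for that node.
 */
static unsigned long nr_huge_pages_for(const struct hstate_mock *h, int nid)
{
	if (nid == NO_NODE_ID_SPECIFIED)
		return h->nr_huge_pages;
	return h->nr_huge_pages_node[nid];
}
```

The same constant would then replace the bare -1 passed to
set_max_huge_pages() from the sysctl handler.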
> > > -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> > > +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> > > +{
> > > + int nid;
> > > +
> > > + for (nid = 0; nid < nr_node_ids; nid++) {
> > > + struct node *node = &node_devices[nid];
> > > + int hi;
> > > + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)
> >
> > Does that hi mean hello, high, nid or hstate_idx?
> >
> > hstate_idx would appear to be the appropriate name here.
>
> Or just plain 'i', like in the following, pre-existing function?
>
Whichever suits you best. If hstate_idx is really what it is, I see no
harm in using it but 'i' is an index and I'd sooner recognise that than
the less meaningful "hi".
> >
> > > + if (node->hstate_kobjs[hi] == kobj) {
> > > + if (nidp)
> > > + *nidp = nid;
> > > + return &hstates[hi];
> > > + }
> > > + }
> >
> > Ok.... so, there is a struct node array for the sysdev and this patch adds
> > references to the "hugepages" directory kobject and the subdirectories for
> > each page size. We walk all the objects until we find a match. Obviously,
> > this adds a dependency of base node support on hugetlbfs which feels backwards
> > and you call that out in your leader.
> >
> > Can this be the other way around? i.e. The struct hstate has an array of
> > kobjects arranged by nid that is filled in when the node is registered?
> > There will only be one kobject-per-pagesize-per-node so it seems like it
> > would work. I confess, I haven't prototyped this to be 100% sure.
>
> This will take a bit longer to sort out. I do want to change the
> registration, tho', so that hugetlb.c registers its single node
> register/unregister functions with base/node.c to remove the source
> level dependency in that direction. node.c will only register nodes on
> hot plug as it's initialized too early, relative to hugetlb.c to
> register them at init time. This should break the call dependency of
> base/node.c on the hugetlb module.
>
> As far as moving the per node attributes' kobjects to the hugetlb global
> hstate arrays... Have to think about that. I agree that it would be
> nice to remove the source level [header] dependency.
>
FWIW, I see no problem with the mempolicy stuff going ahead separately from
this patch after the few relatively minor cleanups highlighted in the thread
and tackling this patch as a separate cycle. It's up to you really.
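For when that cycle comes, the "other way around" arrangement I floated could
look roughly like the following. This is an unprototyped userspace sketch:
struct kobject and struct hstate are mocks, the node_kobjs field and the
array sizes are assumptions, and the real lookup would still need to cope
with registration ordering and hotplug.

```c
#include <stddef.h>

#define MAX_NUMNODES		4	/* assumption: arbitrary for the mock */
#define HUGE_MAX_HSTATE		2
#define NO_NODE_ID_SPECIFIED	(-1)

struct kobject {			/* mock, not the kernel structure */
	const char *name;
};

struct hstate {				/* mock with only the kobject pointers */
	struct kobject *global_kobj;
	struct kobject *node_kobjs[MAX_NUMNODES];	/* hypothetical field */
};

static struct hstate hstates[HUGE_MAX_HSTATE];

/*
 * Map a kobject back to its hstate. *nidp is set to the owning node,
 * or NO_NODE_ID_SPECIFIED for the global attribute directory.
 */
static struct hstate *kobj_to_hstate(const struct kobject *kobj, int *nidp)
{
	int i, nid;

	if (!kobj)
		return NULL;

	for (i = 0; i < HUGE_MAX_HSTATE; i++) {
		struct hstate *h = &hstates[i];

		if (h->global_kobj == kobj) {
			if (nidp)
				*nidp = NO_NODE_ID_SPECIFIED;
			return h;
		}
		for (nid = 0; nid < MAX_NUMNODES; nid++) {
			if (h->node_kobjs[nid] == kobj) {
				if (nidp)
					*nidp = nid;
				return h;
			}
		}
	}
	return NULL;	/* the patch as posted would BUG() here */
}
```

With the kobjects hanging off the hstate, struct node would not need to know
about HUGE_MAX_HSTATE at all, which is the header dependency I would like to
see go away.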
> >
> > > +
> > > + BUG();
> > > + return NULL;
> > > +}
> > > +
> > > +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
> > > {
> > > int i;
> > > +
> > > for (i = 0; i < HUGE_MAX_HSTATE; i++)
> > > - if (hstate_kobjs[i] == kobj)
> > > + if (hstate_kobjs[i] == kobj) {
> > > + if (nidp)
> > > + *nidp = -1;
> > > return &hstates[i];
> > > - BUG();
> > > - return NULL;
> > > + }
> > > +
> > > + return kobj_to_node_hstate(kobj, nidp);
> > > }
> > >
> > > static ssize_t nr_hugepages_show(struct kobject *kobj,
> > > struct kobj_attribute *attr, char *buf)
> > > {
> > > - struct hstate *h = kobj_to_hstate(kobj);
> > > - return sprintf(buf, "%lu\n", h->nr_huge_pages);
> > > + struct hstate *h;
> > > + unsigned long nr_huge_pages;
> > > + int nid;
> > > +
> > > + h = kobj_to_hstate(kobj, &nid);
> > > + if (nid < 0)
> > > + nr_huge_pages = h->nr_huge_pages;
> >
> > Here is another magic number except it means something slightly
> > different. It means NR_GLOBAL_HUGEPAGES or something similar. It would
> > be nice if these different special nid values could be named, preferably
> > collapsed to being one "core" thing.
>
> Again, it means "NO NODE ID specified" [via per node attribute]. Again,
> I'll address this with a single constant.
>
> >
> > > + else
> > > + nr_huge_pages = h->nr_huge_pages_node[nid];
> > > +
> > > + return sprintf(buf, "%lu\n", nr_huge_pages);
> > > }
> > > +
> > > static ssize_t nr_hugepages_store(struct kobject *kobj,
> > > struct kobj_attribute *attr, const char *buf, size_t count)
> > > {
> > > - int err;
> > > unsigned long input;
> > > - struct hstate *h = kobj_to_hstate(kobj);
> > > + struct hstate *h;
> > > + int nid;
> > > + int err;
> > >
> > > err = strict_strtoul(buf, 10, &input);
> > > if (err)
> > > return 0;
> > >
> > > - h->max_huge_pages = set_max_huge_pages(h, input);
> >
> > "input" is a bit meaningless. The function you are passing to calls this
> > parameter "count". Can you match the naming please? Otherwise, I might
> > guess that this is a "delta" which occurs elsewhere in the hugetlb code.
>
> I guess I can change that. It's the pre-existing name, and 'count' was
> already used. Guess I can change 'count' to 'len' and 'input' to
> 'count'
Makes sense.
> >
> > > + h = kobj_to_hstate(kobj, &nid);
> > > + h->max_huge_pages = set_max_huge_pages(h, input, nid);
> > >
> > > return count;
> > > }
> > > @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages);
> > > static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
> > > struct kobj_attribute *attr, char *buf)
> > > {
> > > - struct hstate *h = kobj_to_hstate(kobj);
> > > + struct hstate *h = kobj_to_hstate(kobj, NULL);
> > > +
> > > return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
> > > }
> > > +
> > > static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
> > > struct kobj_attribute *attr, const char *buf, size_t count)
> > > {
> > > int err;
> > > unsigned long input;
> > > - struct hstate *h = kobj_to_hstate(kobj);
> > > + struct hstate *h = kobj_to_hstate(kobj, NULL);
> > >
> > > err = strict_strtoul(buf, 10, &input);
> > > if (err)
> > > @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
> > > static ssize_t free_hugepages_show(struct kobject *kobj,
> > > struct kobj_attribute *attr, char *buf)
> > > {
> > > - struct hstate *h = kobj_to_hstate(kobj);
> > > - return sprintf(buf, "%lu\n", h->free_huge_pages);
> > > + struct hstate *h;
> > > + unsigned long free_huge_pages;
> > > + int nid;
> > > +
> > > + h = kobj_to_hstate(kobj, &nid);
> > > + if (nid < 0)
> > > + free_huge_pages = h->free_huge_pages;
> > > + else
> > > + free_huge_pages = h->free_huge_pages_node[nid];
> > > +
> > > + return sprintf(buf, "%lu\n", free_huge_pages);
> > > }
> > > HSTATE_ATTR_RO(free_hugepages);
> > >
> > > static ssize_t resv_hugepages_show(struct kobject *kobj,
> > > struct kobj_attribute *attr, char *buf)
> > > {
> > > - struct hstate *h = kobj_to_hstate(kobj);
> > > + struct hstate *h = kobj_to_hstate(kobj, NULL);
> > > return sprintf(buf, "%lu\n", h->resv_huge_pages);
> > > }
> > > HSTATE_ATTR_RO(resv_hugepages);
> > > @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages);
> > > static ssize_t surplus_hugepages_show(struct kobject *kobj,
> > > struct kobj_attribute *attr, char *buf)
> > > {
> > > - struct hstate *h = kobj_to_hstate(kobj);
> > > - return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> > > + struct hstate *h;
> > > + unsigned long surplus_huge_pages;
> > > + int nid;
> > > +
> > > + h = kobj_to_hstate(kobj, &nid);
> > > + if (nid < 0)
> > > + surplus_huge_pages = h->surplus_huge_pages;
> > > + else
> > > + surplus_huge_pages = h->surplus_huge_pages_node[nid];
> > > +
> > > + return sprintf(buf, "%lu\n", surplus_huge_pages);
> > > }
> > > HSTATE_ATTR_RO(surplus_hugepages);
> > >
> > > @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att
> > > .attrs = hstate_attrs,
> > > };
> > >
> > > -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> > > +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> > > + struct kobject *parent,
> > > + struct kobject **hstate_kobjs,
> > > + struct attribute_group *hstate_attr_group)
> > > {
> > > int retval;
> > > + int hi = h - hstates;
> > >
> > > - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> > > - hugepages_kobj);
> > > - if (!hstate_kobjs[h - hstates])
> > > + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> > > + if (!hstate_kobjs[hi])
> > > return -ENOMEM;
> > >
> > > - retval = sysfs_create_group(hstate_kobjs[h - hstates],
> > > - &hstate_attr_group);
> > > + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
> > > if (retval)
> > > - kobject_put(hstate_kobjs[h - hstates]);
> > > + kobject_put(hstate_kobjs[hi]);
> > >
> > > return retval;
> > > }
> > > @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo
> > > return;
> > >
> > > for_each_hstate(h) {
> > > - err = hugetlb_sysfs_add_hstate(h);
> > > + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> > > + hstate_kobjs, &hstate_attr_group);
> > > if (err)
> > > printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
> > > h->name);
> > > }
> > > }
> > >
> > > +#ifdef CONFIG_NUMA
> > > +static struct attribute *per_node_hstate_attrs[] = {
> > > + &nr_hugepages_attr.attr,
> > > + &free_hugepages_attr.attr,
> > > + &surplus_hugepages_attr.attr,
> > > + NULL,
> > > +};
> > > +
> > > +static struct attribute_group per_node_hstate_attr_group = {
> > > + .attrs = per_node_hstate_attrs,
> > > +};
> > > +
> > > +
> > > +void hugetlb_unregister_node(struct node *node)
> > > +{
> > > + struct hstate *h;
> > > +
> > > + for_each_hstate(h) {
> > > + kobject_put(node->hstate_kobjs[h - hstates]);
> > > + node->hstate_kobjs[h - hstates] = NULL;
> > > + }
> > > +
> > > + kobject_put(node->hugepages_kobj);
> > > + node->hugepages_kobj = NULL;
> > > +}
> > > +
> > > +static void hugetlb_unregister_all_nodes(void)
> > > +{
> > > + int nid;
> > > +
> > > + for (nid = 0; nid < nr_node_ids; nid++)
> > > + hugetlb_unregister_node(&node_devices[nid]);
> > > +}
> > > +
> > > +void hugetlb_register_node(struct node *node)
> > > +{
> > > + struct hstate *h;
> > > + int err;
> > > +
> > > + if (!hugepages_kobj)
> > > + return; /* too early */
> > > +
> > > + node->hugepages_kobj = kobject_create_and_add("hugepages",
> > > + &node->sysdev.kobj);
> > > + if (!node->hugepages_kobj)
> > > + return;
> > > +
> > > + for_each_hstate(h) {
> > > + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
> > > + node->hstate_kobjs,
> > > + &per_node_hstate_attr_group);
> > > + if (err)
> > > + printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> > > + " for node %d\n",
> > > + h->name, node->sysdev.id);
> > > + }
> > > +}
> > > +
> > > +static void hugetlb_register_all_nodes(void)
> > > +{
> > > + int nid;
> > > +
> > > + for (nid = 0; nid < nr_node_ids; nid++) {
> > > + struct node *node = &node_devices[nid];
> > > + if (node->sysdev.id == nid && !node->hugepages_kobj)
> > > + hugetlb_register_node(node);
> > > + }
> > > +}
> > > +#endif
> > > +
> > > static void __exit hugetlb_exit(void)
> > > {
> > > struct hstate *h;
> > >
> > > + hugetlb_unregister_all_nodes();
> > > +
> > > for_each_hstate(h) {
> > > kobject_put(hstate_kobjs[h - hstates]);
> > > }
> > > @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void)
> > >
> > > hugetlb_sysfs_init();
> > >
> > > + hugetlb_register_all_nodes();
> > > +
> > > return 0;
> > > }
> > > module_init(hugetlb_init);
> > > @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
> > > proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
> > >
> > > if (write)
> > > - h->max_huge_pages = set_max_huge_pages(h, tmp);
> > > + h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
> > >
> > > return 0;
> > > }
> > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> > > ===================================================================
> > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400
> > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400
> > > @@ -21,9 +21,12 @@
> > >
> > > #include <linux/sysdev.h>
> > > #include <linux/cpumask.h>
> > > +#include <linux/hugetlb.h>
> > >
> > > struct node {
> > > struct sys_device sysdev;
> > > + struct kobject *hugepages_kobj;
> > > + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
> > > };
> > >
> > > struct memory_block;
> > >
> >
> > I'm not against this idea and think it can work side-by-side with the memory
> > policies. I believe it does need a bit more cleaning up before merging
> > though. I also wasn't able to test this yet due to various build and
> > deploy issues.
>
> OK. I'll do the cleanup. I have tested this atop the mempolicy
> version by working around the build issues that I thought were just
> temporary glitches in the mmotm series. In my [limited] experience, one
> can interleave numactl+hugeadm with setting values via the per node
> attributes and it does the right thing. No heavy testing with racing
> tasks, tho'.
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [PATCH 4/5] hugetlb: add per node hstate attributes
From: Mel Gorman @ 2009-08-25 10:19 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
Adam Litke, Andy Whitcroft, eric.whitney
In-Reply-To: <20090824192902.10317.94512.sendpatchset@localhost.localdomain>
On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote:
> PATCH/RFC 5/4 hugetlb: register per node hugepages attributes
>
> Against: 2.6.31-rc6-mmotm-090820-1918
>
> V2: remove dependency on kobject private bitfield. Search
> global hstates then all per node hstates for kobject
> match in attribute show/store functions.
>
> V3: rebase atop the mempolicy-based hugepage alloc/free;
> use custom "nodes_allowed" to restrict alloc/free to
> a specific node via per node attributes. Per node
> attribute overrides mempolicy. I.e., mempolicy only
> applies to global attributes.
>
> To demonstrate feasibility--if not advisability--of supporting
> both mempolicy-based persistent huge page management with per
> node "override" attributes.
>
> This patch adds the per huge page size control/query attributes
> to the per node sysdevs:
>
> /sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
> nr_hugepages - r/w
> free_hugepages - r/o
> surplus_hugepages - r/o
>
> The patch attempts to re-use/share as much of the existing
> global hstate attribute initialization and handling, and the
> "nodes_allowed" constraint processing as possible.
> In set_max_huge_pages(), a node id < 0 indicates a change to
> global hstate parameters. In this case, any non-default task
> mempolicy will be used to generate the nodes_allowed mask. A
> node id >= 0 indicates a node specific update and the count
> argument specifies the target count for the node. From this
> info, we compute the target global count for the hstate and
> construct a nodes_allowed node mask containing only the specified
> node. Thus, setting the node specific nr_hugepages via the
> per node attribute effectively overrides any task mempolicy.
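For my own benefit, the per node to global conversion described here boils
down to the arithmetic of the "count += h->nr_huge_pages -
h->nr_huge_pages_node[nid]" line in the patch. A standalone sketch, with a
made-up function name and plain integers instead of the hstate fields:

```c
/*
 * Convert a per-node target into the global target that the pool
 * resizing code expects: pages on every other node stay as they are,
 * only the specified node's share changes.
 *
 * nr_total   - current global number of huge pages
 * nr_on_node - current number of huge pages on the specified node
 * node_count - requested number of huge pages for that node
 */
static unsigned long hugetlb_global_count(unsigned long nr_total,
					  unsigned long nr_on_node,
					  unsigned long node_count)
{
	return node_count + (nr_total - nr_on_node);
}
```

So with 24 pages in the pool, 8 of them on the node in question, asking for
16 on that node gives a global target of 32; the single-node nodes_allowed
mask is what keeps the extra 8 from landing elsewhere.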
>
>
> Issue: dependency of base driver [node] on hugetlbfs module.
> We want to keep all of the hstate attribute registration and handling
> in the hugetlb module. However, we need to call into this code to
> register the per node hstate attributes on node hot plug.
>
> With this patch:
>
> (me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
> ./ ../ free_hugepages nr_hugepages surplus_hugepages
>
> Starting from:
> Node 0 HugePages_Total: 0
> Node 0 HugePages_Free: 0
> Node 0 HugePages_Surp: 0
> Node 1 HugePages_Total: 0
> Node 1 HugePages_Free: 0
> Node 1 HugePages_Surp: 0
> Node 2 HugePages_Total: 0
> Node 2 HugePages_Free: 0
> Node 2 HugePages_Surp: 0
> Node 3 HugePages_Total: 0
> Node 3 HugePages_Free: 0
> Node 3 HugePages_Surp: 0
> vm.nr_hugepages = 0
>
> Allocate 16 persistent huge pages on node 2:
> (me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages
>
> [Note that this is equivalent to:
> numactl -m 2 hugeadm --pool-pages-min 2M:+16
> ]
>
> Yields:
> Node 0 HugePages_Total: 0
> Node 0 HugePages_Free: 0
> Node 0 HugePages_Surp: 0
> Node 1 HugePages_Total: 0
> Node 1 HugePages_Free: 0
> Node 1 HugePages_Surp: 0
> Node 2 HugePages_Total: 16
> Node 2 HugePages_Free: 16
> Node 2 HugePages_Surp: 0
> Node 3 HugePages_Total: 0
> Node 3 HugePages_Free: 0
> Node 3 HugePages_Surp: 0
> vm.nr_hugepages = 16
>
> Global controls work as expected--reduce pool to 8 persistent huge pages:
> (me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>
> Node 0 HugePages_Total: 0
> Node 0 HugePages_Free: 0
> Node 0 HugePages_Surp: 0
> Node 1 HugePages_Total: 0
> Node 1 HugePages_Free: 0
> Node 1 HugePages_Surp: 0
> Node 2 HugePages_Total: 8
> Node 2 HugePages_Free: 8
> Node 2 HugePages_Surp: 0
> Node 3 HugePages_Total: 0
> Node 3 HugePages_Free: 0
> Node 3 HugePages_Surp: 0
>
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> drivers/base/node.c | 2
> include/linux/hugetlb.h | 6 +
> include/linux/node.h | 3
> mm/hugetlb.c | 213 +++++++++++++++++++++++++++++++++++++++++-------
> 4 files changed, 197 insertions(+), 27 deletions(-)
>
> Index: linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/drivers/base/node.c 2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/drivers/base/node.c 2009-08-24 12:12:56.000000000 -0400
> @@ -200,6 +200,7 @@ int register_node(struct node *node, int
> sysdev_create_file(&node->sysdev, &attr_distance);
>
> scan_unevictable_register_node(node);
> + hugetlb_register_node(node);
> }
> return error;
> }
> @@ -220,6 +221,7 @@ void unregister_node(struct node *node)
> sysdev_remove_file(&node->sysdev, &attr_distance);
>
> scan_unevictable_unregister_node(node);
> + hugetlb_unregister_node(node);
>
> sysdev_unregister(&node->sysdev);
> }
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/hugetlb.h 2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/hugetlb.h 2009-08-24 12:12:56.000000000 -0400
> @@ -278,6 +278,10 @@ static inline struct hstate *page_hstate
> return size_to_hstate(PAGE_SIZE << compound_order(page));
> }
>
> +struct node;
> +extern void hugetlb_register_node(struct node *);
> +extern void hugetlb_unregister_node(struct node *);
> +
> #else
> struct hstate {};
> #define alloc_bootmem_huge_page(h) NULL
> @@ -294,6 +298,8 @@ static inline unsigned int pages_per_hug
> {
> return 1;
> }
> +#define hugetlb_register_node(NP)
> +#define hugetlb_unregister_node(NP)
> #endif
>
This also needs to be done for the !NUMA case. Try building without NUMA
set and you get the following with this patch applied
CC mm/hugetlb.o
mm/hugetlb.c: In function ‘hugetlb_exit’:
mm/hugetlb.c:1629: error: implicit declaration of function ‘hugetlb_unregister_all_nodes’
mm/hugetlb.c: In function ‘hugetlb_init’:
mm/hugetlb.c:1665: error: implicit declaration of function ‘hugetlb_register_all_nodes’
make[1]: *** [mm/hugetlb.o] Error 1
make: *** [mm] Error 2
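One possible shape for the fix is an empty pair of stubs in the #else branch
of the CONFIG_NUMA block. The following is a userspace sketch, not a tested
patch: the registered_nodes flag and the *_mock() wrappers are invented here
so the snippet builds on its own, and the NUMA branch is a placeholder for
the real loops over node_devices[].

```c
/* Mock state so the effect of the calls is observable in userspace. */
static int registered_nodes;

#ifdef CONFIG_NUMA
/* Placeholders for the real loops over node_devices[]. */
static void hugetlb_register_all_nodes(void)
{
	registered_nodes = 1;
}

static void hugetlb_unregister_all_nodes(void)
{
	registered_nodes = 0;
}
#else
/* !NUMA: nothing to register, but the callers must still compile. */
static inline void hugetlb_register_all_nodes(void) { }
static inline void hugetlb_unregister_all_nodes(void) { }
#endif

/* Stand-ins for hugetlb_init()/hugetlb_exit(), which call the above. */
static int hugetlb_init_mock(void)
{
	hugetlb_register_all_nodes();
	return 0;
}

static void hugetlb_exit_mock(void)
{
	hugetlb_unregister_all_nodes();
}
```

Empty static inlines keep hugetlb_init() and hugetlb_exit() free of #ifdefs.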
> #endif /* _LINUX_HUGETLB_H */
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:56.000000000 -0400
> @@ -24,6 +24,7 @@
> #include <asm/io.h>
>
> #include <linux/hugetlb.h>
> +#include <linux/node.h>
> #include "internal.h"
>
> const unsigned long hugetlb_zero = 0, hugetlb_infinity = ~0UL;
> @@ -1253,8 +1254,24 @@ static int adjust_pool_surplus(struct hs
> return ret;
> }
>
> +static nodemask_t *nodes_allowed_from_node(int nid)
> +{
This name is a bit weird. It's creating a nodemask with just a single
node allowed.
Is there something wrong with using the existing function
nodemask_of_node()? If stack is the problem, perhaps there is some macro
magic that would allow a nodemask to be either declared on the stack or
kmalloc'd.
> + nodemask_t *nodes_allowed;
> + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> + if (!nodes_allowed) {
> + printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> + "for huge page allocation.\nFalling back to default.\n",
> + current->comm);
> + } else {
> + nodes_clear(*nodes_allowed);
> + node_set(nid, *nodes_allowed);
> + }
> + return nodes_allowed;
> +}
> +
> #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> -static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> +static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> + int nid)
> {
> unsigned long min_count, ret;
> nodemask_t *nodes_allowed;
> @@ -1262,7 +1279,17 @@ static unsigned long set_max_huge_pages(
> if (h->order >= MAX_ORDER)
> return h->max_huge_pages;
>
> - nodes_allowed = huge_mpol_nodes_allowed();
> + if (nid < 0)
> + nodes_allowed = huge_mpol_nodes_allowed();
hugetlb is a bit littered with magic numbers being passed into functions.
Attempts have been made to clean them up as patches touch the affected
areas. Would it be possible to define something like
#define HUGETLB_OBEY_MEMPOLICY -1
for the nid here as opposed to passing in -1? I know -1 is used in the page
allocator functions but there it means "current node" and here it means
"obey mempolicies".
> + else {
> + /*
> + * incoming 'count' is for node 'nid' only, so
> + * adjust count to global, but restrict alloc/free
> + * to the specified node.
> + */
> + count += h->nr_huge_pages - h->nr_huge_pages_node[nid];
> + nodes_allowed = nodes_allowed_from_node(nid);
> + }
>
> /*
> * Increase the pool size
> @@ -1338,34 +1365,69 @@ out:
> static struct kobject *hugepages_kobj;
> static struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
>
> -static struct hstate *kobj_to_hstate(struct kobject *kobj)
> +static struct hstate *kobj_to_node_hstate(struct kobject *kobj, int *nidp)
> +{
> + int nid;
> +
> + for (nid = 0; nid < nr_node_ids; nid++) {
> + struct node *node = &node_devices[nid];
> + int hi;
> + for (hi = 0; hi < HUGE_MAX_HSTATE; hi++)
Does that hi mean hello, high, nid or hstate_idx?
hstate_idx would appear to be the appropriate name here.
> + if (node->hstate_kobjs[hi] == kobj) {
> + if (nidp)
> + *nidp = nid;
> + return &hstates[hi];
> + }
> + }
Ok.... so, there is a struct node array for the sysdev and this patch adds
references to the "hugepages" directory kobject and the subdirectories for
each page size. We walk all the objects until we find a match. Obviously,
this adds a dependency of base node support on hugetlbfs which feels backwards
and you call that out in your leader.
Can this be the other way around? i.e. The struct hstate has an array of
kobjects arranged by nid that is filled in when the node is registered?
There will only be one kobject-per-pagesize-per-node so it seems like it
would work. I confess, I haven't prototyped this to be 100% sure.
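To make the idea concrete, here is an untested userspace sketch of the
lookup with the per-node kobjects living in the hstate instead of in
struct node. It is not kernel code: sketch_kobj stands in for struct
kobject, and every sketch_* and SKETCH_* name is invented for this mail.

```c
#include <stddef.h>

#define SKETCH_MAX_HSTATE 2
#define SKETCH_MAX_NODES  4
#define SKETCH_NID_GLOBAL (-1)   /* hypothetical: kobject is not per-node */

/* Opaque stand-in for struct kobject; only its address matters here. */
struct sketch_kobj { int unused; };

struct sketch_hstate {
    struct sketch_kobj *global_kobj;
    struct sketch_kobj *node_kobjs[SKETCH_MAX_NODES];  /* indexed by nid */
};

struct sketch_hstate sketch_hstates[SKETCH_MAX_HSTATE];

/* Map a kobject back to its hstate; *nidp receives the owning node. */
struct sketch_hstate *sketch_kobj_to_hstate(const struct sketch_kobj *kobj,
                                            int *nidp)
{
    int i, nid;

    if (!kobj)
        return NULL;

    for (i = 0; i < SKETCH_MAX_HSTATE; i++) {
        struct sketch_hstate *h = &sketch_hstates[i];

        if (h->global_kobj == kobj) {
            if (nidp)
                *nidp = SKETCH_NID_GLOBAL;
            return h;
        }
        for (nid = 0; nid < SKETCH_MAX_NODES; nid++) {
            if (h->node_kobjs[nid] == kobj) {
                if (nidp)
                    *nidp = nid;
                return h;
            }
        }
    }
    return NULL;   /* the caller decides whether this is a BUG() */
}
```

The walk is the same size as before, but struct node no longer needs to
know anything about hugetlb.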
> +
> + BUG();
> + return NULL;
> +}
> +
> +static struct hstate *kobj_to_hstate(struct kobject *kobj, int *nidp)
> {
> int i;
> +
> for (i = 0; i < HUGE_MAX_HSTATE; i++)
> - if (hstate_kobjs[i] == kobj)
> + if (hstate_kobjs[i] == kobj) {
> + if (nidp)
> + *nidp = -1;
> return &hstates[i];
> - BUG();
> - return NULL;
> + }
> +
> + return kobj_to_node_hstate(kobj, nidp);
> }
>
> static ssize_t nr_hugepages_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - struct hstate *h = kobj_to_hstate(kobj);
> - return sprintf(buf, "%lu\n", h->nr_huge_pages);
> + struct hstate *h;
> + unsigned long nr_huge_pages;
> + int nid;
> +
> + h = kobj_to_hstate(kobj, &nid);
> + if (nid < 0)
> + nr_huge_pages = h->nr_huge_pages;
Here is another magic number except it means something slightly
different. It means NR_GLOBAL_HUGEPAGES or something similar. It would
be nice if these different special nid values could be named, preferably
collapsed to being one "core" thing.
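As a rough illustration of collapsing the special values into one named
thing, here is an untested userspace sketch, not kernel code.
NID_ALL_NODES, struct pool_sketch and both helpers are names made up for
this mail; only the arithmetic is lifted from the patch.

```c
#include <assert.h>

#define SKETCH_MAX_NODES 4
/* Hypothetical name: one sentinel meaning "no specific node". */
#define NID_ALL_NODES (-1)

/* Stand-in for the two hstate counters used here. */
struct pool_sketch {
    unsigned long nr_huge_pages;
    unsigned long nr_huge_pages_node[SKETCH_MAX_NODES];
};

/* Value a *_show() handler would report for this nid. */
unsigned long sketch_nr_pages(const struct pool_sketch *p, int nid)
{
    if (nid == NID_ALL_NODES)
        return p->nr_huge_pages;
    assert(nid >= 0 && nid < SKETCH_MAX_NODES);
    return p->nr_huge_pages_node[nid];
}

/*
 * Global pool size to aim for when 'count' was written for 'nid'.
 * A per-node count becomes a global one by adding the pages currently
 * held on all the other nodes, as the patch does.
 */
unsigned long sketch_global_target(const struct pool_sketch *p,
                                   unsigned long count, int nid)
{
    if (nid == NID_ALL_NODES)
        return count;
    return count + p->nr_huge_pages - sketch_nr_pages(p, nid);
}
```

With 10 pages in the pool, 4 of them on node 0, asking for 7 on node 0
gives a global target of 13, while the same request with NID_ALL_NODES
stays at 7.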
> + else
> + nr_huge_pages = h->nr_huge_pages_node[nid];
> +
> + return sprintf(buf, "%lu\n", nr_huge_pages);
> }
> +
> static ssize_t nr_hugepages_store(struct kobject *kobj,
> struct kobj_attribute *attr, const char *buf, size_t count)
> {
> - int err;
> unsigned long input;
> - struct hstate *h = kobj_to_hstate(kobj);
> + struct hstate *h;
> + int nid;
> + int err;
>
> err = strict_strtoul(buf, 10, &input);
> if (err)
> return 0;
>
> - h->max_huge_pages = set_max_huge_pages(h, input);
"input" is a bit meaningless. The function you are passing to calls this
parameter "count". Can you match the naming please? Otherwise, I might
guess that this is a "delta" which occurs elsewhere in the hugetlb code.
> + h = kobj_to_hstate(kobj, &nid);
> + h->max_huge_pages = set_max_huge_pages(h, input, nid);
>
> return count;
> }
> @@ -1374,15 +1436,17 @@ HSTATE_ATTR(nr_hugepages);
> static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - struct hstate *h = kobj_to_hstate(kobj);
> + struct hstate *h = kobj_to_hstate(kobj, NULL);
> +
> return sprintf(buf, "%lu\n", h->nr_overcommit_huge_pages);
> }
> +
> static ssize_t nr_overcommit_hugepages_store(struct kobject *kobj,
> struct kobj_attribute *attr, const char *buf, size_t count)
> {
> int err;
> unsigned long input;
> - struct hstate *h = kobj_to_hstate(kobj);
> + struct hstate *h = kobj_to_hstate(kobj, NULL);
>
> err = strict_strtoul(buf, 10, &input);
> if (err)
> @@ -1399,15 +1463,24 @@ HSTATE_ATTR(nr_overcommit_hugepages);
> static ssize_t free_hugepages_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - struct hstate *h = kobj_to_hstate(kobj);
> - return sprintf(buf, "%lu\n", h->free_huge_pages);
> + struct hstate *h;
> + unsigned long free_huge_pages;
> + int nid;
> +
> + h = kobj_to_hstate(kobj, &nid);
> + if (nid < 0)
> + free_huge_pages = h->free_huge_pages;
> + else
> + free_huge_pages = h->free_huge_pages_node[nid];
> +
> + return sprintf(buf, "%lu\n", free_huge_pages);
> }
> HSTATE_ATTR_RO(free_hugepages);
>
> static ssize_t resv_hugepages_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - struct hstate *h = kobj_to_hstate(kobj);
> + struct hstate *h = kobj_to_hstate(kobj, NULL);
> return sprintf(buf, "%lu\n", h->resv_huge_pages);
> }
> HSTATE_ATTR_RO(resv_hugepages);
> @@ -1415,8 +1488,17 @@ HSTATE_ATTR_RO(resv_hugepages);
> static ssize_t surplus_hugepages_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> - struct hstate *h = kobj_to_hstate(kobj);
> - return sprintf(buf, "%lu\n", h->surplus_huge_pages);
> + struct hstate *h;
> + unsigned long surplus_huge_pages;
> + int nid;
> +
> + h = kobj_to_hstate(kobj, &nid);
> + if (nid < 0)
> + surplus_huge_pages = h->surplus_huge_pages;
> + else
> + surplus_huge_pages = h->surplus_huge_pages_node[nid];
> +
> + return sprintf(buf, "%lu\n", surplus_huge_pages);
> }
> HSTATE_ATTR_RO(surplus_hugepages);
>
> @@ -1433,19 +1515,21 @@ static struct attribute_group hstate_att
> .attrs = hstate_attrs,
> };
>
> -static int __init hugetlb_sysfs_add_hstate(struct hstate *h)
> +static int __init hugetlb_sysfs_add_hstate(struct hstate *h,
> + struct kobject *parent,
> + struct kobject **hstate_kobjs,
> + struct attribute_group *hstate_attr_group)
> {
> int retval;
> + int hi = h - hstates;
>
> - hstate_kobjs[h - hstates] = kobject_create_and_add(h->name,
> - hugepages_kobj);
> - if (!hstate_kobjs[h - hstates])
> + hstate_kobjs[hi] = kobject_create_and_add(h->name, parent);
> + if (!hstate_kobjs[hi])
> return -ENOMEM;
>
> - retval = sysfs_create_group(hstate_kobjs[h - hstates],
> - &hstate_attr_group);
> + retval = sysfs_create_group(hstate_kobjs[hi], hstate_attr_group);
> if (retval)
> - kobject_put(hstate_kobjs[h - hstates]);
> + kobject_put(hstate_kobjs[hi]);
>
> return retval;
> }
> @@ -1460,17 +1544,90 @@ static void __init hugetlb_sysfs_init(vo
> return;
>
> for_each_hstate(h) {
> - err = hugetlb_sysfs_add_hstate(h);
> + err = hugetlb_sysfs_add_hstate(h, hugepages_kobj,
> + hstate_kobjs, &hstate_attr_group);
> if (err)
> printk(KERN_ERR "Hugetlb: Unable to add hstate %s",
> h->name);
> }
> }
>
> +#ifdef CONFIG_NUMA
> +static struct attribute *per_node_hstate_attrs[] = {
> + &nr_hugepages_attr.attr,
> + &free_hugepages_attr.attr,
> + &surplus_hugepages_attr.attr,
> + NULL,
> +};
> +
> +static struct attribute_group per_node_hstate_attr_group = {
> + .attrs = per_node_hstate_attrs,
> +};
> +
> +
> +void hugetlb_unregister_node(struct node *node)
> +{
> + struct hstate *h;
> +
> + for_each_hstate(h) {
> + kobject_put(node->hstate_kobjs[h - hstates]);
> + node->hstate_kobjs[h - hstates] = NULL;
> + }
> +
> + kobject_put(node->hugepages_kobj);
> + node->hugepages_kobj = NULL;
> +}
> +
> +static void hugetlb_unregister_all_nodes(void)
> +{
> + int nid;
> +
> + for (nid = 0; nid < nr_node_ids; nid++)
> + hugetlb_unregister_node(&node_devices[nid]);
> +}
> +
> +void hugetlb_register_node(struct node *node)
> +{
> + struct hstate *h;
> + int err;
> +
> + if (!hugepages_kobj)
> + return; /* too early */
> +
> + node->hugepages_kobj = kobject_create_and_add("hugepages",
> + &node->sysdev.kobj);
> + if (!node->hugepages_kobj)
> + return;
> +
> + for_each_hstate(h) {
> + err = hugetlb_sysfs_add_hstate(h, node->hugepages_kobj,
> + node->hstate_kobjs,
> + &per_node_hstate_attr_group);
> + if (err)
> + printk(KERN_ERR "Hugetlb: Unable to add hstate %s"
> + " for node %d\n",
> + h->name, node->sysdev.id);
> + }
> +}
> +
> +static void hugetlb_register_all_nodes(void)
> +{
> + int nid;
> +
> + for (nid = 0; nid < nr_node_ids; nid++) {
> + struct node *node = &node_devices[nid];
> + if (node->sysdev.id == nid && !node->hugepages_kobj)
> + hugetlb_register_node(node);
> + }
> +}
> +#endif
> +
> static void __exit hugetlb_exit(void)
> {
> struct hstate *h;
>
> + hugetlb_unregister_all_nodes();
> +
> for_each_hstate(h) {
> kobject_put(hstate_kobjs[h - hstates]);
> }
> @@ -1505,6 +1662,8 @@ static int __init hugetlb_init(void)
>
> hugetlb_sysfs_init();
>
> + hugetlb_register_all_nodes();
> +
> return 0;
> }
> module_init(hugetlb_init);
> @@ -1607,7 +1766,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
> proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
>
> if (write)
> - h->max_huge_pages = set_max_huge_pages(h, tmp);
> + h->max_huge_pages = set_max_huge_pages(h, tmp, -1);
>
> return 0;
> }
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400
> @@ -21,9 +21,12 @@
>
> #include <linux/sysdev.h>
> #include <linux/cpumask.h>
> +#include <linux/hugetlb.h>
>
> struct node {
> struct sys_device sysdev;
> + struct kobject *hugepages_kobj;
> + struct kobject *hstate_kobjs[HUGE_MAX_HSTATE];
> };
>
> struct memory_block;
>
I'm not against this idea and think it can work side-by-side with the memory
policies. I believe it does need a bit more cleaning up before merging
though. I also wasn't able to test this yet due to various build and
deploy issues.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply
* Re: [Bug #14016] mm/ipw2200 regression
From: Mel Gorman @ 2009-08-26 9:37 UTC (permalink / raw)
To: Johannes Weiner
Cc: Pekka Enberg, Rafael J. Wysocki, Linux Kernel Mailing List,
Kernel Testers List, Bartlomiej Zolnierkiewicz, Mel Gorman,
Andrew Morton, netdev, linux-mm
In-Reply-To: <20090826082741.GA25955@cmpxchg.org>
On Wed, Aug 26, 2009 at 10:27:41AM +0200, Johannes Weiner wrote:
> [Cc netdev]
>
> On Wed, Aug 26, 2009 at 09:09:44AM +0300, Pekka Enberg wrote:
> > On Tue, Aug 25, 2009 at 11:34 PM, Rafael J. Wysocki<rjw@sisk.pl> wrote:
> > > This message has been generated automatically as a part of a report
> > > of recent regressions.
> > >
> > > The following bug entry is on the current list of known regressions
> > > from 2.6.30. Please verify if it still should be listed and let me know
> > > (either way).
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14016
> > > Subject : mm/ipw2200 regression
> > > Submitter : Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
> > > Date : 2009-08-15 16:56 (11 days old)
> > > References : http://marc.info/?l=linux-kernel&m=125036437221408&w=4
> >
> > If I am reading the page allocator dump correctly, there's plenty of
> > pages left but we're unable to satisfy an order 6 allocation. There's
> > no slab allocator involved so the page allocator changes that went
> > into 2.6.31 seem likely. Mel, ideas?
>
> It's an atomic order-6 allocation, the chances for this to succeed
> after some uptime become infinitesimal. The chunks > order-2 are
> pretty much exhausted on this dump.
>
> 64 pages, presumably 256k, for fw->boot_size while current ipw
> firmware images have ~188k. I don't know jack squat about this
> driver, but given the field name and the struct:
>
> struct ipw_fw {
> __le32 ver;
> __le32 boot_size;
> __le32 ucode_size;
> __le32 fw_size;
> u8 data[0];
> };
>
> fw->boot_size alone being that big sounds a bit fishy to me.
>
Agreed. While there are a low number of order-6 pages free in the page
allocation failure dump, there are not enough for watermarks to be
satisfied. As it's atomic, there is little that can be done from a VM
perspective and it's the responsibility of the driver. I'm no driver expert
but I'll have a go at fixing it anyway.
My reading of this is that the firmware is being loaded from a workqueue and
I am failing to see any restriction on sleeping in the path. It would appear
that the driver just used the most convenient *_alloc_coherent function
available forgetting that it assumes GFP_ATOMIC. Can someone who does know
which way is up with a driver tell me why the patch below might not
work?
Bartlomiej, any chance you could give this a spin? Preferably, you'd
have preempt enabled and CONFIG_DEBUG_SPINLOCK_SLEEP on as well because
that combination will complain loudly if we really can't sleep in this
path.
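For whoever picks this up, the allocate-once alternative mentioned in the
patch comment could be shaped roughly like this. It is a userspace sketch
only: malloc() stands in for the coherent DMA allocation, and
fw_buf_sketch and its helpers are made-up names, not anything that exists
in ipw2200.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for a buffer allocated once at init time. */
struct fw_buf_sketch {
    unsigned char *data;
    size_t cap;   /* bytes allocated */
    size_t len;   /* bytes of the image currently stored */
};

/* Called once at init, where sleeping for a large allocation is fine. */
int fw_buf_init(struct fw_buf_sketch *b, size_t cap)
{
    b->data = malloc(cap);
    if (!b->data)
        return -1;
    b->cap = cap;
    b->len = 0;
    return 0;
}

/* Called on every firmware (re)load: no allocation, just overwrite. */
int fw_buf_load(struct fw_buf_sketch *b, const void *image, size_t len)
{
    if (len > b->cap)
        return -1;   /* image does not fit; buffer is left untouched */
    memcpy(b->data, image, len);
    b->len = len;
    return 0;
}

void fw_buf_destroy(struct fw_buf_sketch *b)
{
    free(b->data);
    b->data = NULL;
    b->cap = 0;
    b->len = 0;
}
```

The point is that the reload path never has to find a fresh high-order
buffer on a fragmented system; the cost is holding the memory for the
lifetime of the driver.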
=====
ipw2200: Avoid large GFP_ATOMIC allocation during firmware loading
ipw2200 uses pci_alloc_consistent() to allocate a large coherent buffer for
the loading of firmware which is an order-6 allocation of GFP_ATOMIC. At
system start-up time, this is not a problem. However, the firmware on the
card can get confused and the corrective action taken is to reload the
firmware and reinit the card. High-order GFP_ATOMIC allocations of this
type can and will fail when the system is already up and running.
As the firmware is loaded from a workqueue, it should be possible for
the driver to go to sleep. This patch converts the call of
pci_alloc_consistent() which assumes GFP_ATOMIC to dma_alloc_coherent()
which can specify its own flags.
The big downside with this patch is that it uses GFP_REPEAT to avoid the
driver unloading. There is potential that this will cause a reclaim
storm as the machine tries to find a free order-6 buffer. A suggested
alternative for the driver owner is in the comments.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
drivers/net/wireless/ipw2x00/ipw2200.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)
diff --git a/drivers/net/wireless/ipw2x00/ipw2200.c b/drivers/net/wireless/ipw2x00/ipw2200.c
index 44c29b3..f2e251e 100644
--- a/drivers/net/wireless/ipw2x00/ipw2200.c
+++ b/drivers/net/wireless/ipw2x00/ipw2200.c
@@ -3167,7 +3167,19 @@ static int ipw_load_firmware(struct ipw_priv *priv, u8 * data, size_t len)
u8 *shared_virt;
IPW_DEBUG_TRACE("<< : \n");
- shared_virt = pci_alloc_consistent(priv->pci_dev, len, &shared_phys);
+
+ /*
+ * This is a whopping large allocation, in or around order-6 so
+ * dma_alloc_coherent is used to specify the GFP_KERNEL|__GFP_REPEAT
+ * flags. Note that this action means the system could go into a
+ * reclaim loop until it cannot reclaim any more trying to satisfy
+ * the allocation. It would be preferable if one buffer is allocated
+ * at driver initialisation and reused when the firmware needs to
+ * be reloaded, overwriting the existing firmware each time
+ */
+ shared_virt = dma_alloc_coherent(
+ priv->pci_dev == NULL ? NULL : &priv->pci_dev->dev,
+ len, &shared_phys, GFP_KERNEL|__GFP_REPEAT);
if (!shared_virt)
return -ENOMEM;
^ permalink raw reply related
* Re: [PATCH 4/5] hugetlb: add per node hstate attributes
From: Mel Gorman @ 2009-08-26 10:12 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, linux-numa, akpm, Nishanth Aravamudan, David Rientjes,
Adam Litke, Andy Whitcroft, eric.whitney
In-Reply-To: <1251233380.16229.3.camel@useless.americas.hpqcorp.net>
On Tue, Aug 25, 2009 at 04:49:40PM -0400, Lee Schermerhorn wrote:
> On Tue, 2009-08-25 at 14:35 +0100, Mel Gorman wrote:
> > On Mon, Aug 24, 2009 at 03:29:02PM -0400, Lee Schermerhorn wrote:
> > > <SNIP>
> > >
> > > Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h
> > > ===================================================================
> > > --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/node.h 2009-08-24 12:12:44.000000000 -0400
> > > +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/node.h 2009-08-24 12:12:56.000000000 -0400
> > > @@ -21,9 +21,12 @@
> > >
> > > #include <linux/sysdev.h>
> > > #include <linux/cpumask.h>
> > > +#include <linux/hugetlb.h>
> > >
> >
> > Is this header inclusion necessary? It does not appear to be required by
> > the structure modification (which is iffy in itself as discussed in the
> > earlier mail) and it breaks build on x86-64.
>
> Hi, Mel:
>
> I recall that it is necessary to build. You can try w/o it.
>
I did, it appeared to work but I didn't dig deep as to why.
> >
> > CC arch/x86/kernel/setup_percpu.o
> > In file included from include/linux/pagemap.h:10,
> > from include/linux/mempolicy.h:62,
> > from include/linux/hugetlb.h:8,
> > from include/linux/node.h:24,
> > from include/linux/cpu.h:23,
> > from /usr/local/autobench/var/tmp/build/arch/x86/include/asm/cpu.h:5,
> > from arch/x86/kernel/setup_percpu.c:19:
> > include/linux/highmem.h:53: error: static declaration of kmap follows non-static declaration
> > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:60: error: previous declaration of kmap was here
> > include/linux/highmem.h:59: error: static declaration of kunmap follows non-static declaration
> > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:61: error: previous declaration of kunmap was here
> > include/linux/highmem.h:63: error: static declaration of kmap_atomic follows non-static declaration
> > /usr/local/autobench/var/tmp/build/arch/x86/include/asm/highmem.h:63: error: previous declaration of kmap_atomic was here
> > make[2]: *** [arch/x86/kernel/setup_percpu.o] Error 1
> > make[1]: *** [arch/x86/kernel] Error 2
>
> I saw this. I've been testing on x86_64. I *thought* that it only
> started showing up in a recent mmotm from changes in the linux-next
> patch--e.g., a failure to set ARCH_HAS_KMAP or to handle appropriately
> !ARCH_HAS_KMAP in highmem.h But maybe that was coincidental with my
> adding the include.
>
Maybe we were looking at different mmotm's
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply
* [PATCH] REGULATOR Handle positive returncode from enable
From: Linus Walleij @ 2009-08-26 10:54 UTC (permalink / raw)
To: Liam Girdwood, Mark Brown; +Cc: linux-kernel, Linus Walleij
This makes _regulator_enable() properly handle the case where
a regulator is already on when you try to enable it. Currently
it will erroneously handle positive return values as an error.
Signed-off-by: Linus Walleij <linus.walleij@stericsson.com>
---
drivers/regulator/core.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index dbf27bf..744ea1d 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -1236,11 +1236,12 @@ static int _regulator_enable(struct regulator_dev *rdev)
} else {
return -EINVAL;
}
- } else {
+ } else if (ret < 0) {
printk(KERN_ERR "%s: is_enabled() failed for %s: %d\n",
__func__, rdev->desc->name, ret);
return ret;
}
+ /* Fallthrough on positive return values - already enabled */
}
rdev->use_count++;
--
1.6.2.1
^ permalink raw reply related
* [B.A.T.M.A.N.] New name for batman-adv?
From: Andrew Lunn @ 2009-08-26 10:52 UTC (permalink / raw)
To: The list for a Better Approach To Mobile Ad-hoc Networking
In-Reply-To: <200908261651.16980.lindner_marek@yahoo.de>
> You are right - the situation still is a bit messy. I think Andrew has a point
> saying that "adv" might be rejected. I even could imagine that the name
> "batman" causes some irritation. ;-)
>
> A while back Justin Dean suggested a new name for the protocol:
> stateless proactive adhoc networking protocol
> Does not sound too bad in my opinion. Other opinions ?
spanp
Not the nicest of acronyms. It would be good to be able to build a
short, 3 letter abbreviation of the acronym for the network device
name. That is one thing that batman->bat0 has going for it.
Andrew
^ permalink raw reply
* Re: kexec-tools-2.0.1 and CFLAGS
From: Simon Horman @ 2009-08-26 10:50 UTC (permalink / raw)
To: Magnus Damm; +Cc: kexec
In-Reply-To: <aec7e5c30908260056w629fb2fp4be4c8660860f1c4@mail.gmail.com>
On Wed, Aug 26, 2009 at 04:56:49PM +0900, Magnus Damm wrote:
> Hi Simon!
>
> On Wed, Aug 26, 2009 at 3:31 PM, Simon Horman<horms@verge.net.au> wrote:
> > On Thu, Aug 20, 2009 at 10:14:19PM +0900, Magnus Damm wrote:
> >> Kexec-tools 2.0.1 seems to build only with optimization enabled. If I
> >> set CFLAGS before calling configure and remove the "-O2" then the code
> >> won't link properly. I found it while cross compiling for SuperH, but
> >> it's most likely an issue on other platforms as well. Have a look at
> >> the "-O0" below:
> >>
> >> $ AR=_ar CC=_gcc CFLAGS="-O0" LDFLAGS="-static" ./configure
> >> --prefix=/ --host="sh3-linux" --without-zlib --without-xen
>
> [snip]
>
> >> This issue is most likely related to get_unaligned() and
> >> put_unaligned() having bad_unaligned_access_length() in their default
> >> case that never gets optimized away.
> >>
> >> Anyway, not very important. But at least you know now. =)
> >
> > Hi Magnus,
> >
> > I'm somewhat dubious about this, but it appears to be intentional that
> > get_unaligned() doesn't exist:
> >
> > ---- From kexec/kexec.h ----
> >
> > /*
> > * This function doesn't actually exist. The idea is that when someone
> > * uses the macros below with an unsupported size (datatype), the linker
> > * will alert us to the problem via an unresolved reference error.
> > */
>
> Yes, I understand the purpose of the code and it makes sense. But the
> data size in this case _is_ correct. The issue here seems to be that
> the compiler is not optimizing away the default case reference unless
> optimization is enabled. I'd suggest moving away from automatic type
> size detection to have separate functions for each size. That would
> make it possible to compile the code both with and without optimization
> enabled.
>
> Or maybe I'm misunderstanding the issue? =)
I think you are understanding it right, that the current code fails
with no optimisation enabled because the unused function isn't optimised
away. Is there a reason you need -O0? It seems that fixing this is probably
more effort than it's worth.
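For reference, the per-size approach would presumably look something like
the sketch below. This is not what kexec/kexec.h does today: the function
names are invented, and memcpy() is just one portable way of doing the
access.

```c
#include <stdint.h>
#include <string.h>

/* One accessor per width, so nothing depends on a sizeof() switch. */
uint16_t sketch_get_unaligned16(const void *p)
{
    uint16_t v;

    memcpy(&v, p, sizeof(v));   /* safe for any alignment of p */
    return v;
}

uint32_t sketch_get_unaligned32(const void *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}

void sketch_put_unaligned16(void *p, uint16_t v)
{
    memcpy(p, &v, sizeof(v));
}

void sketch_put_unaligned32(void *p, uint32_t v)
{
    memcpy(p, &v, sizeof(v));
}
```

Because there is no default case referencing a non-existent function,
nothing relies on the optimiser to discard it, so this should link at
-O0 as well. The price is losing the link-time check for unsupported
sizes.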
_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
^ permalink raw reply
* [Qemu-devel] Re: [PATCH] net: Fix send queue ordering
From: Jan Kiszka @ 2009-08-26 10:49 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Mark McLoughlin, qemu-devel
In-Reply-To: <4A9512A8.1070709@siemens.com>
Jan Kiszka wrote:
> Ensure that packets enqueued for delayed delivery are dequeued in FIFO
> order. At least one simplistic guest TCP/IP stack became unhappy due to
> sporadically reordered packet streams.
Before I forget: This should also be considered for 0.11-rc (it's a 0.11
regression).
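In case the ordering issue is not obvious from the diff: the old code
pushed each new packet onto the head of a singly linked list and also
flushed from the head, so queued packets left in reverse order. Below is
a standalone sketch of the tail-insert/head-remove pattern, not the QEMU
code. struct pkt and the two helpers are made up, and it assumes a
<sys/queue.h> that provides the TAILQ macros (glibc and the BSDs do).

```c
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical packet: only a sequence number, which must be >= 0. */
struct pkt {
    int seq;
    TAILQ_ENTRY(pkt) entry;
};

TAILQ_HEAD(pkt_queue, pkt);

/* Append at the tail so the oldest packet stays at the head. */
int pkt_enqueue(struct pkt_queue *q, int seq)
{
    struct pkt *p = malloc(sizeof(*p));

    if (!p)
        return -1;
    p->seq = seq;
    TAILQ_INSERT_TAIL(q, p, entry);
    return 0;
}

/* Remove the oldest packet; returns its seq, or -1 if the queue is empty. */
int pkt_dequeue(struct pkt_queue *q)
{
    struct pkt *p = TAILQ_FIRST(q);
    int seq;

    if (!p)
        return -1;
    TAILQ_REMOVE(q, p, entry);
    seq = p->seq;
    free(p);
    return seq;
}
```

After TAILQ_INIT() on the head, enqueueing 1, 2, 3 and then dequeueing
yields them in that same order.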
>
> At this chance, switch the send queue implementation to TAILQ.
>
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
>
> net.c | 29 ++++++++++++-----------------
> net.h | 5 +++--
> 2 files changed, 15 insertions(+), 19 deletions(-)
>
> diff --git a/net.c b/net.c
> index 1b531e7..d7a1ab6 100644
> --- a/net.c
> +++ b/net.c
> @@ -436,33 +436,28 @@ qemu_deliver_packet(VLANClientState *sender, const uint8_t *buf, int size)
>
> void qemu_purge_queued_packets(VLANClientState *vc)
> {
> - VLANPacket **pp = &vc->vlan->send_queue;
> -
> - while (*pp != NULL) {
> - VLANPacket *packet = *pp;
> + VLANPacket *packet, *next;
>
> + TAILQ_FOREACH_SAFE(packet, &vc->vlan->send_queue, entry, next) {
> if (packet->sender == vc) {
> - *pp = packet->next;
> + TAILQ_REMOVE(&vc->vlan->send_queue, packet, entry);
> qemu_free(packet);
> - } else {
> - pp = &packet->next;
> }
> }
> }
>
> void qemu_flush_queued_packets(VLANClientState *vc)
> {
> - VLANPacket *packet;
> -
> - while ((packet = vc->vlan->send_queue) != NULL) {
> + while (!TAILQ_EMPTY(&vc->vlan->send_queue)) {
> + VLANPacket *packet;
> int ret;
>
> - vc->vlan->send_queue = packet->next;
> + packet = TAILQ_FIRST(&vc->vlan->send_queue);
> + TAILQ_REMOVE(&vc->vlan->send_queue, packet, entry);
>
> ret = qemu_deliver_packet(packet->sender, packet->data, packet->size);
> if (ret == 0 && packet->sent_cb != NULL) {
> - packet->next = vc->vlan->send_queue;
> - vc->vlan->send_queue = packet;
> + TAILQ_INSERT_HEAD(&vc->vlan->send_queue, packet, entry);
> break;
> }
>
> @@ -480,12 +475,12 @@ static void qemu_enqueue_packet(VLANClientState *sender,
> VLANPacket *packet;
>
> packet = qemu_malloc(sizeof(VLANPacket) + size);
> - packet->next = sender->vlan->send_queue;
> packet->sender = sender;
> packet->size = size;
> packet->sent_cb = sent_cb;
> memcpy(packet->data, buf, size);
> - sender->vlan->send_queue = packet;
> +
> + TAILQ_INSERT_TAIL(&sender->vlan->send_queue, packet, entry);
> }
>
> ssize_t qemu_send_packet_async(VLANClientState *sender,
> @@ -597,7 +592,6 @@ static ssize_t qemu_enqueue_packet_iov(VLANClientState *sender,
> max_len = calc_iov_length(iov, iovcnt);
>
> packet = qemu_malloc(sizeof(VLANPacket) + max_len);
> - packet->next = sender->vlan->send_queue;
> packet->sender = sender;
> packet->sent_cb = sent_cb;
> packet->size = 0;
> @@ -609,7 +603,7 @@ static ssize_t qemu_enqueue_packet_iov(VLANClientState *sender,
> packet->size += len;
> }
>
> - sender->vlan->send_queue = packet;
> + TAILQ_INSERT_TAIL(&sender->vlan->send_queue, packet, entry);
>
> return packet->size;
> }
> @@ -2330,6 +2324,7 @@ VLANState *qemu_find_vlan(int id, int allocate)
> }
> vlan = qemu_mallocz(sizeof(VLANState));
> vlan->id = id;
> + TAILQ_INIT(&vlan->send_queue);
> vlan->next = NULL;
> pvlan = &first_vlan;
> while (*pvlan != NULL)
> diff --git a/net.h b/net.h
> index 3ac9e8c..bab02f5 100644
> --- a/net.h
> +++ b/net.h
> @@ -1,6 +1,7 @@
> #ifndef QEMU_NET_H
> #define QEMU_NET_H
>
> +#include "sys-queue.h"
> #include "qemu-common.h"
>
> /* VLANs support */
> @@ -35,7 +36,7 @@ typedef struct VLANPacket VLANPacket;
> typedef void (NetPacketSent) (VLANClientState *, ssize_t);
>
> struct VLANPacket {
> - struct VLANPacket *next;
> + TAILQ_ENTRY(VLANPacket) entry;
> VLANClientState *sender;
> int size;
> NetPacketSent *sent_cb;
> @@ -47,7 +48,7 @@ struct VLANState {
> VLANClientState *first_client;
> struct VLANState *next;
> unsigned int nb_guest_devs, nb_host_devs;
> - VLANPacket *send_queue;
> + TAILQ_HEAD(send_queue, VLANPacket) send_queue;
> int delivering;
> };
>
>
Jan
--
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
^ permalink raw reply
* [PATCH] sh: fix CPU_SH7723/7724 numbering bug
From: Kuninori Morimoto @ 2009-08-26 10:49 UTC (permalink / raw)
To: linux-sh
Signed-off-by: Kuninori Morimoto <morimoto.kuninori@renesas.com>
---
arch/sh/kernel/cpu/sh4/probe.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/sh/kernel/cpu/sh4/probe.c b/arch/sh/kernel/cpu/sh4/probe.c
index 10e6795..afd3e73 100644
--- a/arch/sh/kernel/cpu/sh4/probe.c
+++ b/arch/sh/kernel/cpu/sh4/probe.c
@@ -141,7 +141,7 @@ int __init detect_cpu_and_cache_system(void)
case 0x300b:
switch (prr) {
case 0x20:
- boot_cpu_data.type = CPU_SH7723;
+ boot_cpu_data.type = CPU_SH7724;
boot_cpu_data.flags |= CPU_HAS_L2_CACHE;
break;
case 0x50:
--
1.6.0.4
^ permalink raw reply related
* Re: WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f6f6e1a4), by kmemleak's scan_block()
From: Pekka Enberg @ 2009-08-26 10:48 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Catalin Marinas, Vegard Nossum, linux-kernel
In-Reply-To: <20090825093423.GA12935@elte.hu>
On Tue, Aug 25, 2009 at 12:34 PM, Ingo Molnar<mingo@elte.hu> wrote:
>> On Tue, Aug 25, 2009 at 12:28 PM, Catalin
>> Marinas<catalin.marinas@arm.com> wrote:
>> >> Does this look OK to you?
>> >
>> > For the kmemleak.c part:
>> >
>> > Acked-by: Catalin Marinas <catalin.marinas@arm.com>
>>
>> Vegard? Ingo? The patch is based on tip/out-of-tree so it probably
>> should go to the kmemleak tree?
>
> I'm testing it currently - but yeah, i'd agree that it should go
> into the kmemleak tree, with a .32 merge date or so.
Ingo/Vegard, ACK/NAK? I don't want my amazingly good patch to end up
in /dev/null so can we just shove it in kmemleak.git?
Pekka
^ permalink raw reply
* [Qemu-devel] [PATCH] net: Fix send queue ordering
From: Jan Kiszka @ 2009-08-26 10:47 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Mark McLoughlin, qemu-devel
Ensure that packets enqueued for delayed delivery are dequeued in FIFO
order. At least one simplistic guest TCP/IP stack became unhappy due to
sporadically reordered packet streams.
At this chance, switch the send queue implementation to TAILQ.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
net.c | 29 ++++++++++++-----------------
net.h | 5 +++--
2 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/net.c b/net.c
index 1b531e7..d7a1ab6 100644
--- a/net.c
+++ b/net.c
@@ -436,33 +436,28 @@ qemu_deliver_packet(VLANClientState *sender, const uint8_t *buf, int size)
void qemu_purge_queued_packets(VLANClientState *vc)
{
- VLANPacket **pp = &vc->vlan->send_queue;
-
- while (*pp != NULL) {
- VLANPacket *packet = *pp;
+ VLANPacket *packet, *next;
+ TAILQ_FOREACH_SAFE(packet, &vc->vlan->send_queue, entry, next) {
if (packet->sender == vc) {
- *pp = packet->next;
+ TAILQ_REMOVE(&vc->vlan->send_queue, packet, entry);
qemu_free(packet);
- } else {
- pp = &packet->next;
}
}
}
void qemu_flush_queued_packets(VLANClientState *vc)
{
- VLANPacket *packet;
-
- while ((packet = vc->vlan->send_queue) != NULL) {
+ while (!TAILQ_EMPTY(&vc->vlan->send_queue)) {
+ VLANPacket *packet;
int ret;
- vc->vlan->send_queue = packet->next;
+ packet = TAILQ_FIRST(&vc->vlan->send_queue);
+ TAILQ_REMOVE(&vc->vlan->send_queue, packet, entry);
ret = qemu_deliver_packet(packet->sender, packet->data, packet->size);
if (ret == 0 && packet->sent_cb != NULL) {
- packet->next = vc->vlan->send_queue;
- vc->vlan->send_queue = packet;
+ TAILQ_INSERT_HEAD(&vc->vlan->send_queue, packet, entry);
break;
}
@@ -480,12 +475,12 @@ static void qemu_enqueue_packet(VLANClientState *sender,
VLANPacket *packet;
packet = qemu_malloc(sizeof(VLANPacket) + size);
- packet->next = sender->vlan->send_queue;
packet->sender = sender;
packet->size = size;
packet->sent_cb = sent_cb;
memcpy(packet->data, buf, size);
- sender->vlan->send_queue = packet;
+
+ TAILQ_INSERT_TAIL(&sender->vlan->send_queue, packet, entry);
}
ssize_t qemu_send_packet_async(VLANClientState *sender,
@@ -597,7 +592,6 @@ static ssize_t qemu_enqueue_packet_iov(VLANClientState *sender,
max_len = calc_iov_length(iov, iovcnt);
packet = qemu_malloc(sizeof(VLANPacket) + max_len);
- packet->next = sender->vlan->send_queue;
packet->sender = sender;
packet->sent_cb = sent_cb;
packet->size = 0;
@@ -609,7 +603,7 @@ static ssize_t qemu_enqueue_packet_iov(VLANClientState *sender,
packet->size += len;
}
- sender->vlan->send_queue = packet;
+ TAILQ_INSERT_TAIL(&sender->vlan->send_queue, packet, entry);
return packet->size;
}
@@ -2330,6 +2324,7 @@ VLANState *qemu_find_vlan(int id, int allocate)
}
vlan = qemu_mallocz(sizeof(VLANState));
vlan->id = id;
+ TAILQ_INIT(&vlan->send_queue);
vlan->next = NULL;
pvlan = &first_vlan;
while (*pvlan != NULL)
diff --git a/net.h b/net.h
index 3ac9e8c..bab02f5 100644
--- a/net.h
+++ b/net.h
@@ -1,6 +1,7 @@
#ifndef QEMU_NET_H
#define QEMU_NET_H
+#include "sys-queue.h"
#include "qemu-common.h"
/* VLANs support */
@@ -35,7 +36,7 @@ typedef struct VLANPacket VLANPacket;
typedef void (NetPacketSent) (VLANClientState *, ssize_t);
struct VLANPacket {
- struct VLANPacket *next;
+ TAILQ_ENTRY(VLANPacket) entry;
VLANClientState *sender;
int size;
NetPacketSent *sent_cb;
@@ -47,7 +48,7 @@ struct VLANState {
VLANClientState *first_client;
struct VLANState *next;
unsigned int nb_guest_devs, nb_host_devs;
- VLANPacket *send_queue;
+ TAILQ_HEAD(send_queue, VLANPacket) send_queue;
int delivering;
};
* [B.A.T.M.A.N.] Breaking long lines...
From: Andrew Lunn @ 2009-08-26 10:47 UTC (permalink / raw)
To: The list for a Better Approach To Mobile Ad-hoc Networking
In-Reply-To: <200908261651.16980.lindner_marek@yahoo.de>
> > I found also some other things and also told that Marek - but I think that
> > not everything was included in the patch I send some weeks ago. Maybe it
> > was only to break printk statements to fit in 80 chars per line, but I am
> > not sure right now.
>
> Yes, you found a way to break a long string into smaller pieces although I
> can't quite remember how you did it. ;-)
ANSI C/C++ allows:
"foo"
"bar"
and the compiler will glue the parts back together as "foobar".
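A small sketch of that rule, for illustration only. The function names
and the message text in long_message() are made up and do not come from
the batman-adv sources; the point is just that adjacent string literals
become one string, which lets a long format string be split to stay
within 80 columns.

```c
#include <string.h>

/* Adjacent string literals are concatenated during translation. */
static const char *joined(void)
{
    return "foo"
           "bar";
}

/* A hypothetical long format string split across two source lines. */
static const char *long_message(void)
{
    return "device %s: failed to allocate receive buffer, "
           "falling back to polling mode\n";
}
```

Here strcmp(joined(), "foobar") returns 0, and long_message() yields a
single format string with no break between "buffer, " and "falling".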
However, I often don't like the resulting code layout. I also think
there might be a bug in checkpatch. The relevant code is:
#80 column limit
if ($line =~ /^\+/ && $prevrawline !~ /\/\*\*/ &&
$rawline !~ /^.\s*\*\s*\@$Ident\s/ &&
$line !~ /^\+\s*printk\s*\(\s*(?:KERN_\S+\s*)?"[X\t]*"\s*(?:,|\)\s*;)\s*$/ &&
$length > 80)
{
WARN("line over 80 characters\n" . $herecurr);
}
My Perl is not good, but it appears to be looking for KERN_, e.g.
KERN_ERR, KERN_WARNING etc., and maybe should be ignoring such lines?
Andrew
* [PATCH] omap iommu: avoid remapping if it's been mapped in MPU side
From: Hiroshi DOYU @ 2009-08-26 10:45 UTC (permalink / raw)
To: linux-omap@vger.kernel.org
In-Reply-To: <1251283037-19573-1-git-send-email-sakari.ailus@maxwell.research.nokia.com>
From: Hiroshi DOYU <Hiroshi.DOYU@nokia.com>
MPU side (v)-(p) mapping is necessary only if IOVMF_MMIO is set in
"flags".
Signed-off-by: Hiroshi DOYU <Hiroshi.DOYU@nokia.com>
---
arch/arm/plat-omap/iovmm.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/arm/plat-omap/iovmm.c b/arch/arm/plat-omap/iovmm.c
index 6fc52fc..004fd83 100644
--- a/arch/arm/plat-omap/iovmm.c
+++ b/arch/arm/plat-omap/iovmm.c
@@ -615,7 +615,7 @@ u32 iommu_vmap(struct iommu *obj, u32 da, const struct sg_table *sgt,
u32 flags)
{
size_t bytes;
- void *va;
+ void *va = NULL;
if (!obj || !obj->dev || !sgt)
return -EINVAL;
@@ -625,9 +625,11 @@ u32 iommu_vmap(struct iommu *obj, u32 da, const struct sg_table *sgt,
return -EINVAL;
bytes = PAGE_ALIGN(bytes);
- va = vmap_sg(sgt);
- if (IS_ERR(va))
- return PTR_ERR(va);
+ if (flags & IOVMF_MMIO) {
+ va = vmap_sg(sgt);
+ if (IS_ERR(va))
+ return PTR_ERR(va);
+ }
flags &= IOVMF_HW_MASK;
flags |= IOVMF_DISCONT;
--
1.5.6.5
* [PATCH 3/3] Add MAP_HUGETLB example
From: Eric B Munson @ 2009-08-26 10:44 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm
Cc: linux-man, mtk.manpages, randy.dunlap, Eric B Munson
In-Reply-To: <cover.1251282769.git.ebmunson@us.ibm.com>
This patch adds an example of how to use the MAP_HUGETLB flag to the
vm documentation directory and a reference to the example in
hugetlbpage.txt.
Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
---
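Note: the example program expects enough default-sized huge pages to
cover its 256 MB mapping. As a rough sketch (not part of the patch; the
helper names are hypothetical), the required page count can be derived
from the "Hugepagesize:" line of /proc/meminfo like this:

```c
#include <stdio.h>

/*
 * Parse a "Hugepagesize:" line as found in /proc/meminfo.
 * Returns the size in kB, or 0 if the line does not match.
 */
static unsigned long parse_hugepagesize_kb(const char *line)
{
    unsigned long kb = 0;

    if (sscanf(line, "Hugepagesize: %lu kB", &kb) != 1)
        return 0;
    return kb;
}

/*
 * Number of huge pages needed to back length_bytes, rounding up.
 * Returns 0 if the huge page size is unknown (0).
 */
static unsigned long huge_pages_needed(unsigned long length_bytes,
                                       unsigned long hugepage_kb)
{
    unsigned long hp_bytes = hugepage_kb * 1024UL;

    if (hp_bytes == 0)
        return 0;
    return (length_bytes + hp_bytes - 1) / hp_bytes;
}
```

With 2048 kB huge pages, a 256 MB mapping needs 128 pages; with
16384 kB pages it needs 16.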
Documentation/vm/00-INDEX | 2 +
Documentation/vm/hugetlbpage.txt | 14 ++++---
Documentation/vm/map_hugetlb.c | 77 ++++++++++++++++++++++++++++++++++++++
3 files changed, 87 insertions(+), 6 deletions(-)
create mode 100644 Documentation/vm/map_hugetlb.c
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 2f77ced..aabd973 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -20,3 +20,5 @@ slabinfo.c
- source code for a tool to get reports about slabs.
slub.txt
- a short users guide for SLUB.
+map_hugetlb.c
+ - an example program that uses the MAP_HUGETLB mmap flag.
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index ea8714f..6a8feab 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -146,12 +146,14 @@ Regular chown, chgrp, and chmod commands (with right permissions) could be
used to change the file attributes on hugetlbfs.
Also, it is important to note that no such mount command is required if the
-applications are going to use only shmat/shmget system calls. Users who
-wish to use hugetlb page via shared memory segment should be a member of
-a supplementary group and system admin needs to configure that gid into
-/proc/sys/vm/hugetlb_shm_group. It is possible for same or different
-applications to use any combination of mmaps and shm* calls, though the
-mount of filesystem will be required for using mmap calls.
+applications are going to use only shmat/shmget system calls or mmap with
+MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment
+should be a member of a supplementary group and system admin needs to
+configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for
+same or different applications to use any combination of mmaps and shm*
+calls, though the mount of filesystem will be required for using mmap calls
+without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
+map_hugetlb.c.
*******************************************************************
diff --git a/Documentation/vm/map_hugetlb.c b/Documentation/vm/map_hugetlb.c
new file mode 100644
index 0000000..e2bdae3
--- /dev/null
+++ b/Documentation/vm/map_hugetlb.c
@@ -0,0 +1,77 @@
+/*
+ * Example of using hugepage memory in a user application using the mmap
+ * system call with MAP_HUGETLB flag. Before running this program make
+ * sure the administrator has allocated enough default sized huge pages
+ * to cover the 256 MB allocation.
+ *
+ * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
+ * That means the addresses starting with 0x800000... will need to be
+ * specified. Specifying a fixed address is not required on ppc64, i386
+ * or x86_64.
+ */
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+
+#define LENGTH (256UL*1024*1024)
+#define PROTECTION (PROT_READ | PROT_WRITE)
+
+#ifndef MAP_HUGETLB
+#define MAP_HUGETLB 0x40
+#endif
+
+/* Only ia64 requires this */
+#ifdef __ia64__
+#define ADDR (void *)(0x8000000000000000UL)
+#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED)
+#else
+#define ADDR (void *)(0x0UL)
+#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
+#endif
+
+void check_bytes(char *addr)
+{
+ printf("First hex is %x\n", *((unsigned int *)addr));
+}
+
+void write_bytes(char *addr)
+{
+ unsigned long i;
+
+ for (i = 0; i < LENGTH; i++)
+ *(addr + i) = (char)i;
+}
+
+void read_bytes(char *addr)
+{
+ unsigned long i;
+
+ check_bytes(addr);
+ for (i = 0; i < LENGTH; i++)
+ if (*(addr + i) != (char)i) {
+ printf("Mismatch at %lu\n", i);
+ break;
+ }
+}
+
+int main(void)
+{
+ void *addr;
+
+ addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
+ if (addr == MAP_FAILED) {
+ perror("mmap");
+ exit(1);
+ }
+
+ printf("Returned address is %p\n", addr);
+ check_bytes(addr);
+ write_bytes(addr);
+ read_bytes(addr);
+
+ munmap(addr, LENGTH);
+
+ return 0;
+}
--
1.6.3.2
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 2/3] Add MAP_HUGETLB for mmaping pseudo-anonymous huge page regions
From: Eric B Munson @ 2009-08-26 10:44 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm
Cc: linux-man, mtk.manpages, randy.dunlap, Eric B Munson
In-Reply-To: <cover.1251282769.git.ebmunson@us.ibm.com>
This patch adds a flag for mmap that will be used to request a huge
page region that will look like anonymous memory to user space. This
is accomplished by using a file on the internal vfsmount. MAP_HUGETLB
is a modifier of MAP_ANONYMOUS and so must be specified with it. The
region will behave the same as a MAP_ANONYMOUS region using small pages.
Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
---
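Note: the mm/mmap.c hunk rounds the requested length up to the default
huge page size before creating the backing file. A userspace sketch of
that rounding, assuming a power-of-two page size (HPAGE_2M is only an
example value, not something the patch defines; the real size comes from
huge_page_size(&default_hstate)):

```c
#include <stddef.h>

/* Example huge page size for illustration only (2 MiB). */
#define HPAGE_2M ((size_t)2 * 1024 * 1024)

/* Round len up to a multiple of align; align must be a power of two. */
static size_t align_up(size_t len, size_t align)
{
    return (len + align - 1) & ~(align - 1);
}
```

So a one-byte MAP_HUGETLB request would be backed by one whole huge
page, and a request of 2 MiB plus one byte by two.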
include/asm-generic/mman-common.h | 1 +
include/linux/hugetlb.h | 7 +++++++
mm/mmap.c | 19 +++++++++++++++++++
3 files changed, 27 insertions(+), 0 deletions(-)
diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index 3b69ad3..12f5982 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -19,6 +19,7 @@
#define MAP_TYPE 0x0f /* Mask for type of mapping */
#define MAP_FIXED 0x10 /* Interpret addr exactly */
#define MAP_ANONYMOUS 0x20 /* don't use a file */
+#define MAP_HUGETLB 0x40 /* create a huge page mapping */
#define MS_ASYNC 1 /* sync memory asynchronously */
#define MS_INVALIDATE 2 /* invalidate the caches */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 38bb552..b0bc0fd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -110,12 +110,19 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
#endif /* !CONFIG_HUGETLB_PAGE */
+#define HUGETLB_ANON_FILE "anon_hugepage"
+
enum {
/*
* The file will be used as an shm file so shmfs accounting rules
* apply
*/
HUGETLB_SHMFS_INODE = 1,
+ /*
+ * The file is being created on the internal vfs mount and shmfs
+ * accounting rules do not apply
+ */
+ HUGETLB_ANONHUGE_INODE = 2,
};
#ifdef CONFIG_HUGETLBFS
diff --git a/mm/mmap.c b/mm/mmap.c
index 8101de4..9ca4f26 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -29,6 +29,7 @@
#include <linux/rmap.h>
#include <linux/mmu_notifier.h>
#include <linux/perf_counter.h>
+#include <linux/hugetlb.h>
#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -951,6 +952,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;
+ if (flags & MAP_HUGETLB) {
+ struct user_struct *user = NULL;
+ if (file)
+ return -EINVAL;
+
+ /*
+ * VM_NORESERVE is used because the reservations will be
+ * taken when vm_ops->mmap() is called
+ * A dummy user value is used because we are not locking
+ * memory so no accounting is necessary
+ */
+ len = ALIGN(len, huge_page_size(&default_hstate));
+ file = hugetlb_file_setup(HUGETLB_ANON_FILE, len, VM_NORESERVE,
+ &user, HUGETLB_ANONHUGE_INODE);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ }
+
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
--
1.6.3.2
* [PATCH 1/3] hugetlbfs: Allow the creation of files suitable for MAP_PRIVATE on the vfs internal mount
From: Eric B Munson @ 2009-08-26 10:44 UTC (permalink / raw)
To: linux-kernel, linux-mm, akpm
Cc: linux-man, mtk.manpages, randy.dunlap, Eric B Munson
In-Reply-To: <cover.1251282769.git.ebmunson@us.ibm.com>
There are two means of creating mappings backed by huge pages:
1. mmap() a file created on hugetlbfs
2. Use shm which creates a file on an internal mount which essentially
maps it MAP_SHARED
The internal mount is only used for shared mappings but there is very
little that stops it being used for private mappings. This patch extends
hugetlbfs_file_setup() to deal with the creation of files that will be
mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is
used in a subsequent patch to implement the MAP_HUGETLB mmap() flag.
Signed-off-by: Eric Munson <ebmunson@us.ibm.com>
---
fs/hugetlbfs/inode.c | 21 +++++++++++++++++----
include/linux/hugetlb.h | 12 ++++++++++--
ipc/shm.c | 2 +-
3 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cb88dac..5584d55 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -506,6 +506,13 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
INIT_LIST_HEAD(&inode->i_mapping->private_list);
info = HUGETLBFS_I(inode);
+ /*
+ * The policy is initialized here even if we are creating a
+ * private inode because initialization simply creates an
+ * empty rb tree and calls spin_lock_init(), later when we
+ * call mpol_free_shared_policy() it will just return because
+ * the rb tree will still be empty.
+ */
mpol_shared_policy_init(&info->policy, NULL);
switch (mode & S_IFMT) {
default:
@@ -930,13 +937,19 @@ static struct file_system_type hugetlbfs_fs_type = {
static struct vfsmount *hugetlbfs_vfsmount;
-static int can_do_hugetlb_shm(void)
+static int can_do_hugetlb_shm(int creat_flags)
{
- return capable(CAP_IPC_LOCK) || in_group_p(sysctl_hugetlb_shm_group);
+ if (creat_flags != HUGETLB_SHMFS_INODE)
+ return 0;
+ if (capable(CAP_IPC_LOCK))
+ return 1;
+ if (in_group_p(sysctl_hugetlb_shm_group))
+ return 1;
+ return 0;
}
struct file *hugetlb_file_setup(const char *name, size_t size, int acctflag,
- struct user_struct **user)
+ struct user_struct **user, int creat_flags)
{
int error = -ENOMEM;
struct file *file;
@@ -948,7 +961,7 @@ struct file *hugetlb_file_setup(const char *name, size_t size, int acctflag,
if (!hugetlbfs_vfsmount)
return ERR_PTR(-ENOENT);
- if (!can_do_hugetlb_shm()) {
+ if (!can_do_hugetlb_shm(creat_flags)) {
*user = current_user();
if (user_shm_lock(size, *user)) {
WARN_ONCE(1,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 5cbc620..38bb552 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -110,6 +110,14 @@ static inline void hugetlb_report_meminfo(struct seq_file *m)
#endif /* !CONFIG_HUGETLB_PAGE */
+enum {
+ /*
+ * The file will be used as an shm file so shmfs accounting rules
+ * apply
+ */
+ HUGETLB_SHMFS_INODE = 1,
+};
+
#ifdef CONFIG_HUGETLBFS
struct hugetlbfs_config {
uid_t uid;
@@ -148,7 +156,7 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
extern const struct file_operations hugetlbfs_file_operations;
extern struct vm_operations_struct hugetlb_vm_ops;
struct file *hugetlb_file_setup(const char *name, size_t size, int acct,
- struct user_struct **user);
+ struct user_struct **user, int creat_flags);
int hugetlb_get_quota(struct address_space *mapping, long delta);
void hugetlb_put_quota(struct address_space *mapping, long delta);
@@ -170,7 +178,7 @@ static inline void set_file_hugepages(struct file *file)
#define is_file_hugepages(file) 0
#define set_file_hugepages(file) BUG()
-#define hugetlb_file_setup(name,size,acct,user) ERR_PTR(-ENOSYS)
+#define hugetlb_file_setup(name,size,acct,user,creat) ERR_PTR(-ENOSYS)
#endif /* !CONFIG_HUGETLBFS */
diff --git a/ipc/shm.c b/ipc/shm.c
index 1bc4701..5ba4962 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -370,7 +370,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
if (shmflg & SHM_NORESERVE)
acctflag = VM_NORESERVE;
file = hugetlb_file_setup(name, size, acctflag,
- &shp->mlock_user);
+ &shp->mlock_user, HUGETLB_SHMFS_INODE);
} else {
/*
* Do not allow no accounting for OVERCOMMIT_NEVER, even
--
1.6.3.2