From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jamie Lokier <jamie@shareable.org>
Subject: Re: [PATCH] arch/tile: new multi-core architecture for Linux
Date: Thu, 27 May 2010 21:34:12 +0100
Message-ID: <20100527203412.GB31920@shareable.org>
References: <201005200543.o4K5hFRF006079@farm-0002.internal.tilera.com> <201005252345.15685.arnd@arndb.de> <4BFDC3D1.40708@tilera.com> <201005271041.30519.arnd@arndb.de>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path: <linux-arch-owner@vger.kernel.org>
Received: from mail2.shareable.org ([80.68.89.115]:53996 "EHLO
	mail2.shareable.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755520Ab0E0UeY (ORCPT
	<rfc822;linux-arch@vger.kernel.org>); Thu, 27 May 2010 16:34:24 -0400
Content-Disposition: inline
In-Reply-To: <201005271041.30519.arnd@arndb.de>
Sender: linux-arch-owner@vger.kernel.org
List-ID: <linux-arch.vger.kernel.org>
To: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@tilera.com>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, linux-arch@vger.kernel.org

Arnd Bergmann wrote:
> On Thursday 27 May 2010, Chris Metcalf wrote:
> > > Yes, that makes sense. You definitely want binary compatibility between
> > > 32 bit binaries from a native 32 bit system on TILE-Gx in the syscall
> > > interface.
> > >   
> > 
> > The thing is, the COMPAT layer on TILE-Gx is actually not providing
> > TILEPro compatibility, since the architectures are too different --
> > conceptually similar but with different opcode numbering, etc.  Instead
> > what it's doing is providing a 32-bit pointer ABI, to help porting
> > crufty old code (this is in fact the primary customer driver), or to
> > allow more compact representations of pointer-heavy data.
> 
> Ah, interesting. I don't think any architecture does it this way
> so far. IIRC, while alpha had some applications built in 32 bit
> mode in the early days, those were just using the 64 bit system
> calls directly.
> 
> Then again, that probably required some rather ugly hacks to get
> the libc working, so since we have the compat code in kernel now,
> your approach is probably much better.
> 
> Are you able to build 32 bit kernels for TILE-Gx as well? It's
> probably something you never really want to do for performance
> reasons, but I guess you could use that to verify that the
> ABI is really compatible.
> 
> > > compat_sys_sendfile will not be needed with the asm-generic/unistd.h definitions,
> > > but I think you will still need a compat_sys_sendfile64, to which the same
> > > applies as to compat_sys_sched_rr_get_interval.
> > >   
> > 
> > I think it's the other way around: compat_sys_sendfile64() is just
> > sys_sendfile64(), but compat_sys_sendfile() needs to exist since it has
> > to write a 32-bit pointer back to userspace.
> 
> Ah. I guess you're right about compat_sys_sendfile64 not being needed.
> Funny enough, parisc, powerpc, s390 and sparc all define it anyway, so
> it didn't occur to me that they don't actually need to.

They do need it.

For example, on Sparc, compat_sys_sendfile64 takes a 32-bit
compat_size_t argument, and calls sys_sendfile64 with a 64-bit size_t
argument.

I'll be very surprised if 32-bit tile is using 64-bit size_t already :-)

-- Jamie


> 
> What I meant about compat_sys_sendfile is that you only define it if
> the 32 bit ABI contains a reference to sys_sendfile in the first
> place. With asm-generic/unistd.h, 32 bit uses always uses the sys_sendfile64
> kernel interface, while for 64 bit the two are identical, so we take
> the regular sys_sendfile.
> 
> > >> +static int hardwall_ioctl(struct inode *inode, struct file *file,
> > >> +                      unsigned int a, unsigned long b)
> > >> +{
> > >> [...]
> > >>     
> > > The hardwall stuff looks like it is quite central to your architecture.
> > > Could you elaborate on what it does?
> > >   
> > It's not "central" but it is an important enabler for access to our
> > "user network".  This is a wormhole-routed mesh network (the UDN, or
> > user dynamic network) that connects all the cpus.  If a task affinitizes
> > itself to a single cpu (to avoid migration) and opens /dev/hardwall and
> > does an ioctl on it, it can associate the particular /dev/hardwall file
> > object with some non-overlapping subrectangle of the whole 8x8 chip (our
> > cpus are laid out as "tiles" in an 8x8 configuration).  It can then do
> > an "activate" ioctl to get access to that subrectangle of the UDN, from
> > that cpu.  Other threads in that process (or anyone who can share that
> > file object one way or another, e.g. fork or sendmsg) can then also do
> > an "activate" ioctl on that file object and also get access, and they
> > can then exchange messages with very low latency (register file to
> > register file in a handful of cycles) and high bandwidth (32 bits/cycle
> > or about 3GB/sec).
> >
> > The actual "hardwall" refers to the fact that cpus on the periphery of
> > the allocated subrectangle of cpus set up the router so that they will
> > get an interrupt if some cpu tries to send a message that would
> > terminate outside the set of allocated cpus.  Doing it this way means
> > several unrelated tasks could have separate message-passing arenas
> > (spatially dividing the chip) and whenever the last task holding a
> > reference to a hardwall file object exits, the OS can drain any messages
> > from the UDN and deallocate the subrectangle in question.
> >
> > > If it is as essential as it looks, I'd vote for promoting the interface
> > > from an ioctl based one to four real system calls (more if necessary).
> > >   
> > 
> > The notion of using a file descriptor as the "rights" object is pretty
> > central, so I think a character device will work out well.
> 
> ok, I see. Now you could easily do this with system calls as well:
> Instead of the initial ioctl that associates the file descriptor
> with a rectangle, you can have a syscall that creates a rectangle
> and a file descriptor (using anon_inode_getfd) associated with it,
> and returns the fd to user space. This is similar to what we
> do for other system call interfaces that operate on their own fds.
> 
> Another alternative might be to combine this with cpusets subsystem,
> which has a related functionality. I guess that would be the
> preferred way if you expect tile-gx to take over the world and
> have lots of applications written to it.
> For a niche product, the syscall or ioctl approach does seem
> simple enough, and it does not require other users of cpusets
> to learn about requirements of your rectangles.
> 
> > > Note that the procfs file format is part of your ABI, and this looks
> > > relatively hard to parse, which may introduce bugs.
> > > For per-process information, it would be better to have a simpler
> > > file in each /proc/<pid>/directory. Would that work for you?
> > >   
> > 
> > Well, the hardwalls aren't exactly per-process anyway, and we don't in
> > practice use the ASCII output for anything much, so it may not matter
> > that they're not too parseable.  I may just look into making them more
> > parsable when I convert it to a /dev interface and leave it at that.
> 
> On a chardev, a binary interface seems more appropriate than
> an text based one anyway, so you could add another ioctl for this.
> 
> > I'm planning to defer this in any case, since the UDN interface, though
> > a nice-to-have, obviously isn't needed to run any standard C code.  I'll
> > make that part of a follow-up patch.
> 
> ok
> 
> > > Note that we're about to remove the .ioctl file operation and
> > > replace it with .unlocked_ioctl everywhere.
> > >   
> > 
> > OK, for now I'll ensure that we are locking everything internally
> > correctly. I believe we are already anyway.
> 
> ok. Then please just use .unlocked_ioctl in new drivers.
> 
> > > [hugevmap] Not used anywhere apparently. Can you explain what this is good for?
> > > Maybe leave it out for now, until you merge the code that needs it.
> > > I don't see anything obviously wrong with the implementation though.
> > >   
> > 
> > I'll omit it; we haven't used it yet.  The intent was to provide
> > guaranteed huge pages for TLB purposes to kernel drivers.  Currently we
> > just start with huge pages where possible, and fragment them if necessary.
> 
> Ok. Do you use huge pages for backing the linear kernel mapping?
> Normally device drivers get huge pages for free in kmalloc and
> get_free_pages because all the memory is mapped using the largest
> page size anyway.
> 
> > > +EXPORT_SYMBOL(inb);
> > >   
> > > If you just remove these definitions, you get a link error for any
> > > driver that tries to use these, which is probably more helpful than
> > > the panic.
> > >
> > > OTOH, are you sure that you can't just map the PIO calls to mmio functions
> > > like readb plus some fixed offset? On most non-x86 architectures, the PIO
> > > area of the PCI bus is just mapped to a memory range somewhere.
> > >   
> > 
> > I'll try to remove them and see if anything falls over.  We don't have
> > any memory-mapped addresses in the 32-bit architecture, though that
> > changes with the 64-bit architecture, which introduces IO mappings.  For
> > PCI we actually have to do a hypervisor transaction for reads or writes.
> 
> Ok, then I assume that PIO would also be a hypervisor call, right?
> If you don't have MMIO on 32 bit, you might want to not define either
> PIO (inb, ...) no MMIO (readb, ...) calls there and disable
> CONFIG_HAVE_MMIO in Kconfig.
>  
> > >> +SEQ_PROC_ENTRY(interrupts)
> > >> +static int proc_tile_interrupts_show(struct seq_file *sf, void *v)
> > >> +{
> > >> [...]
> > >>     
> > > Can you merge this with /proc/interrupts?
> > >   
> > 
> > It turns out /proc/interrupts is formatted the wrong way round if you
> > have 64 processors :-)  You want one row per cpu, not one column per cpu!
> 
> Yes, interesting observation. I'm sure the Altix folks are suffering from
> this a lot.
>  
> > Also, there are things listed that are not strictly IRQs in the normal
> > sense (things like TLB flushes and syscalls) which are still good for
> > assessing where latency glitches might be coming from on a particular cpu.
> 
> That's fine, just look at what a current x86 kernel gives you (slightly
> cut):
>            CPU0       CPU1
>   0:   18764948     504980   IO-APIC-edge      timer
>   1:     228456       2572   IO-APIC-edge      i8042
>   9:    2632595      79544   IO-APIC-fasteoi   acpi
>  12:    1094468      43409   IO-APIC-edge      i8042
>  16:      82761       1455   IO-APIC-fasteoi   uhci_hcd:usb6, yenta, heci
>  28:     908865      85857   PCI-MSI-edge      ahci
>  29:       6421      11595   PCI-MSI-edge      eth0
> NMI:          0          0   Non-maskable interrupts
> LOC:    1987682    9057144   Local timer interrupts
> SPU:          0          0   Spurious interrupts
> CNT:          0          0   Performance counter interrupts
> PND:          0          0   Performance pending work
> RES:    3598785    3903513   Rescheduling interrupts
> CAL:       8848       5944   Function call interrupts
> TLB:      31467      18283   TLB shootdowns
> TRM:          0          0   Thermal event interrupts
> THR:          0          0   Threshold APIC interrupts
> MCE:          0          0   Machine check exceptions
> MCP:        354        346   Machine check polls
> ERR:          0
> MIS:          0
> 
> Lots of things in there that fit your category.
> 
> > > This seems to be read-only and coming from a kernel command
> > > line option, so I guess looking at /proc/cmdline would
> > > be a reasonable alternative.
> > >   
> > 
> > I always find that kind of painful, since you have to parse it exactly
> > as the kernel would to be sure you're getting it right; strstr() is only
> > a 99% solution.
> 
> How about making it a module_param then? You can still see it
> in /sys/modules/*/parameters then, even if the code is builtin,
> but it won't be in the sysctl name space any more.
>  
> > >> +SYSCALL_DEFINE3(raise_fpe, int, code, unsigned long, addr,
> > >> +            struct pt_regs *, regs)
> > >>     
> > > Does this need to be a system call? I thought we already had
> > > other architectures without floating point exceptions in hardware
> > > that don't need this.
> > >   
> > 
> > Hmm, I didn't know about that.  Any information would be appreciated.  I
> > guess you could synthesize something that looked like a signal purely in
> > user-space?  But how would debuggers trap it?  I'm not sure how it would
> > work without a system call.
> 
> I think the C99 standard allows you to not implement SIGFPE at all but
> instead rely on applications doing fetestexcept() etc.
> 
> > >> diff --git a/arch/tile/kernel/sys.c b/arch/tile/kernel/sys.c
> > >> [...]
> > >> +ssize_t sys32_readahead(int fd, u32 offset_lo, u32 offset_hi, u32 count)
> > >> +{
> > >> +    return sys_readahead(fd, ((loff_t)offset_hi << 32) | offset_lo, count);
> > >> +}
> > >> +
> > >>
> > >>     
> > > These seem to belong with the other similar functions in compat.c
> > >   
> > 
> > Except they're also used by the 32-bit platform where there is no compat
> > mode (the compat code DOES use them too, it's true).
> 
> I see. AFAICT, all other architectures don't need the wrapper in
> the 32 bit native case because they define the syscall calling
> conventions in libc such that they match what the kernel
> expects for a 64 bit argument (typically split in two subsequent
> argument slots). Would that work for you as well?
> 
> > > Just use the sys_mmap_pgoff system call directly, rather than
> > > defining your own wrappers. Since that syscall is newer than
> > > asm-generic/unistd.h, that file might need some changes,
> > > together with fixes to arch/score to make sure we don't break
> > > its ABI.
> > >   
> > 
> > It should be OK.  Their sys_mmap2() just tail-calls sys_mmap_pgoff()
> > anyway, so it should be possible to switch mmap2 in asm-generic/unistd.h
> > to be mmap_pgoff instead.  We'll need some user-space changes (our mmap2
> > is defined in 4KB units) but that's not hard.
> 
> Hmm, I forgot about the page size. Actually the definition of sys_mmap2
> is to use 4KB units on all architectures except ia64, independent
> of the real page size. Maybe it's better to keep using sys_mmap/sys_mmap2
> after all but then use only one of the two (sys_mmap on 64 bit, sys_mmap2
> on 32 bit and compat).
> 
> Either way should work though.
> 
> > > It seems that this file fits in the same category as the
> > > backtrace code. Maybe move both away from arch/tile/kernel into a
> > > different directory?
> > >   
> > 
> > I'll think about it.  These are both basically disassembly-related, so
> > maybe an arch/tile/disasm directory with the tile_desc stuff and the
> > backtracer?  I'm not sure it's really worth moving out of
> > arch/tile/kernel though.
> 
> Ok. If you leave them in the directory, just split them out into a separate
> patch on your next submission then.
> 
> 	Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe linux-arch" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html