From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jamie Lokier Subject: Re: [PATCH] arch/tile: new multi-core architecture for Linux Date: Thu, 27 May 2010 21:34:12 +0100 Message-ID: <20100527203412.GB31920@shareable.org> References: <201005200543.o4K5hFRF006079@farm-0002.internal.tilera.com> <201005252345.15685.arnd@arndb.de> <4BFDC3D1.40708@tilera.com> <201005271041.30519.arnd@arndb.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Return-path: Received: from mail2.shareable.org ([80.68.89.115]:53996 "EHLO mail2.shareable.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755520Ab0E0UeY (ORCPT ); Thu, 27 May 2010 16:34:24 -0400 Content-Disposition: inline In-Reply-To: <201005271041.30519.arnd@arndb.de> Sender: linux-arch-owner@vger.kernel.org List-ID: To: Arnd Bergmann Cc: Chris Metcalf , Linux Kernel Mailing List , Linus Torvalds , linux-arch@vger.kernel.org Arnd Bergmann wrote: > On Thursday 27 May 2010, Chris Metcalf wrote: > > > Yes, that makes sense. You definitely want binary compatibility between > > > 32 bit binaries from a native 32 bit system on TILE-Gx in the syscall > > > interface. > > > > > > > The thing is, the COMPAT layer on TILE-Gx is actually not providing > > TILEPro compatibility, since the architectures are too different -- > > conceptually similar but with different opcode numbering, etc. Instead > > what it's doing is providing a 32-bit pointer ABI, to help porting > > crufty old code (this is in fact the primary customer driver), or to > > allow more compact representations of pointer-heavy data. > > Ah, interesting. I don't think any architecture does it this way > so far. IIRC, while alpha had some applications built in 32 bit > mode in the early days, those were just using the 64 bit system > calls directly. > > Then again, that probably required some rather ugly hacks to get > the libc working, so since we have the compat code in kernel now, > your approach is probably much better. > > Are you able to build 32 bit kernels for TILE-Gx as well? It's > probably something you never really want to do for performance > reasons, but I guess you could use that to verify that the > ABI is really compatible. > > > > compat_sys_sendfile will not be needed with the asm-generic/unistd.h definitions, > > > but I think you will still need a compat_sys_sendfile64, to which the same > > > applies as to compat_sys_sched_rr_get_interval. > > > > > > > I think it's the other way around: compat_sys_sendfile64() is just > > sys_sendfile64(), but compat_sys_sendfile() needs to exist since it has > > to write a 32-bit pointer back to userspace. > > Ah. I guess you're right about compat_sys_sendfile64 not being needed. > Funny enough, parisc, powerpc, s390 and sparc all define it anyway, so > it didn't occur to me that they don't actually need to. They do need it. For example, on Sparc, compat_sys_sendfile64 takes a 32-bit compat_size_t argument, and calls sys_sendfile64 with a 64-bit size_t argument. I'll be very surprised if 32-bit tile is using 64-bit size_t already :-) -- Jamie > > What I meant about compat_sys_sendfile is that you only define it if > the 32 bit ABI contains a reference to sys_sendfile in the first > place. With asm-generic/unistd.h, 32 bit uses always uses the sys_sendfile64 > kernel interface, while for 64 bit the two are identical, so we take > the regular sys_sendfile. > > > >> +static int hardwall_ioctl(struct inode *inode, struct file *file, > > >> + unsigned int a, unsigned long b) > > >> +{ > > >> [...] > > >> > > > The hardwall stuff looks like it is quite central to your architecture. > > > Could you elaborate on what it does? > > > > > It's not "central" but it is an important enabler for access to our > > "user network". This is a wormhole-routed mesh network (the UDN, or > > user dynamic network) that connects all the cpus. If a task affinitizes > > itself to a single cpu (to avoid migration) and opens /dev/hardwall and > > does an ioctl on it, it can associate the particular /dev/hardwall file > > object with some non-overlapping subrectangle of the whole 8x8 chip (our > > cpus are laid out as "tiles" in an 8x8 configuration). It can then do > > an "activate" ioctl to get access to that subrectangle of the UDN, from > > that cpu. Other threads in that process (or anyone who can share that > > file object one way or another, e.g. fork or sendmsg) can then also do > > an "activate" ioctl on that file object and also get access, and they > > can then exchange messages with very low latency (register file to > > register file in a handful of cycles) and high bandwidth (32 bits/cycle > > or about 3GB/sec). > > > > The actual "hardwall" refers to the fact that cpus on the periphery of > > the allocated subrectangle of cpus set up the router so that they will > > get an interrupt if some cpu tries to send a message that would > > terminate outside the set of allocated cpus. Doing it this way means > > several unrelated tasks could have separate message-passing arenas > > (spatially dividing the chip) and whenever the last task holding a > > reference to a hardwall file object exits, the OS can drain any messages > > from the UDN and deallocate the subrectangle in question. > > > > > If it is as essential as it looks, I'd vote for promoting the interface > > > from an ioctl based one to four real system calls (more if necessary). > > > > > > > The notion of using a file descriptor as the "rights" object is pretty > > central, so I think a character device will work out well. > > ok, I see. Now you could easily do this with system calls as well: > Instead of the initial ioctl that associates the file descriptor > with a rectangle, you can have a syscall that creates a rectangle > and a file descriptor (using anon_inode_getfd) associated with it, > and returns the fd to user space. This is similar to what we > do for other system call interfaces that operate on their own fds. > > Another alternative might be to combine this with cpusets subsystem, > which has a related functionality. I guess that would be the > preferred way if you expect tile-gx to take over the world and > have lots of applications written to it. > For a niche product, the syscall or ioctl approach does seem > simple enough, and it does not require other users of cpusets > to learn about requirements of your rectangles. > > > > Note that the procfs file format is part of your ABI, and this looks > > > relatively hard to parse, which may introduce bugs. > > > For per-process information, it would be better to have a simpler > > > file in each /proc//directory. Would that work for you? > > > > > > > Well, the hardwalls aren't exactly per-process anyway, and we don't in > > practice use the ASCII output for anything much, so it may not matter > > that they're not too parseable. I may just look into making them more > > parsable when I convert it to a /dev interface and leave it at that. > > On a chardev, a binary interface seems more appropriate than > an text based one anyway, so you could add another ioctl for this. > > > I'm planning to defer this in any case, since the UDN interface, though > > a nice-to-have, obviously isn't needed to run any standard C code. I'll > > make that part of a follow-up patch. > > ok > > > > Note that we're about to remove the .ioctl file operation and > > > replace it with .unlocked_ioctl everywhere. > > > > > > > OK, for now I'll ensure that we are locking everything internally > > correctly. I believe we are already anyway. > > ok. Then please just use .unlocked_ioctl in new drivers. > > > > [hugevmap] Not used anywhere apparently. Can you explain what this is good for? > > > Maybe leave it out for now, until you merge the code that needs it. > > > I don't see anything obviously wrong with the implementation though. > > > > > > > I'll omit it; we haven't used it yet. The intent was to provide > > guaranteed huge pages for TLB purposes to kernel drivers. Currently we > > just start with huge pages where possible, and fragment them if necessary. > > Ok. Do you use huge pages for backing the linear kernel mapping? > Normally device drivers get huge pages for free in kmalloc and > get_free_pages because all the memory is mapped using the largest > page size anyway. > > > > +EXPORT_SYMBOL(inb); > > > > > > If you just remove these definitions, you get a link error for any > > > driver that tries to use these, which is probably more helpful than > > > the panic. > > > > > > OTOH, are you sure that you can't just map the PIO calls to mmio functions > > > like readb plus some fixed offset? On most non-x86 architectures, the PIO > > > area of the PCI bus is just mapped to a memory range somewhere. > > > > > > > I'll try to remove them and see if anything falls over. We don't have > > any memory-mapped addresses in the 32-bit architecture, though that > > changes with the 64-bit architecture, which introduces IO mappings. For > > PCI we actually have to do a hypervisor transaction for reads or writes. > > Ok, then I assume that PIO would also be a hypervisor call, right? > If you don't have MMIO on 32 bit, you might want to not define either > PIO (inb, ...) no MMIO (readb, ...) calls there and disable > CONFIG_HAVE_MMIO in Kconfig. > > > >> +SEQ_PROC_ENTRY(interrupts) > > >> +static int proc_tile_interrupts_show(struct seq_file *sf, void *v) > > >> +{ > > >> [...] > > >> > > > Can you merge this with /proc/interrupts? > > > > > > > It turns out /proc/interrupts is formatted the wrong way round if you > > have 64 processors :-) You want one row per cpu, not one column per cpu! > > Yes, interesting observation. I'm sure the Altix folks are suffering from > this a lot. > > > Also, there are things listed that are not strictly IRQs in the normal > > sense (things like TLB flushes and syscalls) which are still good for > > assessing where latency glitches might be coming from on a particular cpu. > > That's fine, just look at what a current x86 kernel gives you (slightly > cut): > CPU0 CPU1 > 0: 18764948 504980 IO-APIC-edge timer > 1: 228456 2572 IO-APIC-edge i8042 > 9: 2632595 79544 IO-APIC-fasteoi acpi > 12: 1094468 43409 IO-APIC-edge i8042 > 16: 82761 1455 IO-APIC-fasteoi uhci_hcd:usb6, yenta, heci > 28: 908865 85857 PCI-MSI-edge ahci > 29: 6421 11595 PCI-MSI-edge eth0 > NMI: 0 0 Non-maskable interrupts > LOC: 1987682 9057144 Local timer interrupts > SPU: 0 0 Spurious interrupts > CNT: 0 0 Performance counter interrupts > PND: 0 0 Performance pending work > RES: 3598785 3903513 Rescheduling interrupts > CAL: 8848 5944 Function call interrupts > TLB: 31467 18283 TLB shootdowns > TRM: 0 0 Thermal event interrupts > THR: 0 0 Threshold APIC interrupts > MCE: 0 0 Machine check exceptions > MCP: 354 346 Machine check polls > ERR: 0 > MIS: 0 > > Lots of things in there that fit your category. > > > > This seems to be read-only and coming from a kernel command > > > line option, so I guess looking at /proc/cmdline would > > > be a reasonable alternative. > > > > > > > I always find that kind of painful, since you have to parse it exactly > > as the kernel would to be sure you're getting it right; strstr() is only > > a 99% solution. > > How about making it a module_param then? You can still see it > in /sys/modules/*/parameters then, even if the code is builtin, > but it won't be in the sysctl name space any more. > > > >> +SYSCALL_DEFINE3(raise_fpe, int, code, unsigned long, addr, > > >> + struct pt_regs *, regs) > > >> > > > Does this need to be a system call? I thought we already had > > > other architectures without floating point exceptions in hardware > > > that don't need this. > > > > > > > Hmm, I didn't know about that. Any information would be appreciated. I > > guess you could synthesize something that looked like a signal purely in > > user-space? But how would debuggers trap it? I'm not sure how it would > > work without a system call. > > I think the C99 standard allows you to not implement SIGFPE at all but > instead rely on applications doing fetestexcept() etc. > > > >> diff --git a/arch/tile/kernel/sys.c b/arch/tile/kernel/sys.c > > >> [...] > > >> +ssize_t sys32_readahead(int fd, u32 offset_lo, u32 offset_hi, u32 count) > > >> +{ > > >> + return sys_readahead(fd, ((loff_t)offset_hi << 32) | offset_lo, count); > > >> +} > > >> + > > >> > > >> > > > These seem to belong with the other similar functions in compat.c > > > > > > > Except they're also used by the 32-bit platform where there is no compat > > mode (the compat code DOES use them too, it's true). > > I see. AFAICT, all other architectures don't need the wrapper in > the 32 bit native case because they define the syscall calling > conventions in libc such that they match what the kernel > expects for a 64 bit argument (typically split in two subsequent > argument slots). Would that work for you as well? > > > > Just use the sys_mmap_pgoff system call directly, rather than > > > defining your own wrappers. Since that syscall is newer than > > > asm-generic/unistd.h, that file might need some changes, > > > together with fixes to arch/score to make sure we don't break > > > its ABI. > > > > > > > It should be OK. Their sys_mmap2() just tail-calls sys_mmap_pgoff() > > anyway, so it should be possible to switch mmap2 in asm-generic/unistd.h > > to be mmap_pgoff instead. We'll need some user-space changes (our mmap2 > > is defined in 4KB units) but that's not hard. > > Hmm, I forgot about the page size. Actually the definition of sys_mmap2 > is to use 4KB units on all architectures except ia64, independent > of the real page size. Maybe it's better to keep using sys_mmap/sys_mmap2 > after all but then use only one of the two (sys_mmap on 64 bit, sys_mmap2 > on 32 bit and compat). > > Either way should work though. > > > > It seems that this file fits in the same category as the > > > backtrace code. Maybe move both away from arch/tile/kernel into a > > > different directory? > > > > > > > I'll think about it. These are both basically disassembly-related, so > > maybe an arch/tile/disasm directory with the tile_desc stuff and the > > backtracer? I'm not sure it's really worth moving out of > > arch/tile/kernel though. > > Ok. If you leave them in the directory, just split them out into a separate > patch on your next submission then. > > Arnd > -- > To unsubscribe from this list: send the line "unsubscribe linux-arch" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html