wiki3912: E500mc_msgxfer

QMS3205#

Optimized Message Pass on e500mc#

This work may be incorporated into the 6.4.2 release.

The e500mc architecture implements the Power ISA v2.06 standard (http://www.power.org/resources/downloads/PowerISA_V2.06_PUBLIC.pdf). This new standard includes some extensions that are intended to facilitate copying data across address spaces. These extensions can be utilized in the Neutrino architecture to improve the performance of message passing.

Background#

Neutrino Message Passing#

In the Neutrino architecture, each process has a single address space that spans a portion of the possible range of virtual addresses, and which maps virtual addresses to the physical memory that the system has provided to the process. A process can only access memory through this address space. The kernel has a special range of virtual addresses through which it can access kernel data structures. The virtual addresses that the kernel uses for accessing its data structures and the virtual addresses used for a process address space do not overlap. The virtual addresses for different process address spaces do overlap.

Thus the kernel can activate a process's address space and access both kernel data and a single process's data simultaneously. However, since every process address space covers the same range of virtual addresses, two process address spaces cannot be activated simultaneously.

Message passing from one process to another involves copying data from one process address space to another. Without the ability to activate two address spaces simultaneously, this becomes an exercise in juggling data.

In its current implementation, a message pass from one Neutrino process to another can take one of two paths.

For small messages, it is fastest to avoid the difficulty of dealing with multiple address spaces by performing an extra data copy (the double-copy mechanism). Instead of copying the data directly from one address space to another, the kernel establishes one address space, copies the data into its own kernel data structures, establishes the second address space, and then copies the data back out.

For larger messages, this double-copy mechanism can be too expensive. Instead, the physical memory belonging to one process is mapped into the other process's address space (we call this the double-mapping mechanism). In this manner, the kernel can establish a single address space but copy the data from the physical memory belonging to one process into the physical memory belonging to another process. However, this mapping operation is moderately expensive, and it might need to be done many times for a single message-pass. Depending on the architecture, we can be restricted to mapping a small piece of memory at a time so it's likely that we'll have to set up multiple mappings over the course of sending a large message. Also depending on the architecture, we might need to perform cache flushes if we map a single piece of physical memory into multiple address spaces.

The double-mapping mechanism is superior to the double-copy mechanism if the amount of data that can be moved with each mapping is large enough that establishing the mapping is less expensive than doing an extra copy of the data. The complexity of all the factors means it's difficult to know exactly when the double-copy mechanism is faster, and in fact trying to determine which mechanism is best would itself be an expensive operation so the decision can't be made at runtime. Instead, Neutrino uses the arbitrary rule that small messages will use the double-copy mechanism while large messages will use the double-mapping mechanism. This works well in practice because of the nature of the message traffic in a typical system -- most small messages are quite small, while large messages tend to be quite large.
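The size-based rule above can be expressed as a trivial selector. This is a sketch, not the kernel's actual code; the 257-byte cutoff is taken from the figure quoted in the performance results later on this page, and the names are illustrative.

```c
#include <stddef.h>

enum xfer_mech { XFER_DOUBLE_COPY, XFER_DOUBLE_MAP };

/* Pick a message-pass mechanism by size alone: small messages take the
 * double-copy path, large ones the double-mapping path. */
static enum xfer_mech choose_mech(size_t msg_size)
{
    return (msg_size < 257) ? XFER_DOUBLE_COPY : XFER_DOUBLE_MAP;
}
```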

As message passing is such a fundamental part of the Neutrino architecture, any improvement to message passing efficiency is worth considering.

e500mc Address Space Architecture#

In the PPC architectures prior to e500mc, there are several pieces that come into play to establish an address space.

Each address space has a process ID, or PID, along with a page table. When an address space is activated, a PID is allocated for it if necessary; the CPU's PID register is set to the process's PID value, and the address of the page table is loaded into the active page table register (we use SPRG4 for this purpose on PPC booke architectures).

(Note that the e500 architecture uses a "TID" field ("TID" for "translation identity") in the MAS1 register to specify the process identification for a TLB entry, but uses a "PID" register to hold the ID of the current running process. In this overview we'll ignore this mess and just call everything a "PID".)

When the MMU is translating a virtual address into a physical address:

  1. The MMU looks through the TLB entries for one that matches all of: the virtual address being translated, the PID currently loaded in the PID register, and various context information stored in the MSR ("machine state register"). Since different address spaces use different PIDs, the TLB can hold entries for multiple address spaces simultaneously without confusing them.
  2. If the TLB contains no matching entry, the MMU generates a TLB-miss exception. The TLB-miss handler searches the active page table for a valid mapping for the given virtual address. If such a mapping is found, the TLB-miss handler loads it into a new TLB entry and the offending access is retried, presumably succeeding now that the new TLB entry has been added.
  3. If no matching TLB entry is found and no matching mapping exists in the active page table, the kernel generates a page fault, and the page fault handler reacts appropriately (placing the process in PageWait state while the page of memory is initialized or loaded in, or dropping a signal on the process if the accessed virtual address is invalid).
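The translation fallback path above can be sketched in C. This is an illustrative model only, not kernel code: the page table is modeled as a flat array indexed by page number, and the names (tlb_miss, PTE_VALID, active_page_table) are assumptions introduced for the sketch; a real handler would write MAS registers and execute tlbwe rather than return a physical address.

```c
#include <stddef.h>
#include <stdint.h>

#define PTE_VALID 0x1u

struct pte {
    uintptr_t phys;     /* physical page base */
    unsigned  flags;
};

/* stand-in for the active page table (real code loads its address from SPRG4) */
static struct pte *active_page_table;
static size_t active_pt_entries;

enum miss_result { MISS_LOADED, MISS_PAGE_FAULT };

/* Resolve a TLB miss: walk the active page table; on success a real handler
 * would load a new TLB entry, on failure it hands off to the fault path. */
static enum miss_result tlb_miss(uintptr_t vaddr, uintptr_t *phys_out)
{
    size_t pageno = vaddr >> 12;            /* assume 4K pages */

    if (pageno < active_pt_entries &&
        (active_page_table[pageno].flags & PTE_VALID)) {
        /* real handler: fill MAS registers, execute tlbwe, retry the access */
        *phys_out = active_page_table[pageno].phys | (vaddr & 0xfffu);
        return MISS_LOADED;
    }
    return MISS_PAGE_FAULT;   /* hand off to _exc_data_access etc. */
}
```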

The PowerISA v2.06 standard introduces some new registers and new instructions that facilitate access to different address spaces. The new instructions correspond to ordinary load and store instructions for reading or writing bytes, halfwords, words or doublewords, but instead of using the normal context established by the PID and MSR registers, they use context established in the new registers.

A new EPSC register is introduced to provide context for the new "external pid" load and store operations. The EPSC register includes an EPID ("external pid") field along with other bits corresponding to the process context bits in the MSR. When one of the new instructions is executed, the MMU matches TLB entries against the fields of the EPSC register instead of the normal PID register and the context bits from the MSR. If no match is found in the TLB, the resulting TLB-miss exception sets a bit in the exception syndrome register (ESR) to indicate that the fault happened while accessing the external address space.

In order to implement cross address-space copying the kernel needs to establish the primary address space normally, and must establish the second address space by setting the second address space's context in the EPSC register and noting the location of the second address space's page table. When a TLB-miss exception happens, the TLB-miss handler must check the ESR value to determine whether or not it needs to search the normal page table or the second address space's page table.

Once the second address space is established, the kernel can copy data between address spaces by reading or writing to the first address space using the normal instruction set, and by reading or writing the second address space using the new external load and store operations. New "copy_to" and "copy_from" functions would be implemented based on the optimized memcpy routine, but using the external PID load/store instructions where necessary.
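A copy_from() built on the external load instructions might look like the sketch below. This is an assumption-laden illustration, not the real routine: the real functions would be based on the optimized memcpy, alignment and tail handling are omitted, and the USE_EXTERNAL_PID guard and helper name are invented here. The lwepx instruction itself is from Power ISA v2.06; a portable fallback load is provided so the logic can be followed (and exercised) on any platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Load one word from the alternate (external) address space. On e500mc
 * this would use lwepx, which translates through the EPSC context rather
 * than the normal PID/MSR context; elsewhere we fall back to a plain load. */
static inline uint32_t load_external_word(const uint32_t *src)
{
#if defined(__PPC__) && defined(USE_EXTERNAL_PID)
    uint32_t v;
    __asm__ volatile("lwepx %0,0,%1" : "=r"(v) : "r"(src));
    return v;
#else
    return *src;                /* fallback: ordinary load */
#endif
}

/* Copy len bytes (assumed word-aligned and a word multiple, for brevity)
 * from the alternate address space into the primary address space. */
void copy_from(void *dst, const void *src, size_t len)
{
    uint32_t *d = dst;
    const uint32_t *s = src;

    while (len >= sizeof(uint32_t)) {
        *d++ = load_external_word(s++);
        len -= sizeof(uint32_t);
    }
}
```

A copy_to() would be the mirror image, using stwepx for the stores into the alternate space.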

In the event of a page fault, the page fault handler must note whether the fault happened in the primary or the external address spaces, and react appropriately.

High Level Design Points#

TLB-miss Handling#

The TLB-miss handler is necessarily slightly more complex when external PIDs are used. The existing fault handler simply loads the page table location from the SPRG4 register, while the new version would need to test the ESR register and load from either the SPRG4 register or another location, depending on whether the faulting instruction was a normal operation or an external PID operation. In the normal case, the handler would end up executing 3 extra instructions (loading the ESR into a general register, testing the appropriate bit field, and executing a conditional branch). Executing 3 extra instructions once is not a significant problem, but the TLB-miss handler is executed so frequently that we don't want to incur this extra expense when it is not necessary.

An idea that was proposed is that we create two TLB-miss handlers -- one with and one without the extra instructions for handling external PID accesses. Then we could install the more complex version while an alternate address space is established, and install the regular version otherwise. Installing a different TLB-miss handler (done using a mtspr instruction) on establishing or removing an alternate address space is inexpensive compared with three extra instructions on each TLB-miss exception.
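The two-handler idea reduces to a toggle at the point where the alternate address space is established or removed. The sketch below models it in C with the vector write stubbed out; the handler symbols, the install function, and the variable standing in for the vector register are all invented for illustration (real code would execute an mtspr to the appropriate interrupt vector register).

```c
/* Stand-ins for the two handler bodies: a fast path for normal operation,
 * and a heavier one that also tests the ESR for external-PID faults. */
static char tlb_miss_fast[1];
static char tlb_miss_epid[1];

static const void *installed_handler;   /* models the vector register */

static void install_tlb_miss_handler(const void *h)
{
    /* real code: mtspr to the TLB-miss vector register */
    installed_handler = h;
}

/* Called when an alternate address space is established (1) or removed (0):
 * swap in the handler variant appropriate to the new state. */
static void alternate_aspace_changed(int established)
{
    install_tlb_miss_handler(established ? tlb_miss_epid : tlb_miss_fast);
}
```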

Fault Handling#

One of the complex aspects of the current system that must be modified to deal with cross address space message passing is handling of faults. In the existing system, when a memory access fault happens it is always for the current address space. With cross address space message passing, a fault could happen for the alternate address space.

When a fault happens, it begins as a TLB-miss exception. The TLB-miss handler refers to the appropriate page table (as described above) to resolve the TLB miss. If there is no appropriate entry in the page table, the TLB-miss handler passes control to a different fault handler (e.g. instruction access or data access) that sets up some flags to indicate the nature of the fault and then passes control to a general fault mechanism (_exc_access).

To handle faults during cross address space message passing, we will introduce a new fault flag (VM_FAULT_ALTERNATE) that the _exc_data_access_booke routine will set when it handles a data access fault during a cross address space message pass. In order to determine if this flag is necessary, the routine will consult the ESR. The _exc_access routine will pass the flags through to the fault shim (vmm_fault_shim() in ker/ppc/init_cpu.c). The vmm_fault_shim() routine is responsible for constructing the fault_info structure, which includes the pointer to the faulting address space, and invoking the memory manager fault handler.

The current vmm_fault_shim() routine always specifies the currently active address space. This routine will be modified so that if the VM_FAULT_ALTERNATE flag is specified, the alternate address space will be indicated in the fault_info structure. To that end, a new field will be added to the cpu-specific portion of the address space structure (struct cpu_mm_aspace in memmgr/ppc/mm_internal.h) to hold a pointer to the process that owns the alternate address space required by the current thread.
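The flag-driven selection described above amounts to one conditional when the fault_info structure is built. The sketch below shows the shape of the change; the structure layouts, field names, and the flag's value are illustrative stand-ins, not the definitions from memmgr/ppc/mm_internal.h.

```c
#include <stddef.h>

#define VM_FAULT_ALTERNATE 0x100u   /* assumed flag value */

struct aspace { int id; };

/* cpu-specific portion of the address space structure, with the proposed
 * new field holding the owner of the alternate address space */
struct cpu_mm_aspace {
    struct aspace *alt_owner;
};

struct fault_info {
    struct aspace *as;          /* address space the fault applies to */
    unsigned       flags;
};

/* Models the vmm_fault_shim() change: record the alternate space's owner
 * when VM_FAULT_ALTERNATE is set, otherwise the active address space. */
static void fill_fault_info(struct fault_info *fi, unsigned flags,
                            struct aspace *active,
                            const struct cpu_mm_aspace *cpu_mm)
{
    fi->flags = flags;
    fi->as = (flags & VM_FAULT_ALTERNATE) ? cpu_mm->alt_owner : active;
}
```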

Note that work is underway that will remove the vmm_fault_shim() routine. Once that work is complete, the design described here will need to change slightly. Exact details remain to be seen, but it is likely that the fault_info structure will be created in the _exc_access routine; in that case, _exc_access will be modified to detect the VM_FAULT_ALTERNATE flag and use the alternate address space owner instead.

Once the alternate address space owner is identified in the fault_info structure, the memory manager fault handler will work without modification (it already correctly handles the case where the faulting thread is referencing a separate address space in order to handle proc threads manipulating user process data).

Alternate Address Space as a Process Characteristic#

We could associate an alternate address space with a primary address space, so that the alternate address space is programmed into the EPSC register and the alternate page table is noted whenever the primary address space is established.

The vmm_aspace() function would be modified so that it defines the contents of the EPSC register along with the PID register. If an address space has an alternate then the EPSC.EPID field will be set to the alternate's PID, otherwise it will be cleared. When the alternate address space is established or cleared, if the primary address space is active on any cores, those cores will need to have their EPSC registers updated through an IPI command.

The TLB-miss and page fault handlers would need to be updated as described above.

The message copy mechanism would simply consist of establishing the source's address space, assigning the destination's address space to be the source's alternate, and then using the copy_to() or copy_from() function to copy the data.

In the event that the operation is preempted, the vmm_aspace() function will take care of removing or re-establishing the alternate address space context whenever a different address space is established.

On the surface this implementation appears to suffer from a security hole -- while a thread of process A is doing a message pass to process B, a second thread in process A would be free to mess around with process B's data. However, since the external PID load/store instructions are privileged instructions, the second thread of process A cannot execute them directly and hence can't touch process B's data (unless process A is running with privilege, in which case there are other mechanisms already available that would allow process A to inspect or manipulate process B).

External Address Space Associated with an Operation#

Associating an alternate address space with a primary address space provides a nice generic mechanism that will allow data to be copied between address spaces whenever necessary, with very few changes required (primarily in vmm_aspace) to support the alternate address space. However, it is unlikely that we will need a generic mechanism for moving data between address spaces -- beyond message pass there seems to be little utility to this functionality. Thus even the slight extra overhead to context switching from the necessary changes to vmm_aspace seems to be of questionable value.

Instead, we can establish an alternate address space only for the duration of the message pass operation. In this manner we would avoid the extra gear in vmm_aspace.

The message pass operation would establish the alternate address space at the beginning of the copy. The system would need to clear the alternate address space in the event that the message copy is preempted or a fault occurs.
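The operation-scoped lifetime can be sketched as follows. All names here are illustrative, memcpy stands in for copy_to()/copy_from(), and the EPSC programming is reduced to a comment; the point is only that the alternate context is installed at the start of the copy and torn down on completion (and would likewise be torn down by xfer_restart() on preemption or by the xfer fault path).

```c
#include <stddef.h>
#include <string.h>

struct aspace { int pid; };

static struct aspace *alt_aspace;    /* per-cpu in the real design */

static void establish_alternate(struct aspace *as)
{
    alt_aspace = as;                 /* real code: also program EPSC, note
                                        the alternate page table location */
}

static void remove_alternate(void)
{
    alt_aspace = NULL;               /* real code: also clear EPSC */
}

/* Copy one message into the destination address space. */
static int xfer_msg(void *dst, const void *src, size_t len,
                    struct aspace *dst_as)
{
    establish_alternate(dst_as);
    memcpy(dst, src, len);           /* real code: copy_to() using stwepx */
    remove_alternate();
    return 0;
}
```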

In the event of preemption, the necessary mechanism is already present in the form of the xfer_restart() function that the kernel already calls when preempting a message pass operation.

In the event that a fault occurs due to a bad address, the xfer fault handlers cause the message copy to terminate and return from xferiov with a non-zero result code. Thus in the same place where we invoke XFER_FAULTRET() we can deinstall the alternate address space.

The TLB-miss and page fault handlers would still need to be modified as described above. However we would not need to change vmm_aspace() or introduce an IPI mechanism to establish or clear the alternate address space on other cores.

The actual message copy would be the same -- establish the primary and alternate address spaces and then use copy_to() or copy_from() to copy the data.

Asynchronous Message Passing#

The asynchronous message passing mechanism shares code with normal synchronous message passing, and so will inherit the benefits of this activity with very little effort. In ker/nano_xfer.c, the rcvmsg() function (used for asynchronous receive) will be modified in the same manner as synchronous message passing -- establish the alternate address space and use the appropriate copy_to() function.

Expected Performance#

Establishing the alternate address space will be quite quick -- assigning a couple of registers and global values. Once the alternate address space is established, the message copy should be almost as fast as a simple copy within a single address space. There are three factors that will slow it slightly relative to copying data within an address space:

  • The TLB-miss handler will be slightly less efficient for the alternate address space than for the primary address space, as we will need to retrieve the alternate page table location from a per-cpu table. The primary address space page table location is stored in SPRG4, but we do not have any SPRG registers available for the alternate page table location. Retrieving the address from a per-cpu table will involve more instructions than simply reading SPRG4 (reading the CPU number from SPRG3, indexing into a per-cpu table of alternate page tables, and reading the page table entry).
  • The new instructions available for cross address space message passing use the indexed address form (i.e. they allow two registers to specify the source or destination address) while the existing message copy code uses a register plus fixed offset. This allows the message copy code to load or store 16 bytes of data in 5 instructions (e.g. lwz %r9,0(%r4); lwz %r10,4(%r4); lwz %r11,8(%r4); lwz %r12,12(%r4); addi %r4,%r4,16). Since this addressing mode is not available with the cross address space instructions, the new code will require 8 instructions for the same 16 byte transfer (e.g. lwepx %r9,%r0,%r4; addi %r4,%r4,4; lwepx %r10,%r0,%r4; addi %r4,%r4,4; lwepx %r11,%r0,%r4; addi %r4,%r4,4; lwepx %r12,%r0,%r4; addi %r4,%r4,4). These extra instructions will make the cross address space copy slightly more expensive.
  • There might be additional TLB pressure for certain message passes, since there will be TLBs for different address spaces loaded simultaneously. During a copy within an address space we could get by with a single TLB if it covered both the source and destination addresses, while a cross address space copy will always require separate TLBs for source and destination. This factor is not expected to be significant.

Actual performance will be tested and compared with the existing double-copy mechanism, for different message sizes and different mixes of IOV numbers and alignments.




Unit Testing#

As the primary focus of this feature is improved performance, unit testing will include a benchmarking exercise, measuring message-pass throughput for a large variety of message sizes. These results will be compared with the same tests run without cross address space message passing. Successful testing requires improved performance for message passes between processes for some message sizes, and no degradation for the remaining message sizes or for message passes between threads of one process. The results of this testing will be recorded here when complete.

In addition to the benchmarking exercise, existing regress test cases will be used to test correct behaviour of message passing under a large variety of circumstances:

  • wh1_message and wh2_message : exercise fault handling on writing to or reading from different messaging buffers.
  • wh1_msgsend through wh7_msgsend: verify message contents for various message sizes, buffer sizes, alignments, IOV usage, and small and large page boundaries.
  • bk1_asyncmsg_putget and bk1_asyncmsg_multput: verify correct behaviour of async messaging. Note that the existing test cases don't verify message contents. These test cases will be updated to verify message contents.

All testcases will be run on at least three platforms: a PPC e500mc system, a PPC e500v2 system and a non-PPC system.

Unit Test Results#

Performance Results#

Performance tests were performed using the msgpass benchmark test (http://svn.ott.qnx.com/view/qa/trunk/benchmarks/simple/msgpass.c?view=markup). Testing was performed using a variety of different message sizes, using the command "msgpass -s<msgsize>" (for single buffer) or "msgpass -s<msgsize/4>,<msgsize/4>,<msgsize/4>,<msgsize/4>" (for 4-buffer passes) for each given message size.

For message sizes smaller than 257 bytes, the new message pass mechanism (here labeled the "xaspace" mechanism) is compared with both the double-copy mechanism and the double-map mechanism (note that it was necessary to perform a minor hack to the kernel to get the new mechanism or the double-map mechanism to operate with these messages, as the kernel will normally use the double-copy mechanism for any message size smaller than 257 bytes).

For message sizes larger than 256 bytes, the new message pass mechanism is compared with the double-map mechanism only (the double-copy mechanism is not currently used for these message sizes).

All message sizes are in bytes; all times are in microseconds. Tests were run on an early prototype P4080DS board. It appears that the clock frequency and timer interrupt programming were not correct, so the reported times are not accurate against a real clock; however, since all tests were run in the same environment, they serve to compare the relative performance of the different mechanisms.

Results using procnto-booke_g (note: not SMP)#

msgpass options                message size  number of buffers  double-copy  double-map  xaspace
-s16                                     16                  1           11          23       12
-s32                                     32                  1           11          23       12
-s64                                     64                  1           11          23       12
-s128                                   128                  1           11          23       12
-s255                                   255                  1           11          23       12
-s256                                   256                  1           11          23       12
-s64,64,64,64                           256                  4           11          24       12
-s257                                   257                  1            -          23       12
-s1000                                 1000                  1            -          24       13
-s4000                                 4000                  1            -          26       14
-s16000                               16000                  1            -          35       24
-s1000000                           1000000                  1            -        3300     3200
-s250,250,250,250                      1000                  4            -          24       13
-s1000,1000,1000,1000                  4000                  4            -          26       14
-s4000,4000,4000,4000                 16000                  4            -          35       24
-s250000,250000,250000,250000       1000000                  4            -        3300     3200

Results using procnto-booke-smp_g with single core (note: not SMP)#

msgpass options                message size  number of buffers  double-copy  double-map  xaspace
-s16                                     16                  1           14          26       15
-s32                                     32                  1           14          26       15
-s64                                     64                  1           14          26       15
-s128                                   128                  1           14          26       15
-s255                                   255                  1           14          26       15
-s256                                   256                  1           14          26       15
-s64,64,64,64                           256                  4           14          27       15
-s257                                   257                  1            -          26       15
-s1000                                 1000                  1            -          27       15
-s4000                                 4000                  1            -          29       17
-s16000                               16000                  1            -          40       27
-s1000000                           1000000                  1            -        3400     3200
-s250,250,250,250                      1000                  4            -          27       16
-s1000,1000,1000,1000                  4000                  4            -          29       17
-s4000,4000,4000,4000                 16000                  4            -          40       27
-s250000,250000,250000,250000       1000000                  4            -        3400     3200

Analysis#

It is clear from the data that up to 256 bytes, the actual cost of copying the data is insignificant relative to the overall cost of the message pass. The time required does not change significantly with the message size. It is worth noting that the new message pass mechanism adds almost no overhead to the message pass call, and so is very close to being as efficient as the current double-copy mechanism.

In comparison to the double-map mechanism, the new mechanism is significantly faster than the old when the amount of data copied with each map is small (that is, the overhead of the mapping operation is significant relative to the cost of copying the data), and is no worse than the double-map mechanism even when the message size is so large that the cost of the copy overwhelms the overhead of establishing the mapping.

On this test system, breaking a message up into separate buffers using multiple IOVs has very little impact.




Design and Code Reviews#

Design review held November 3 2009, attended by Chris Hobbs, Neil Schellenberger, Shiv Nagarajan, Brian Stecher, David Sarrazin, Steve Bergwerff, Attilla Danko, Peter Luscher, Adrian Mardare.

The design was presented and discussed, and a number of questions were answered. At the end, a single action item resulted: determine if the separate code path used by this design through the message transfer mechanism is compatible with the separate path used for Altivec support, and whether they can be merged to reduce code complexity or conditional compilation.

In nano_xfer.c, the xfer_cpy() function is conditionally compiled one way for 600 and 900 variants, and a separate way for other variants. Within the 600/900 version there is a runtime check to see if the Altivec registers are supported, while in the other version there is a compile-time check for whether or not the version is compiled for booke. In both cases, these logic paths are used to select a particular copy function. Merging these two versions would be possible but the result would not be particularly clean, as the Altivec copy function (_xfer_cpy_vmx()) has slightly different parameters than other copy functions. For that reason we will continue to have two xfer_cpy functions, selected at compile time, and the booke version will be modified to support this feature.

Code review was completed in the following discussion thread: http://community.qnx.com/sf/go/projects.core_os/discussion.osrev.topc10857

This feature has no implications for the Safety Manual.