nCube systems

Documentation on the nCube parallel hypercube supercomputer

4. THE PROCESSOR

4.1 Introduction

The processor array is made up of 2^N nodes where N is 6, 7, 8, 9 or 10. Each processing node (FIG. 4) consists of a general purpose 32 bit processor (including 32 and 64 bit floating point instructions), 128K bytes of ECC memory and 11 communication channels to support the hypercube interconnection scheme and the 8 system I/O channels.

4.2 Architecture Overview

4.2.1 Data Representation

The processor recognizes two main classes of data: integers and reals. Integers are represented in standard 2’s complement form and come in three types: byte (B-8 bits), halfword (H-16 bits) and word (W-32 bits). There are two types of reals. The 32 bit format, called real (R), has an 8 bit exponent and 24 bits of significance. The longreal (L) format is 64 bits with an 11 bit exponent and 53 bits of significance. The longreal format is used for computations that need high accuracy and for intermediate computations with real variables when the computation is particularly sensitive to roundoff error. Both of these formats conform to the IEEE Binary Floating Point Standard (P754).

In addition to the various data formats, the processor recognizes and manipulates addresses. Addresses are simply 32 bit unsigned values that point to individual bytes in a linear address space.

4.2.2 Registers, Interrupts and Communication

The processor’s instructions operate on data in main memory (as described above) or on data in 32 bit registers. The processor contains three types of registers: the general registers, the processor registers and the communication control registers. The 16 general registers are 32 bits long and are used for both operands and addresses. Since they are general they can be used interchangeably in all operations and addressing modes.

The processor registers are special purpose and can only be written by the Load Processor Register (LDPR) instruction and read by the Store Processor Register (STPR) instruction. The exact formats and detailed descriptions of these registers are given in section 4.4.3. The processor registers are shown in FIGS. 7, 9A and 9B and include:

Processor registers 6 through 12 are used to signal “ready” and error conditions for the I/O channels.

The I/O ports on the processor are unidirectional Direct Memory Access (DMA) channels and each channel has two 32 bit write only registers: an address register for the buffer location and a count register indicating the number of bytes left to send or receive. Communication is performed by setting the registers of the desired channel to the appropriate address and data length and then the DMA channel takes over and communicates a message without processor intervention. Interrupts can be used to signal when a channel is available (i.e. when the count reaches zero the channel is “ready”). A separate interrupt vector is provided to indicate to a receiver that an error occurred during the data transmission.

In addition to communication synchronization and error reporting the processor uses vectored interrupts for:

  1. hardware errors (e.g. multibit memory errors)
  2. program exceptions (e.g. real overflow)
  3. software facilities (e.g. trace, timeout)

When an interrupt occurs the current program status (PS) and program counter (PC) are pushed on the stack. Then PS and PC are loaded with new values from the appropriate entry (indexed by the interrupt number) in the interrupt vector table in low memory.

4.2.3 Instruction Formats and Addressing Modes

An instruction consists of an operation code followed by zero, one or two data references:

Opcode | Reference 1 | Reference 2
Low address                High address

All instruction operation codes (opcodes) in the processor are one byte long. The first four bits indicate the operation and number of operands (e.g. ADD: 2 operands, BRANCH: 1 operand) while the other four bits denote the operand size and type (e.g. Halfword (integer), Real (floating point)). This symmetry makes an opcode map easy to read and code generation easier for a compiler.

All of the standard instructions are available for each data type including arithmetic, logical, conversion, comparison, branch, call and trap instructions. Instructions can be preceded by a REPEAT prefix that causes them to be executed repeatedly until a termination condition is satisfied. This is a very powerful facility for vector and string operations. Repeats can also be used with both branches and calls in order to execute a block of code repeatedly (i.e. a REPEAT BRANCH is equivalent to a loop instruction). For future extension each operand type has a reserved “escape” code.

A few instructions have no operands (e.g. BREAKPOINT) and some have only one (e.g. CALL) but most have two address fields. All address fields begin with a one byte mode selector. For all modes involving the general registers the first four bits indicate the mode and the remaining four determine which register to use. If there is an offset indicated it follows the mode selector. Some of the modes provided are literal, immediate, direct and indirect with no registers involved; and register direct, register indirect with and without offset, autoincrement and autodecrement and offset addressing with both the program counter (PC) and the stack pointer (SP). As with instructions there is a reserved “escape” code defined for the mode selector field.
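
The one-byte opcode and mode selector layouts described above can be sketched as a pair of nibble splits. This is a minimal Python sketch, not the processor's actual encoding table: it assumes that the "first four bits" are the high-order nibble of the byte, and the function names and the sample byte values are hypothetical.

```python
from typing import Tuple


def split_opcode(opcode: int) -> Tuple[int, int]:
    """Split a one-byte opcode into (operation, operand_type) nibbles.

    Assumption: the "first four bits" are the high-order nibble.
    """
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in one byte")
    return opcode >> 4, opcode & 0x0F


def split_mode_selector(selector: int) -> Tuple[int, int]:
    """Split a mode selector byte into (mode, register) for register modes."""
    if not 0 <= selector <= 0xFF:
        raise ValueError("mode selector must fit in one byte")
    return selector >> 4, selector & 0x0F


# Hypothetical byte values, chosen only to show the split.
operation, operand_type = split_opcode(0x3A)
mode, register = split_mode_selector(0x7F)
print(operation, operand_type, mode, register)
```

Because the register number occupies a full nibble, any of the 16 general registers can be named by a single mode selector byte.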

4.3 Data Representation

The processor recognizes two classes of data: integers and reals (floating point numbers). There are three types of integers and two types of reals.

4.3.1 Integers

The three integer data types are all represented in standard 2’s complement. They are called Byte (B), Halfword (H) and Word (W) and are 8, 16 and 32 bits long respectively. The ranges for the three integer formats are specified as follows:

Most instructions treat integers as signed numbers but the logical operations (e.g. AND, OR) view their operands as unsigned quantities. Addresses are also treated by the processor as unsigned values. The address space is logically a linear set of bytes from address 0 to 2^32-1; thus addresses are unsigned 32 bit integers (Words).
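
The ranges of the three integer formats follow directly from their widths in 2’s complement. The short Python sketch below computes them, together with the highest byte address; the helper name `signed_range` is only illustrative.

```python
def signed_range(bits: int):
    """Return (lowest, highest) value of a 2's complement integer of this width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


INTEGER_TYPES = {"Byte (B)": 8, "Halfword (H)": 16, "Word (W)": 32}

for name, bits in INTEGER_TYPES.items():
    low, high = signed_range(bits)
    print(f"{name}: {low} to {high}")

# Addresses are unsigned 32 bit values, so the last byte address is 2^32 - 1.
MAX_ADDRESS = (1 << 32) - 1
print("Highest address:", MAX_ADDRESS)
```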

4.3.2 Reals

The floating point implementation in the processor conforms to the IEEE Binary Floating Point Standard (P754). With the floating point arithmetic not only are the rounded results as accurate as possible but it is feasible to compute guaranteed bounds on the errors using the special directed rounding modes. Also because of the high accuracy of Real (32 bits) computations and the availability of Longreal (64 bits) to back them up at crucial points, it will be possible to run many more programs in Real precision instead of automatically using Longreal everywhere.

The representations for the two floating point types are illustrated below including the formulas for the value represented. In the formulas “s” is the sign, “e” is the exponent, “f” is the fraction and “b” is the bias in the exponent.

Real  | s | e | f |   b = 127
     31      23   0
Longreal | s | e | f |   b = 1023
        63      51   0

The two formats are closely related, the distinguishing characteristics being the exponent range (defined by the parameter b) and the fraction precision. The Real format has 24 bits of precision (about 7 digits) with a range of approximately 10^(-38) to 10^(38). The Longreal format has a much wider range, about 10^(-308) to 10^(308), and more than twice the precision of Real at 53 bits or about 15 digits. Thus Longreal, besides being a powerful standalone computational format, makes an excellent backup facility for Real calculations at points in a program where the results are very sensitive to roundoff error.

This implementation conforms to the IEEE Floating Point Standard which was carefully designed to provide accurate and reliable arithmetic. The following properties are a result of the standard.

  1. Denormalized numbers (e=0) fill the space between zero and the smallest normalized number. They provide a far superior way of dealing with underflow than the typical “flush to zero” response.

  2. The implicit bit yields the greatest possible accuracy and is one of the two reasons for choosing radix 2. The other is speed; for a given amount of hardware binary will always be fastest.

  3. The offset (b) was chosen to ensure that all normalized numbers have representable reciprocals.

  4. The format was organized to permit very fast comparisons.

  5. Infinities (e=11…1, f=0) were explicitly represented to allow for handling zero divide and overflow exceptions.

  6. When e=11…1 and f<>0 the representation is treated as Not a Number (NaN) and instead of producing a numeric result when used as an operand the processor generates an exception. NaNs were provided to allow for software extensions including runtime diagnostics like “uninitialized variable” and to permit potentially flawed computation like 0/0 to continue in order to observe the effect, if any, on the final results.

  7. Longreal has greater range and more than double the precision of Real to permit exact Real multiply with no threat of overflow or underflow and generally to allow for Longreal accumulations of Real computations.
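
The field layout and the special cases listed above (denormalized numbers, the implicit bit, infinities and NaNs) can be illustrated with a small decoder. This is a sketch in Python rather than a description of the hardware: it uses the IEEE 754 biases (127 for the 32 bit format, 1023 for the 64 bit format), the function names are hypothetical, and it returns a Python NaN where the processor itself would raise an exception.

```python
import math
import struct


def decode(bits: int, exp_bits: int, frac_bits: int) -> float:
    """Decode an IEEE 754 bit pattern given its exponent and fraction widths."""
    bias = (1 << (exp_bits - 1)) - 1          # 127 or 1023
    s = bits >> (exp_bits + frac_bits)        # sign bit
    e = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    f = bits & ((1 << frac_bits) - 1)
    sign = -1.0 if s else 1.0

    if e == (1 << exp_bits) - 1:              # e = 11...1
        return sign * math.inf if f == 0 else math.nan
    if e == 0:                                # zero or denormalized: no implicit bit
        return sign * math.ldexp(f, 1 - bias - frac_bits)
    # Normalized: the implicit leading one is restored in front of the fraction.
    return sign * math.ldexp((1 << frac_bits) | f, e - bias - frac_bits)


def decode_real(bits: int) -> float:
    return decode(bits, exp_bits=8, frac_bits=23)


def decode_longreal(bits: int) -> float:
    return decode(bits, exp_bits=11, frac_bits=52)


print(decode_real(0x3F800000))       # normalized Real
print(decode_real(0x00000001))       # smallest denormalized Real
print(decode_longreal(0x7FF0000000000000))  # infinity

# Cross-check against the host's own IEEE 754 double representation.
pattern = 0x400921FB54442D18
host_value = struct.unpack("<d", struct.pack("<Q", pattern))[0]
print(decode_longreal(pattern) == host_value)
```

The denormalized branch shows why such numbers fill the gap between zero and the smallest normalized number: the exponent stays fixed while the fraction alone shrinks toward zero.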

The floating point architecture of the processor implemented in accordance with the principles of the present invention includes much more than the data representations. All of the IEEE Standard requirements are either met in the hardware or are facilitated in software. Among these requirements is the provision of rounding modes. In the Program Status (PS) register are two bits that control the rounding mode in effect. The modes are:

Another important facility in the floating point architecture is exception handling. The following required faults are recognized

All of these exceptions have an associated flag (and Inexact has an interrupt enable) in the PS register. If an exception occurs and its interrupt is enabled, the processor produces enough information for recovery. If the interrupt is disabled the flag is set and the processor takes predefined action:

The floating point architecture also provides all the standard instructions for all formats: add, subtract, multiply, divide, compare and conversion. But in addition there are some unusual but crucial instructions. Square root is correctly rounded and as fast as divide. Remainder is an exact operation and permits argument reduction for periodic functions with no roundoff error.

4.4 Registers

The following sections describe three types of registers in the processor: the General registers, the Input/Output registers and the Processor registers.

4.4.1 General Registers

The 16 General registers (128), shown in FIG. 9A, are labeled R0 to R15. They are 32 bits wide and are used for data and addresses. They are consistently symmetrical with no special designations or uses for any of them. When integer data shorter than 32 bits is moved to a General register it is sign-extended to 32 bits. When data longer than 32 bits are stored in the registers, the low order part of the data goes in the designated register, Ri, and the high order part resides in Ri+1. The numbers “wrap around” so that if a Longreal is moved to R15 the high order section is found in R0.
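
The sign extension and register pair "wrap around" rules can be modelled in a few lines. The following Python sketch represents the General registers as a list of 16 unsigned 32 bit values; the helper names are hypothetical and only the behavior described above is modelled.

```python
NUM_REGS = 16
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Sign-extend a `bits`-wide integer to a 32 bit register image."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return ((value ^ sign_bit) - sign_bit) & MASK32


def store_longreal(regs, i: int, bits64: int) -> None:
    """Place a 64 bit datum in Ri (low part) and Ri+1 (high part), wrapping at R15."""
    regs[i % NUM_REGS] = bits64 & MASK32
    regs[(i + 1) % NUM_REGS] = (bits64 >> 32) & MASK32


regs = [0] * NUM_REGS
store_longreal(regs, 15, 0x3FF0000000000000)   # Longreal 1.0 moved to R15
print(hex(regs[15]), hex(regs[0]))             # low part in R15, high part in R0
print(hex(sign_extend(0x80, 8)))               # Byte -128 as a 32 bit register image
```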

4.4.2 Input/Output Registers

In a processor, each of the 11 input and output ports (48), shown in FIG. 5, is an independent Direct Memory Access (DMA) channel and has two 32 bit registers: an address register and a count register. The address register contains a pointer to the least significant byte of the next halfword to be transferred. If it is an output port the data is moved from memory out to the port. If it is an input port the data received from the output port of the sending processor is moved into memory. In both cases the count register is set to indicate the number of bytes to be sent or received. As data is sent or received, the appropriate address and count registers are incremented and decremented respectively by the number of bytes transferred. When the count reaches zero the ready flag in the Input or Output Status register (see below) is set and an interrupt is generated if an interrupt has been enabled.

The DMA channels operate independently of instruction processing. They begin functioning whenever a count register is set to a nonzero value. All of the ports are general except that one input and one output port are designated “host” (H) and are normally used to communicate over the I/O bus to the System Control Boards.

4.4.3 Processor Registers

The Processor registers are the third type of register in the processor. All Processor registers are 32 bits wide. They contain all the special purpose and miscellaneous information and can only be loaded or stored by the Load Processor Register (LDPR) and Store Processor Register (STPR) instructions, respectively. These registers are labeled P0 to P12 but they also have unique names that denote their purpose:

Stack Pointer (P0, SP)

The SP contains a pointer to the current top of stack. The stack grows toward low memory.

Program Status (P1, PS)

This register contains the information that defines the current state of a program. The format of the PS is shown below:

 | REP | REP REG | RC | R R T IE IG H TO CE |
31    29        26   24                     16

 | IV R R R R HX IN OF FZ UF IX U N Z V C |
15                                        0

All of the fields are one bit except REP (2 bits), REP REG (4 bits), and RC (2 bits). The meanings of the fields are defined below (R is “Reserved”):

FLAGS

FLOATING POINT EXCEPTIONS

The indicated flag is set when the associated exception occurs and if not disabled the corresponding interrupt is generated. (In the present embodiment of the invention only the Inexact Result interrupt can be disabled.) The exceptions are defined in Section 4.5.

INTERRUPT ENABLE FLAGS

If a flag is set and the associated exception or event occurs an interrupt is generated. If the bit is zero the interrupt is suppressed until the interrupt condition is cleared or the interrupt is enabled. The floating point interrupt conditions are cleared as soon as the subsequent instruction begins execution.

CONTROL FIELDS

Fault Register (P2, FR)

When the processor takes an interrupt generated by an exception this register contains information to aid recovery. The format of the Fault Register is shown below.

 | R R R R R R R R R R R R |
31                        20

 | R R R R R S2 I2 E2 F2 S1 I1 E1 F1 GR RS RL |
15                                            0

The Guard, Round and Sticky bits are the hardware bits that are used for rounding in floating point operations as defined in the IEEE Binary Floating Point Standard. The Fraction, Exponent, Invalid and Sign bits for each operand allow an interrupt handler to determine if the operand is NaN, infinity, denormal, zero or “ordinary” (valid, nonzero) and its sign without decoding the instruction.

Configuration Register (P3, CR)

This register is used to set various configuration parameters including the Model Number which is a Read-Only field. The format of the CR is:

 | MODEL NUMBER | RESERVED |
31             24         16

 | RESERVED | TYPE | CYC | REFR |
15         12     11    987     0

Processor Identification Register (P4, PI)

The PI is set by the operating system at initialization and allows processors to identify themselves. The high order bit (31) indicates whether the processor is in the hypercube array (0) or on a System Control Board (1). The rest of the bits indicate the address or position of the processor in the array or on an Interface Board.

Timeout Register (P5, TR)

Approximately every 100 microseconds the unsigned value in this register is decremented. Thus it can count for about 5.1 days. If the Timeout Register is zero an interrupt is generated whenever it is enabled. Decrementing stops when the value in the Timeout reaches zero.
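
The counting span of the Timeout Register is a simple product of the register range and the decrement period. The Python sketch below works it out for two periods: exactly 100 microseconds, and 102.4 microseconds. The second value is purely an assumption made here to show how an "approximately 100 microsecond" period can give a span close to the quoted 5.1 days; the source does not state the exact period.

```python
SECONDS_PER_DAY = 86_400


def max_timeout_days(tick_seconds: float, register_bits: int = 32) -> float:
    """Longest interval an unsigned down-counter of this width can measure."""
    return (2 ** register_bits) * tick_seconds / SECONDS_PER_DAY


def ticks_for(seconds: float, tick_seconds: float = 100e-6) -> int:
    """Register value needed to time out after roughly `seconds`."""
    return round(seconds / tick_seconds)


print(f"{max_timeout_days(100e-6):.2f} days at exactly 100 microseconds")
print(f"{max_timeout_days(102.4e-6):.2f} days at an assumed 102.4 microseconds")
print(ticks_for(1.0), "ticks for a one second timeout at 100 microseconds")
```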

Output Ready (P6, OR)

There is a Ready flag for each output channel. When the flag is set to one it indicates that the count register for that channel is zero and the channel is ready to transmit more data. The format of the register is:

 | OH R R R R R R R R R R R R R R R |
31                                 16

 | R R R R R R O9 O8 O7 O6 O5 O4 O3 O2 O1 O0 |
15                                           0

where OH means Output Host, R is Reserved for future expansion and Oi is the Output port number i. The OR register is read only.
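
A value read from the Output Ready register can be turned into a list of ready ports by testing each defined bit. The Python sketch below assumes the port numbering given in section 4.6 (output ports 32 to 41 for O0 to O9, and port 63 for the host); the function name is hypothetical.

```python
HOST_BIT = 31          # OH occupies bit 31
NUM_ARRAY_PORTS = 10   # O0..O9 occupy bits 0..9


def ready_output_ports(output_ready: int):
    """Return the output port numbers whose Ready flag is set."""
    ports = [32 + i for i in range(NUM_ARRAY_PORTS) if output_ready & (1 << i)]
    if output_ready & (1 << HOST_BIT):
        ports.append(63)
    return ports


# Example register image: O0, O2 and OH set.
print(ready_output_ports(0x80000005))
```

Reserved bits are ignored by this sketch, so a nonzero reserved bit does not produce a spurious port number.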

Input Ready (P7, IR)

For each input port there is a flag which when set indicates that the corresponding count register has gone to zero, the channel has completed its DMA function and is now ready to receive more data. The format of the register is the same as the Output Ready register except I (Input) is substituted for O (Output). The IR register is read only.

Output Enable (P8, OE)

This register has the same format as the Output Ready register but the meaning of the flag is different. If a flag is set to one an interrupt is generated when the corresponding output channel is ready to transmit. The interrupt is suppressed if the flag is zero or if the Output enable (OI) flag in the Program Status register is zero.

Input Enable (P9, IE)

When an input count register has become zero and the channel is ready to receive, an interrupt is generated if the corresponding flag in this register is set to one. If the flag is zero or the Input enable (II) flag in the Program Status register is zero the interrupt is suppressed.

Input Pending (P10, IP)

If the count register of an input port is zero but there is a halfword in the port that has not been stored in memory, the corresponding bit in this register is set to one. This register is read only.

Input Parity Error (P11, PE)

Every halfword received is checked for parity. If an error is detected then after the transmission is complete (the count register becomes zero) instead of generating a “ready” interrupt, the corresponding flag in this register is set and an “input error” interrupt is generated. This register is read only.

Input Overrun Error (P12, IO)

If a halfword is received and overwrites a previously received halfword before it can be stored in memory an error is noted. After the count goes to zero instead of signaling a “ready” interrupt, the corresponding flag is set to one and an “input error” interrupt is generated. This register is read only.

4.5 Interrupts and Exceptions

The processor has a powerful vectored interrupt facility and generates several kinds of interrupts: program exceptions, software facilities, I/O signals and hardware errors. The program exceptions include integer overflow and zero divide, the floating point exceptions, stack overflow and address and reserved opcode faults. The software facility interrupts are trap, breakpoint and trace. The Input Ready, Output Ready, Input Parity and Input Overrun interrupts are the I/O signals. And the hardware errors are Corrected and Uncorrectable memory errors and Processor Self Test errors.

All interrupts (including the TRAP and breakpoint (BKPT) instructions) have the same convention. There is an unsigned number associated with the new interrupt (the argument of the trap instruction) that is multiplied by eight to give the absolute location in low memory of the interrupt vector. Each vector is eight bytes; the first four bytes contain the absolute address vector (VA) of the interrupt handling routine and the next four bytes are a new Program Status (NPS) value. When an interrupt is generated the processor pushes the Program Counter (PC) and the Program Status (PS) on the stack, sets the Program Status register to NPS and the Program Counter register to VA. If the interrupt is signaling a program exception (interrupts 3 through 12, see below) instead of saving the PC, the processor pushes the address of the offending instruction (“previous PC”) on the stack so that the exception handler can decode the instruction. One reason decoding may be necessary is because the IEEE Floating Point Standard requires the ability to construct a result, store it where the instruction would have and then continue the computation. When the interrupt handler is finished it executes a Return from Interrupt (REI) instruction that pops the old PS and PC values off the stack and into their respective registers. A TRAP instruction with the appropriate number as its argument can simulate any interrupt (except that the PC is always pushed on the stack with TRAP regardless of its argument).
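
The vector lookup and the push and pop convention can be sketched as a toy model. This Python sketch is illustrative only: it assumes little-endian word storage, it uses a Python list in place of the memory stack, and the class and function names are hypothetical.

```python
import struct

VECTOR_SIZE = 8  # bytes per vector: 4 for VA, 4 for NPS


def vector_address(interrupt_number: int) -> int:
    """Absolute low-memory address of the vector for an interrupt number."""
    return interrupt_number * VECTOR_SIZE


def read_vector(memory, interrupt_number: int):
    """Return (VA, NPS). Assumption: words are stored little-endian."""
    return struct.unpack_from("<II", memory, vector_address(interrupt_number))


class Cpu:
    """Toy model holding only PC, PS and a list used as the stack."""

    def __init__(self, memory):
        self.memory = memory
        self.pc = 0
        self.ps = 0
        self.stack = []

    def take_interrupt(self, interrupt_number: int, saved_pc=None):
        # Program exceptions save the address of the offending instruction
        # instead of the current PC; the caller passes it as saved_pc.
        self.stack.append(self.pc if saved_pc is None else saved_pc)
        self.stack.append(self.ps)
        self.pc, self.ps = read_vector(self.memory, interrupt_number)

    def rei(self):
        # Return from Interrupt: pop the old PS, then the old PC.
        self.ps = self.stack.pop()
        self.pc = self.stack.pop()


memory = bytearray(1024)
# Hypothetical handler address and new PS value for interrupt 3.
struct.pack_into("<II", memory, vector_address(3), 0x0400, 0x0001)

cpu = Cpu(memory)
cpu.pc, cpu.ps = 0x2000, 0x0010
cpu.take_interrupt(3)
print(hex(cpu.pc), hex(cpu.ps))
```

Calling `cpu.rei()` afterwards restores the PC and PS values that were in effect before the interrupt.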

All interrupts are defined below. The number at the left is the interrupt number.

4.5.1 Interrupt Definitions

The interrupt can be disabled but in either case the result stored in the destination is the low order part of the result (in divide it is the dividend).

4.5.2 Error Flag

There is an internal Error flag that is tied to the Error pin that indicates that the processor is in an unknown, inconsistent or failure state. On resetting the processor the Error flag is initialized to one and if the on-chip initialization sequence and subsequent diagnostic software run successfully it can be cleared by software (EROF). It is also set by consecutive unserviced Uncorrectable ECC errors. The Error flag and pin can also be set and reset by the ERON and EROF instructions respectively.

4.6 Communication

There are 22 unidirectional direct memory access (DMA) I/O channels on each processor, 11 for input and 11 for output. The Input ports are numbered 0,1,…,9 and 31; while the Output ports are numbered 32,33,…,41 and 63. The input and output ports are normally used in pairs to form 11 full duplex I/O channels as shown below:

{(0,32),(1,33),...,(9,41),(31,63)}

Ports 31 and 63 are normally used for communicating with the Host (on any System Control Board). Ports 0 to 9 and 32 to 41 are used to build the hypercube interconnection network. Numbers 10 to 30 and 42 to 62 are reserved for future expansion.
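
The port pairing above is a fixed offset of 32 between an input port and its output partner. The Python sketch below captures that pairing and also shows the conventional hypercube rule for finding neighbors. The neighbor rule is an assumption: it supposes that nodes carry N bit identifiers and that the channel pair for dimension k joins nodes whose identifiers differ only in bit k, which the text above does not state explicitly. All function names are hypothetical.

```python
VALID_INPUT_PORTS = set(range(10)) | {31}


def output_port_for(input_port: int) -> int:
    """Output port paired with an input port, e.g. (0, 32) or (31, 63)."""
    if input_port not in VALID_INPUT_PORTS:
        raise ValueError(f"port {input_port} is reserved or out of range")
    return input_port + 32


def neighbor(node_id: int, dimension: int) -> int:
    """Assumed rule: the neighbor along a dimension differs in that one bit."""
    return node_id ^ (1 << dimension)


def neighbors(node_id: int, n_dimensions: int):
    """All nodes directly connected to node_id in a 2^N node hypercube."""
    return [neighbor(node_id, k) for k in range(n_dimensions)]


print([(p, output_port_for(p)) for p in sorted(VALID_INPUT_PORTS)])
print(neighbors(0b0101, 4))
```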

Each of the I/O channels has an address register, a count register, a “ready” flag and an interrupt enable flag. In addition each input channel has a parity error flag, an overrun error flag and a “DMA pending” flag. Besides the enable for each channel there are two global enable flags in the Program Status (PS) register. The II flag disables all input interrupts (including errors) even if the corresponding channel flag is enabled and the IO flag disables all output interrupts.

In order to send a message from a memory buffer on a given output channel one first either checks its ready flag or enables its interrupt and waits for a “ready” interrupt. As soon as the channel indicates that it is ready (idle), the address register is set to point to the first (low) byte of the message, which must begin on an even boundary, by executing a LPTR (Load Pointer) instruction. The source operand of this instruction is the address of the message buffer and the destination operand is an integer whose value determines which of the channel registers is to be loaded:

In order to start the automatic message output, the corresponding count register must be set to the number of bytes in the message. (In this version of the processor the low order bit is forced to zero in both the address and the count registers; thus the message buffer must start on an even byte boundary and be an even number of bytes long. No error is signaled if a program violates this requirement.) This is done by executing a LCNT (Load Count) instruction. The destination operand indicates the register to be loaded as explained above for the LPTR instruction and the source operand is the count value (an unsigned 32 bit integer). The LCNT instruction also resets the parity and overrun error flags when setting up an input port. The message transmission is automatic and as data is sent the address register is incremented and the count is decremented by the number of bytes transferred. When the count becomes zero the output stops, the ready flag is set and if enabled the ready interrupt is generated.
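
The register behavior of a single output channel can be modelled as follows. This Python sketch is a simplified simulation under stated assumptions: each step transfers one halfword (two bytes), the transmitted bytes are collected in a buffer instead of being sent over a link, and the class and method names merely echo the LPTR and LCNT instructions.

```python
MASK32 = 0xFFFFFFFF


class OutputChannel:
    """Toy model of one DMA output channel's address and count registers."""

    def __init__(self, memory: bytes):
        self.memory = memory
        self.address = 0
        self.count = 0
        self.sent = bytearray()   # stands in for the physical link

    @property
    def ready(self) -> bool:
        # The channel is ready (idle) whenever the count register is zero.
        return self.count == 0

    def lptr(self, address: int) -> None:
        # The low order bit of the address register is forced to zero.
        self.address = address & ~1 & MASK32

    def lcnt(self, count: int) -> None:
        # The low order bit of the count is forced to zero; no error is raised.
        self.count = count & ~1 & MASK32

    def step(self) -> None:
        """Transfer one halfword if the channel is active."""
        if self.count == 0:
            return
        self.sent += self.memory[self.address:self.address + 2]
        self.address += 2
        self.count -= 2


memory = bytes(range(16))
channel = OutputChannel(memory)
channel.lptr(5)    # odd address: silently becomes 4
channel.lcnt(7)    # odd count: silently becomes 6
while not channel.ready:
    channel.step()
print(list(channel.sent), channel.address)
```

The example shows the consequence of the forced low order bit: an odd address or count is silently rounded down, so the transmitted bytes differ from what the program asked for.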

In addition to sending a message on a single channel, the processor has a powerful BROADCAST facility. In order to send a message over several channels at once, one must first ensure that the desired output channels are ready. Then a BPTR (Broadcast Pointer) instruction is executed. Its source operand is the address of the message as in LPTR but its destination operand is a 32 bit mask. Every bit position that is set to one will cause the corresponding output channel address register to be loaded. (Bit position 0 corresponds to output channel 32, position 1 to channel 33, etc.) The message broadcast is started by executing a BCNT (Broadcast Count) instruction whose destination operand is a mask as explained above for the BPTR instruction and whose source operand is an unsigned 32 bit integer equal to the number of bytes in the message. The major advantage of broadcasting is that the sending processor only has to access each transmitted datum once thus reducing the memory bandwidth used by the DMA facility. The processor can only handle one broadcast at a time so if a subsequent broadcast is attempted, even on different channels, before the current one is finished the results will be undefined.
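
The destination mask used by BPTR and BCNT can be built from a list of output port numbers. The Python sketch below assumes the bit numbering continues uniformly, so that the host port 63 maps to bit 31; the function name is hypothetical.

```python
def broadcast_mask(output_ports) -> int:
    """Build the 32 bit broadcast mask for a set of output port numbers."""
    mask = 0
    for port in output_ports:
        if not (32 <= port <= 41 or port == 63):
            raise ValueError(f"port {port} is not a defined output port")
        mask |= 1 << (port - 32)   # bit 0 is channel 32, bit 1 is channel 33, ...
    return mask


print(hex(broadcast_mask([32, 33, 35])))
print(hex(broadcast_mask([63])))
```

The same mask value would be supplied to both instructions, since BCNT must start exactly the channels whose address registers BPTR loaded.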

In order for a message to be transmitted successfully the corresponding input channel of the receiving processor must first be set up with an address to an input buffer and the same count as the output channel. One way this can be accomplished is by using a software protocol that always sends a single halfword as the length of the desired message and waiting for the receiving processor to respond with a halfword code that indicates “ok to send message”. This protocol will work because the last halfword that is sent remains available for DMA even if the receiving processor’s input channel is uninitialized (count=zero). The presence of this data in the input channel is indicated by the corresponding bit in the INPUT DATA PENDING register (which can be tested by software) being set. Thus as soon as the count register is set to one, the halfword (either the length or an “ok to send”) is stored in memory.

Before attempting to DMA the data to memory that is in an uninitialized input port the error (Overrun and Parity) flags must first be checked or they will be lost. This is because the Load Count instruction clears the error flags.

The processor recognizes two types of errors in communication. Each halfword is sent with a parity bit and on reception a parity check is made. Also if a halfword is received into a DMA channel before the previous one is stored in memory an input overrun error is detected. (Overrun can occur when the input count goes to zero before the output count–a software error, or when too many messages are being sent to the processor at the same time.) If either type of error occurs the corresponding flag is set and when the input count reaches zero instead of “ready”, an “input error” interrupt is generated (if II is set). A software error that is not detected by the processor occurs when the output count is smaller than the input. In that case, after the message is sent the input channel will simply hang. This condition can be avoided by correct software or by setting up timeout conditions using the Timeout Register.

4.7

4.8

4.9 Processor Initialization

A processor can be initialized by either asserting the reset pin or by executing a RSET instruction. The resulting initialization is significantly different in the two cases. They are both described below.

4.9.1 Hardware Initialization

Hardware initialization is done by asserting the reset pin and proceeds in several steps:

4.9.2 Initialization Procedure (shadow ROM)

The code in the on-chip shadow ROM is listed below with comments.

! During shadow ROM execution all interrupts are disabled including
! interrupts that are not normally maskable;
RSET ;

! The RAM chips need 8 refresh cycles to initialize themselves. The
! refresh rate starts at one refresh every 8 cycles since the Configuration
! register is set to zero on reset. We idle for the required 64 cycles by
! looping on RSET 10 times. Each loop takes 7 cycles (3 for the RSET and
! 4 for the REP);

MOVW #11,R0;
REP R0;
RSET;

! The refresh rate is lowered to every 40 cycles by writing a 4 in the
! Configuration register. This is conservatively high but the operating
! system can lower it further if the processor clock rate justifies it;

LDPR #4,#CONFIG;

! Memory is now initialized with correct ECC bits by writing zero to
! every location. Since the Configuration register is initialized to
! assume 16k×4 memories, only the first quarter of memory is initialized
! by writing 8191 words. If the operating system changes the Configuration
! to 64k×4, then it should initialize the last 3/4 of memory;

MOVW #8191,R0;
MOVW #0,R1;
REP R0;
MOVW #0,(R1)+;

! A self test belongs here. The result is encoded and stored in memory
! at location 4. A -1 means everything is fine;

MOVH #-1,4;

! Bit 31 of the ID Register is initialized when the reset pin is asserted
! with a one if the processor is an I/O processor or a zero if the processor
! is an array processor. I/O processors are initialized from memory while
! array processors are initialized by the serial ports;

STPR #IDREG,R0;
BL IOINIT;

! Array processor initialization waits for a port to receive a
! message. The code below assumes that only one port will try to initialize
! the processor. If messages come in at two ports exactly at the same time,
! the code may not work;

PROCINIT: STPR #INPEND, R0  ! Are any incoming messages pending?
          BF PROCINIT       ! No, try again
          FFOW R0, R1       ! Yes, R1 gets the port number

! Initialize the port so DMA transfer of a two byte message to location
! 2 will occur;

LPTR #2,R1;
LCNT #2,R1;

! Compute in R3 the corresponding output port for a reply;

MOVW R1,R3;
ADDW #32,R3;

! Wait for incoming message DMA to complete;

INWAIT1: STPR #INRDY, R2   ! Store input ready flags in R2
         BITW R2, R0       ! Test the appropriate flag
         BE INWAIT1        ! Loop until port is ready

! Start the output port DMA. The message will be the two byte self test
! status in location 4;

LPTR #4,R3;
LCNT #2,R3;

! Reinitialize the same input port to receive the contents of memory;

         LPTR #8, R1      ! The message will start at location 8
         LCNT 2, R1       ! for number of bytes indicated by the first message

INWAIT2: STPR #INRDY, R2  ! Wait for input DMA to complete by
         BITW R2, R0      ! testing the appropriate ready flag
         BE INWAIT2       ! and looping back until ready (done)

! Jump to a preset location (1024) to begin execution from memory. The
! JMP resets the "shadow ROM active" flag;

JMP 1024;

! I/O processor initialization. Wait for memory location 0 or 1 to go
! nonzero. The external processor that loads the memory image must wait
! at least xxx cycles after the RESET signal has gone away;

IOINIT: BITH #-1,0   ! Test halfword at location 0
        BE IOINIT    ! Loop back until it becomes non-zero

! Jump to a preset location (1024) to begin execution from memory. The
! JMP resets the "shadow ROM active" flag;

JMP 1024;

! End of shadow ROM code;

MEMORY                      ADDRESS
0                           0
Initial message length      2
test results                4
reserved                    6
DMA (int? ? vector)         8
etc