Exploring NISC Architectures for Matrix Application

ACEEE Int. J. on Communication, Vol. 02, No. 02, July 2011

Sadhna K. Mishra (1), Arvind Rajawat (2), and R. P. Singh (2)
(1) MANIT/Comp. Science & Engg, Bhopal, India
Email: sadhnaguddy@rediffmail.com
(2) MANIT/Elect. & Communication, Bhopal, India
Email: {rajawata, singhrp}@manit.ac.in

Abstract— The paper presents the design of a target NISC (No Instruction Set Computer) architecture for matrix applications in a C-based design flow. It starts with the implementation of a standard application program, from which customized designs are generated using the NISC toolset. Further, it demonstrates and analyzes the compilation and simulation results of several matrix applications on a number of different available NISC architectures in terms of register and execution cycle counts. Subsequently, a comparative analysis is presented to explore the options and select the best set of architectures.

Index Terms— NISC, ASIP, CISC, RISC, Datapath.

*Research Scholar, MANIT, Bhopal, India, 462051, E-mail: sadhnaguddy@rediffmail.com, Tel: (0755) 3296849.
**Assoc. Prof., Elect. & Communication, MANIT, Bhopal, India, 462051, E-mail: rajawata@manit.ac.in, Tel: (0755) 2670558.
***Prof., Elect. & Communication, MANIT, Bhopal, India, 462051, E-mail: singhrp@manit.ac.in, Tel: (0755) 2671666.

© 2011 ACEEE
DOI: 01.IJCOM.02.02.157

I. INTRODUCTION

In recent years, with the increasing complexity of embedded systems, designers have been searching for an alternative approach that can handle this complexity and achieve the dual target of a better design and compliance with strict design constraints such as size, timing and performance trade-offs. During the past two decades, several design methodologies have emerged, such as (i) Application-Specific Instruction-Set Processors (ASIP) and (ii) High-Level Synthesis (HLS) (figure 1). The concept of NISC (No-Instruction-Set Computer) is a step forward in this endeavor. The NISC concept offers an entirely new approach to the design of custom processors.

Figure 1: Layout of CISC, RISC and NISC [15]

The NISC toolset is used primarily for (i) C-to-RTL synthesis and (ii) Embedded Custom Processor Design. The historical progression of processor architecture can be broadly divided into three phases. During the 1970s, CISC was the popular choice. Given that the Program Memory (PM) was slow, designers tried to improve performance by constructing complex instructions. In implementation, each complex instruction takes several clock cycles, with the datapath control words for each clock cycle stored in a much faster Micro Program Memory (mPM). The concept of microprogramming allows for emulation of any instruction set and construction of specialized instructions, while also speeding up execution. On the downside, microprogramming did not allow for efficient pipelining of the given datapath [12] [14] [15] [16] [17]. During the late 1980s, RISC became popular; its central idea was to eliminate the complex instructions and the mPM. In RISC, all instructions are simple and execute in one clock cycle, allowing the datapath to be efficiently pipelined in 4-8 pipeline stages. Here, the mPM is replaced with a decoding stage that follows the instruction fetch from PM. Given that the instructions are simpler, a RISC needs approximately two instructions for each complex instruction and, consequently, the size of the PM is doubled. Nevertheless, the Fetch-Decode-Execute-Store pipeline of the whole processor improved the execution speed several times in comparison to its predecessor [12] [14] [16].

The present paper is based on the NISC toolset, and the work compiles the results of different matrix applications on the available NISC architectures and on other custom architectures. With the matrix applications, the aim was to show the use of embedded processors in embedded control systems. These processors require performance in terms of basic mathematical abilities, bit manipulation, etc. The work involved generating Verilog code from benchmark C code and subsequently exploring different options to achieve the best set of results. The paper is organized as follows: section 2 covers the “NISC Architecture”; section 3 the “NISC Methodology”; section 4 the “Project Methodology”; section 5 the “Implementation”; and section 6 gives the “Conclusion” of the study.

II. NISC ARCHITECTURE

As shown in figure 2, a typical NISC architecture comprises (i) control pipelining, i.e. the CW and Status registers, (ii) datapath pipelining, i.e. pipelined components or registers at the input/output of components, and (iii) data forwarding, i.e. the dotted connection lines from the output of some components to the input of some others. Here, the control word register (CW) contains the controls for both the datapath and the address generator (AG) of the controller, and the datapath section of CW contains the control values of all datapath
components as well as small constant fields [11, 15].

Figure 2: The NISC Processor [11, 15]

Simultaneously, the controller section of CW determines how the next PC address is calculated: it provides a condition, a jump type, either (i) direct or (ii) indirect, and an offset to the AG. For direct jumps, the AG calculates the target address by adding the offset to the current value of PC. For indirect jumps, the AG uses the value on its address port as the target address. If the condition in CW and the status input of the AG are equal, the calculated target address is loaded into PC; otherwise, PC is incremented. Further, in a NISC processor, there is a link register (LR) in the controller which stores the return address of a function call; the return address is usually the value of the current PC plus one. In addition to standard components, the datapath can have pipelined and multi-cycle components; for example, ALU, MUL and Mem are single-cycle, pipelined and multi-cycle components, respectively. There is no limitation on the connections of components in the datapath; however, if the input of a component comes from multiple sources, a bus or a multiplexer is used to select the actual input. The buses are explicitly modeled and we assume one control bit per writer to the bus. The multiplexers are implicit and we assume log₂ n control bits for n writers [11, 15].

III. NISC METHODOLOGY

Figure 3: The Methodology of NISC [7]

NISC methodology is relatively simple and well defined for use by application programmers. The user specifies an application in the C programming language and, after studying the program structure, selects a NISC architecture from the list or defines his/her own architecture by selecting components and their connectivity from a component list. The C program is then compiled for the given architecture using the NISC compiler. After compilation, the result can be simulated and evaluated. Improvements can be made by changing the original C program or the selected architecture. At the end, the RTL generator is used to generate Verilog RTL code for FPGA/ASIC implementation. The flow is shown in figure 3 [5] [7].

IV. PROJECT METHODOLOGY

The project methodology is presented through two sets of experiments.

A. C-to-RTL synthesis
This part presents the compilation and simulation results of several benchmarks of different categories on different NISC architectures. The matrix applications were compiled and run on a set of generic NISC architectures, listed below:
GN_0, which comprised No-Pipelined Datapath, No data forwarding path, Single port memory, 2 Input bus, 1 Output bus, 1 Shared constant/offset port, 1 RF2x1 integer, 1 ALU, 1 Multiplier_signed, 1 Display, 1 Comparator and 1 Converter;
GN_1, which comprised No-Pipelined Datapath, No data forwarding path, Single port memory, 2 Input bus, 1 Output bus, 1 Shared constant/offset port, 1 RF2x1 integer, 1 ALU, 1 Multiplier_signed, 1 Display, 1 Comparator, 1 Converter, 1 Interrupt unit and 1 External output register;
GN_2, which comprised Pipelined Datapath, No data forwarding path, Single port memory, 2 Input bus, 1 Output bus, 1 Shared constant/offset port, 1 RF2x1 integer, 1 ALU, 1 Multiplier_signed, 1 Display, 1 Comparator and 1 Converter;
GN_3, which comprised Pipelined Datapath, No data forwarding path, Single port memory, 2 Input bus, 1 Output bus, 1 Shared constant/offset port, 1 RF2x1 integer, 1 ALU, 1 Multiplier_signed, 1 Display, 1 Comparator, 1 Converter, 1 Interrupt unit and 1 External output register;
GN_4, which comprised Pipelined Datapath, Full data forwarding path, Single port memory, 2 Input bus, 1 Output bus, 1 Shared constant/offset port, 1 RF2x1 integer, 1 ALU, 1 Multiplier_signed, 1 Display, 1 Comparator and 1 Converter;
GN_5, which comprised Pipelined Datapath, Full data forwarding path, Single port memory, 2 Input bus, 1 Output bus, 1 Shared constant/offset port, 1 RF2x1 integer, 1 ALU, 1 Multiplier_signed, 1 Display, 1 Comparator, 1 Converter, 1 Interrupt unit and 1 External output register.
Finally, NMIPS, which comprised 2 Input bus, 1 Output bus, RF2x1, 1 Shared constant/offset port, Single port memory, ALU, Multiplier_signed and Display. This datapath was made as similar as possible to that of the MIPS M4K [8].

B. Embedded Custom Processor Design
This part used the Embedded Custom Processor Design flow to specify the datapath and the custom functional units; the different applications were then compiled, and the suitability of a specific architecture for a particular class of application was analyzed. The first custom NISC in this study comprised Single port memory, RF6x3, 2 ALUs, 2 Comparators, 2 Multiplier_signed, 2 Divider_signed, 2 Divider_unsigned, Display and 3 constants. The second comprised Single port memory, RF6x3, 1 ALU, 1 Comparator, 1 Multiplier_signed, 1 Divider_signed, 1 Divider_unsigned, Display and 3 constants. Finally, the third comprised Single port memory, RF6x3, 1 ALU, 1 Comparator, 1 Multiplier_signed, Display and 3 constants.

TABLE I. RF MAX VALUES OF AVAILABLE ARCHITECTURES
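
The generic architectures of Section IV-A differ only in three respects: whether the datapath is pipelined, whether it has a full data forwarding path, and whether an interrupt unit and external output register are present. As a minimal sketch, this can be captured in a small Python table. The class name, field names and helper functions below are illustrative assumptions and are not part of the NISC toolset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericArch:
    """Distinguishing features of one generic NISC architecture (hypothetical model)."""
    name: str
    pipelined: bool        # pipelined datapath
    data_forwarding: bool  # full data forwarding path
    interrupt_unit: bool   # interrupt unit plus external output register


# Feature values transcribed from the architecture list in Section IV-A.
GENERIC_ARCHS = [
    GenericArch("GN_0", pipelined=False, data_forwarding=False, interrupt_unit=False),
    GenericArch("GN_1", pipelined=False, data_forwarding=False, interrupt_unit=True),
    GenericArch("GN_2", pipelined=True, data_forwarding=False, interrupt_unit=False),
    GenericArch("GN_3", pipelined=True, data_forwarding=False, interrupt_unit=True),
    GenericArch("GN_4", pipelined=True, data_forwarding=True, interrupt_unit=False),
    GenericArch("GN_5", pipelined=True, data_forwarding=True, interrupt_unit=True),
]


def non_pipelined(archs):
    """Names of architectures without a pipelined datapath."""
    return [a.name for a in archs if not a.pipelined]


def with_forwarding(archs):
    """Names of architectures with a full data forwarding path."""
    return [a.name for a in archs if a.data_forwarding]


print(non_pipelined(GENERIC_ARCHS))
print(with_forwarding(GENERIC_ARCHS))
```

Filtering on `pipelined` yields the two non-pipelined architectures, GN_0 and GN_1, which is why they serve as the comparison baseline for the non-pipelined custom architectures.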

V. IMPLEMENTATION
The entire work was implemented through a set of application C programs, which were compiled and run on the generic NISC architectures GN_0, GN_1, GN_2, GN_3, GN_4, GN_5 and NMIPS, and finally on the custom processors Custom Arch 1, Custom Arch 2 and Custom Arch 3 described earlier. Matrix multiplication is a binary operation that takes a pair of matrices and produces another matrix. Matrices offer a concise way of representing linear transformations between vector spaces, and matrix multiplication corresponds to the composition of linear transformations: the resulting matrix agrees with the composition of the linear transformations represented by the two original matrices.
The product of an m×p matrix A with a p×n matrix B is an m×n matrix denoted AB whose entries are:

\[ (AB)_{i,j} = \sum_{k=1}^{p} A_{ik} B_{kj} \]

where 1 ≤ i ≤ m is the row index and 1 ≤ j ≤ n is the column index.
The experimental outputs from these exercises are shown below in terms of register and execution cycle counts of each matrix application.

A. C-to-RTL synthesis & Embedded Custom Processor Design Results on the Basis of RF Max
The RF Max values of the available NISC architectures and applications are presented in table I. All the programs were scheduled in five major parts: _$jump ToStartupMain, NiscInterrupt, _$ NiscstartupMain, the Application program (e.g. Matrix Multiplication, Matrix Multiplication unroll-1, Matrix Multiplication unroll-2) and NiscMain. Most of the registers were used by the Application Programs. Computation was done in four steps: calculating the memory address, loading the values from data memory, performing the operation and finally generating the results. Briefly, the results indicated that GN_4 and GN_5 (pipelined datapath with data forwarding) had the lowest RF Max values. Compared with the values for MIPS M4K, GN_4 and GN_5 performed well because the Application Programs (e.g. Matrix Multiplication, Matrix Multiplication unroll-1, Matrix Multiplication unroll-2) required fewer registers. Figure 4 is the graphical representation of table I.

Figure 4. RF Max Value of Available NISC Architecture.

The RF Max values of the custom NISC architectures for the matrix applications are presented in table II. Since none of the custom architectures had a pipelined datapath, their results were compared to GN_0 and GN_1 (both non-pipelined). The results indicated that Custom Arch 1, Custom Arch 2 and Custom Arch 3 had comparatively lower RF Max values for matrix multiplication [8][8]. The fourth row of table II shows that all the custom architectures required more registers, so GN_0 and GN_1 were the most suitable for matrix multiplication unroll-1. The fifth row of the table shows that Custom Arch 1 was the best option compared to GN_0 and GN_1, given that it required fewer registers for the application program. Figure 5 is the graphical depiction of table II.

TABLE II. RF MAX VALUE CUSTOM NISC ARCHITECTURE
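
The matrix product defined in Section V, together with a loop-unrolled variant, can be sketched as follows. The benchmark programs in the study are written in C; Python is used here only for brevity. The text does not specify what unroll-1 and unroll-2 denote, so the second function is an assumption that simply unrolls the inner summation loop by a factor of two. The function names are hypothetical.

```python
def matmul(a, b):
    """Product of an m x p matrix a and a p x n matrix b (lists of lists)."""
    m, p, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for k in range(p):
                # (AB)[i][j] accumulates A[i][k] * B[k][j] over k
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


def matmul_unrolled2(a, b):
    """Same product with the inner loop unrolled by two (illustrative assumption)."""
    m, p, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            k = 0
            # Two multiply-accumulate steps per iteration.
            while k + 1 < p:
                acc += a[i][k] * b[k][j] + a[i][k + 1] * b[k + 1][j]
                k += 2
            # Handle the leftover term when p is odd.
            if k < p:
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(matmul(A, B))            # [[58, 64], [139, 154]]
print(matmul_unrolled2(A, B))  # same result as matmul(A, B)
```

Unrolling exposes more independent multiply and add operations per loop iteration, which is the kind of parallelism a pipelined datapath with data forwarding can exploit, at the cost of keeping more values live in registers.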



Figure 5. RF Max Value of Custom NISC Architecture

B. C-to-RTL synthesis & Embedded Custom Processor Design Results on the Basis of Cycle Count
Table III presents the cycle counts of the available NISC architectures for the matrix applications. Briefly, the results show that GN_0 and GN_1 (non-pipelined) were suitable architectures for matrix multiplication [8][8] and matrix multiplication unroll-1. As mentioned earlier, the programs were scheduled in five parts: _$jump ToStartupMain, NiscInterrupt, _$ NiscstartupMain, the Application program and NiscMain. In GN_0 and GN_1, the NiscInterrupt and NiscMain parts required fewer cycles, which could have affected the total cycle count.

TABLE III. CYCLE COUNT VALUE AVAILABLE ARCHITECTURE

However, GN_4 and GN_5 (pipelined datapath with data forwarding) were the suitable architectures for matrix multiplication unroll-2, where they had the lowest cycle counts. Figure 6 is the graphical representation of table III.

Figure 6. Cycle Count Value of Available NISC Architecture

The cycle counts of the custom NISC architectures for matrix multiplication are presented in table IV.

TABLE IV. CYCLE COUNT VALUE OF CUSTOM ARCHITECTURE

Since none of the custom architectures had a pipelined datapath, their results were compared to GN_0 and GN_1 (both non-pipelined). The first column of the table shows that Custom Arch 1 had the lowest cycle count, so it was the best architecture for all the matrix applications on the basis of cycle count. Figure 7 is the graphical illustration of table IV.

Figure 7. Cycle Count Value of Custom Architecture

VI. CONCLUSIONS
This paper presents the design of a matrix multiplication C code implementation and a set of customized designs evaluated in terms of register and cycle counts. It also shows and analyzes the simulation results of several benchmarks on a number of available NISC architectures and custom architectures. As mentioned above, the initial results indicated that the GN_4 and GN_5 architectures were relatively suitable for all applications with respect to RF Max. Among the custom architectures, Custom Arch 1 was the best option for matrix multiplication [8][8] and matrix multiplication unroll-2, whereas for matrix multiplication unroll-1, GN_0 and GN_1 were the best options. Finally, with respect to cycle count, GN_0 was suitable for matrix multiplication [8][8] and matrix multiplication unroll-1, while for matrix multiplication unroll-2, GN_4 and GN_5 were the best options. Overall, the custom architecture Custom Arch 1 gave the best results for all the matrix applications.



REFERENCES
[1] M. Reshadi, P. Mishra & N. Dutta, “Hybrid Compiled Simulation: An efficient technique for instruction-set architecture Simulation”, ACM Transactions on Embedded Computing Systems (TECS), April 2009.
[2] B. Gorjiara, M. Reshadi & D. Gajski, “Merged Dictionary Code Compression for FPGA Implementation of Custom Microcoded PEs”, ACM Transactions on Reconfigurable Technology and Systems, 2008.
[3] B. Gorjiara & D. Gajski, “Automatic Architecture Refinement Techniques for Customizing Processing Elements”, Design Automation Conference (DAC), June 2008.
[4] M. Reshadi, B. Gorjiara & D. Gajski, “C-Based Design Flow: A Case Study on G.729A for Voice over Internet Protocol”, Design Automation Conference (DAC), pp. 72-75, May 2008.
[5] NISC Technology website: http://www.cecs.uci.edu/~nisc/
[6] J. Trajkovic & D. Gajski, “Automatic Data Path Generation from C code for Custom Processors”, International Embedded Systems Symposium, May 2007.
[7] B. Gorjiara & D. Gajski, “FPGA-friendly Code Compression for Horizontal Microcoded Custom IPs”, FPGA, pp. 108-115, February 2007.
[8] B. Gorjiara, M. Reshadi & D. Gajski, “Designing a Custom Architecture for DCT Using NISC Technology”, Asia and South Pacific Design Automation Conference (ASPDAC), Design Contest, January 2006.
[9] B. Gorjiara, M. Reshadi & D. Gajski, “NISC Communication Interface”, Center for Embedded Computer Systems, TR 06-05, March 2006.
[10] J. Trajkovic & D. Gajski, “Communication Design for No Instruction Set Computer”, Center for Embedded Computer Systems, TR 05-09, July 2005.
[11] B. Gorjiara, M. Reshadi & D. Gajski, “NISC Technology and Preliminary Results”, Center for Embedded Computer Systems, TR 05-11, August 2005.
[12] M. Reshadi & D. Gajski, “No-Instruction-Set-Computer (NISC) Technology”, Center for Embedded Computer Systems, pp. 1-21, 2005.
[13] M. Reshadi & D. Gajski, “NISC Modeling and Simulation”, Center for Embedded Computer Systems, TR 04-08, pp. 2-5, March 2004.
[14] M. Reshadi & D. Gajski, “NISC Application and Advantages”, Center for Embedded Computer Systems, TR 04-08, pp. 2-5, March 2004.
[15] M. Reshadi & D. Gajski, “NISC Modeling and Compilation”, Center for Embedded Computer Systems, TR 04-33, pp. 2-7, December 2004.
[16] D. Gajski, “NISC: The Ultimate Reconfigurable Component”, Center for Embedded Computer Systems, TR 03-28, pp. 2-8, October 2003.
[17] M. Morris Mano, Computer System Architecture, Prentice Hall, India, 2003, ch. 8, pp. 282-285.
