# An Optically Interconnected Pipelined Parallel Processing System: OCULAR-II

Makoto Naruse<sup>*a*</sup>, Haruyoshi Toyoda<sup>*b*</sup>, Yuji Kobayashi<sup>*b*</sup>, Daisuke Kawamata<sup>*a*</sup>, Neil McArdle<sup>*a*</sup>, Alain Goulet<sup>*a*</sup>, and Masatoshi Ishikawa<sup>*a*</sup>

<sup>a</sup>Department of Mathematical Engineering and Information Physics, Graduate School of Engineering, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan <sup>b</sup>Hamamatsu Corporation, Hiraguchi 5000, Shizuoka 434-0041, Japan

# ABSTRACT

A multi-layered optoelectronic parallel processing system, called Optoelectronic Computer Using Laser Arrays with Reconfiguration (OCULAR-II), is presented. The system consists of layers of processing modules, each composed of an array of programmable electronic processing elements with parallel optical input/output (I/O), connected by optical interconnection modules. Every module is designed to be modular and cascadable. Algorithms for this system are also presented that exploit the aggregate bandwidth supplied by optics and the computational versatility of electronic processors.

**Keywords:** optical interconnections, optoelectronic VLSI, smart pixels, parallel processing system, pipelined architecture

# 1. INTRODUCTION

Owing to the remarkable improvement of microprocessors and digital signal processors (DSPs) and the increasing amount of data transmitted in computer networks, the overall performance of computer systems is now limited by communication capability rather than by the performance of the processors at the terminals. Optical interconnections using integrated optoelectronic devices such as vertical-cavity surface-emitting lasers (VCSELs) are promising candidates to overcome bandwidth limitations owing to their physical advantages over their electronic counterparts.<sup>1</sup>

In addition to high aggregate bandwidth, sufficient computation capability should be provided to match the advanced features of parallel optical communication devices. For example, the amount of data in recording media and in computer networks is increasing drastically, which makes it hard to find the appropriate contents in a short time. The direct connection between processing devices and communication devices, as in optoelectronic very-large-scale integrated circuits (OE-VLSI) or smart pixels, is a solution that meets the requirement of transferring tremendous amounts of data while performing computation at the same time.

However, although parallel optical interconnects and OE-VLSIs can potentially solve the pin bottleneck problem or eliminate clock skew at the physical layer of the system, architectural issues and applications must also be treated properly so that the potential advantages of optics are fully exploited.<sup>2</sup>

In this paper we present the design and implementation of an optically interconnected parallel processing system called Optoelectronic Computer Using Laser Arrays with Reconfiguration (OCULAR-II). This system is based on a pipelined architecture that exploits both the aggregate bandwidth supplied by optical interconnections and the digital signal processing versatility provided by programmable VLSI chips. This paper also discusses algorithms for optoelectronic hybrid systems that take advantage of both optics and electronics.

This paper is organized as follows. Section 2 describes the architecture of OCULAR-II. Sections 3 and 4 describe the processor module and the optical interconnection module, respectively. Section 5 discusses the realization of a two-layer system and applications for OCULAR-II. Section 6 concludes the paper.

Correspondence: E-mail: naruse@k2.t.u-tokyo.ac.jp URL: http://www.k2.t.u-tokyo.ac.jp/~sfoc/ TEL: +81.3.5841.6902 FAX: +81.3.5841.6937

# 2. OPTOELECTRONIC HIERARCHICAL PIPELINED ARCHITECTURE

## 2.1. Hierarchical Structure

Although two-dimensional parallel optical devices such as VCSELs offer massive parallelism, their capabilities cannot be fully utilized without an architecture appropriate for high-speed processing. For example, in a conventional image processing architecture there is a serious data transfer bottleneck between the image acquisition devices and the processors due to the parallel-to-serial conversion of the data; this is called the input/output (IO) bottleneck. The communication procedure limits the overall performance of the system even when advanced LSI processors provide sufficient processing performance. Integrating sensors and processors on the same LSI die is one of the ideal approaches to eliminating the IO bottleneck. However, since the total number of transistors that can be fabricated on an LSI is physically limited, complex numerical computations that require more than the available on-chip transistors cannot be executed. Therefore, the IO bottleneck occurs again between the chip and other devices, even if photo sensors and processors are integrated.

OCULAR-II eliminates this bottleneck by arranging integrated general-purpose digital processor modules, each having both optical input and output devices, into a hierarchical, or pipelined, structure as shown in Fig. 1. The optical interconnection between processing modules provides sufficient data transfer bandwidth owing to its two-dimensional optical data paths, whose spatial bandwidth density is well beyond that of electronic interconnections.

In addition to the large bandwidth supplied by optics, global interconnectivity among processor arrays is achievable since the communication channels are constructed via optical paths in free space. Broadcast or multi-cast operations, and interconnections between processors located far away from each other, can be efficiently implemented by free-space optical interconnections in OCULAR-II.

Consider the overall system configuration of OCULAR-II, including the controller or host computer. Since the bandwidth between the processing modules and other electronic devices is limited by the electronic interconnections, it could be difficult to obtain information from the optical data paths because of the large bandwidth mismatch. However, since each processing layer in OCULAR-II has computation capability, the bandwidth can be used properly by distributing the required tasks over several modules in the system, or by compressing the data to be downloaded to the host computer, so that the IO bottleneck does not occur. In other words, the hierarchical pipelined structure makes the information flow smoothly by using the bandwidth of the optical and electronic interconnections together with the computations performed at each processor layer. To exploit both the optics and the electronics in the system, algorithm design for OCULAR-II is another important issue.



Figure 1. OCULAR-II: System architecture.

## 2.2. Programmability

In OCULAR-II, the interconnection topology between processing modules is reconfigurable through phase modulation of the optical beams, performed by a spatial light modulator in the Optical Interconnection Module. An interconnection pattern that best suits a given application can be programmed. The principle of the reconfigurable optical interconnects is based on a 4f optical system in which a computer-generated hologram (CGH) is placed at the Fourier plane. Suppose the emission pattern of a two-dimensional optical device having $N \times N$ elements is $I(i, j)$, and let the inverse Fourier transform of the CGH be $m(i, j)$; then the output pattern of the 4f optical system is the convolution of $I(i, j)$ and $m(i, j)$. Therefore, in principle, $N^2$ kinds of inter-module optical interconnection topology are available by changing the convolution kernel $m(i, j)$, and the total number of channels between modules can potentially be $N^4$. Since the interconnection topology is programmable, as is the processing element array described in Section 3, the OCULAR-II system is fully programmable. This programmability is one of the most important features of OCULAR-II for the realization of hierarchical processing, since information must be properly processed at each module in order to obtain the required information at each layer while the overall information flows smoothly in the pipeline.
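The convolution relation above can be illustrated with a minimal sketch. It assumes an idealized, lossless system in which the kernel $m(i, j)$ is a sparse set of weighted shifts; the function name `interconnect`, the dictionary representation of the kernel, and the 1-to-4 fan-out weights are illustrative assumptions, not part of the OCULAR-II design.

```python
def interconnect(emission, kernel):
    """Convolve an N x N emission pattern with a sparse shift kernel.

    kernel maps a (row_shift, col_shift) pair to a weight; light shifted
    outside the N x N detector array is discarded.
    """
    n = len(emission)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if emission[i][j] == 0:
                continue
            for (di, dj), weight in kernel.items():
                ti, tj = i + di, j + dj
                if 0 <= ti < n and 0 <= tj < n:
                    out[ti][tj] += weight * emission[i][j]
    return out


N = 8
emission = [[0.0] * N for _ in range(N)]
emission[2][3] = 1.0  # a single emitter switched on

# Hypothetical 1-to-4 fan-out kernel: equal power to four adjacent detectors.
fanout = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
received = interconnect(emission, fanout)
lit = [(i, j) for i in range(N) for j in range(N) if received[i][j] > 0]
print(lit)  # detectors that receive light from emitter (2, 3)
```

Changing only the `fanout` dictionary changes the topology, which mirrors how rewriting the CGH reprograms the interconnection without touching the processor modules.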

## 2.3. Modularity

The optoelectronic pipelined system is modular: different modules can be stacked together, so the overall system can be extended by cascading components, each of which is a building block of the total system. This modularity also allows every part of the demonstrator system to be tested independently and maintained easily. Specifically, OCULAR-II consists of two main modules: the "Processor Module", which contains the parallel optical IO channels and the processor array, and the "Optical Interconnection Module", a free-space imaging system that integrates both the optics and the SLM.

## 2.4. System Integration

Since one of the most notable features of free-space optical interconnections is the high spatial density of their communication channels, all optical and electronic devices must ultimately be integrated into a single chip. Although OCULAR-II currently consists of several discrete electronic modules, they can potentially be integrated into one chip by introducing bonding technologies such as flip-chip bonding,<sup>4</sup> direct bonding,<sup>5</sup> or polyimide bonding.<sup>6</sup>

In addition, for the realization of optical interconnects and to decrease the inter-chip beam propagation time, the compactness of the optical module is a critical issue. OCULAR-II uses a specially designed lens system that is compact and provides reconfigurability.

## 2.5. Algorithms Featuring the Hierarchical Structure

The variety of physical layer technologies available in OCULAR-II is one of the features that distinguishes it from conventional computer systems: both optical and electronic interconnections are used. In addition, every processor module can have an independent instruction stream and can communicate with other processors through the optical interconnections.

To obtain optimized performance from this system, computing tasks must be appropriately distributed over every part of the system according to the capability of the physical layer technologies. In addition, an application should be solved using every resource available in the system where needed. This means that appropriate algorithms for optoelectronic pipelined processors are necessary to fully utilize the structural and physical characteristics of the system.

# 3. PROCESSOR MODULE

As described in the previous section, since an essential feature of inter-chip free-space optical interconnects is the bandwidth per unit area of an LSI chip, the optical devices and the Si CMOS processor should be fabricated on the same die. The processor module of OCULAR-II, however, consists of three parts that emulate the functionality and structure of an ideal integrated OE-VLSI chip: a VCSEL module for optical output, a PD array module for optical input, and a processing element (PE) array module. The processor module as a whole has a modular structure, which allows the overall system to be extended by combining it with the Optical Interconnection Module described in Section 4.

## 3.1. Processing Part

Simple logical and arithmetic operations are performed electronically in OCULAR-II using advanced VLSI technologies, so general-purpose processing is possible. The PE array in OCULAR-II is configured as a general-purpose $8 \times 8$ single-instruction multiple-data (SIMD) array, where each PE contains a 24-bit memory, a programmable arithmetic logic unit (ALU) for performing bit-serial operations, electronic connections to the neighboring PEs, an optical data input, and an optical data output. The instructions listed in Table 1 can be performed. The first two operands give the read addresses of the data to be operated on, and the last operand gives the write address where the result is stored. The instruction set treats communication with neighboring PEs, optical input, and optical output as memory-mapped IO operations, which reduces the number of transistors needed on the LSI die and makes programs easy to write. By issuing sequences of instructions, arbitrary arithmetic and logical operations can be executed.

There is a tradeoff between the functionality of a PE and the total number of PEs on a chip, since the number of transistors available on a chip is limited. Since OCULAR-II is aimed at performing a variety of algorithms and applications, the architecture must have a general-purpose programmable PE, which potentially requires a large number of transistors and therefore a large area on the LSI chip. However, to obtain massively parallel processing capability, each PE must be small enough to match the pitch of integrated optical devices such as VCSELs, which is currently 125 $\mu$m or 250 $\mu$m and may become smaller in the future; this requirement conflicts with functional generality. The PE used in OCULAR-II is designed to be compact, to provide massive parallelism, while keeping the generality of processing.

Moving from architectural issues to implementation, the $8 \times 8$ PE array is implemented on a field-programmable gate array (FPGA), an Actel A32200DX with 20,000 gates operating at 225 MHz. The PEs of OCULAR-II use 48.76% of the logic gates available on this FPGA chip.

| Instruction format           | OP CODE, Operand 1, Operand 2, Operand 3                                |
|------------------------------|-------------------------------------------------------------------------|
| OP CODE                      | AND, OR, XOR, ADD                                                       |
| Operand 1, 2 (read address)  | UP, DOWN, LEFT, RIGHT (4-neighborhood), ZERO, ONE, Optical Input        |
| Operand 3 (write address)    | NEIGHBOR, NUL, Optical Output                                           |

Table 1. Instruction set for the processing elements
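The three-operand, memory-mapped instruction format of Table 1 can be emulated in a few lines. The sketch below is illustrative only: the class name `PEArray`, the operand spellings `OPT_IN` and `OPT_OUT`, the use of integers for local memory addresses, the per-PE carry bit used by ADD, the zero read at the array edge, and the read-then-write ordering are all assumptions, not details taken from the OCULAR-II hardware.

```python
N = 8  # the PE array is 8 x 8
OFFSETS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def grid():
    return [[0] * N for _ in range(N)]


class PEArray:
    def __init__(self):
        self.mem = [[[0] * 24 for _ in range(N)] for _ in range(N)]  # 24 one-bit cells per PE
        self.neighbor = grid()     # bit each PE exposes to its neighbours (assumed)
        self.carry = grid()        # carry bit for bit-serial ADD (assumed)
        self.optical_in = grid()
        self.optical_out = grid()

    def _read(self, src, i, j):
        if src == "ZERO":
            return 0
        if src == "ONE":
            return 1
        if src == "OPT_IN":
            return self.optical_in[i][j]
        if src in OFFSETS:
            di, dj = OFFSETS[src]
            ni, nj = i + di, j + dj
            inside = 0 <= ni < N and 0 <= nj < N
            return self.neighbor[ni][nj] if inside else 0  # array edge reads 0
        return self.mem[i][j][src]  # integer: local memory address

    def step(self, op, src1, src2, dst):
        """Broadcast one instruction; all PEs read first, then all write."""
        pending = []
        for i in range(N):
            for j in range(N):
                a, b = self._read(src1, i, j), self._read(src2, i, j)
                carry = self.carry[i][j]
                if op == "AND":
                    result = a & b
                elif op == "OR":
                    result = a | b
                elif op == "XOR":
                    result = a ^ b
                elif op == "ADD":  # one-bit full adder
                    result = a ^ b ^ carry
                    carry = (a & b) | (carry & (a ^ b))
                else:
                    raise ValueError(op)
                pending.append((i, j, result, carry))
        for i, j, result, carry in pending:
            self.carry[i][j] = carry
            if dst == "NEIGHBOR":
                self.neighbor[i][j] = result
            elif dst == "OPT_OUT":
                self.optical_out[i][j] = result
            elif dst != "NUL":
                self.mem[i][j][dst] = result


pe = PEArray()
pe.optical_in = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(N)]  # left half illuminated
pe.step("OR", "OPT_IN", "ZERO", 0)     # store the input bit in memory cell 0
pe.step("OR", 0, "ZERO", "NEIGHBOR")   # expose it to the neighbours
pe.step("XOR", 0, "LEFT", "OPT_OUT")   # 1 where a PE differs from its left neighbour
print(pe.optical_out[0])
```

Three broadcast instructions are enough for a simple horizontal edge detection, which shows how sequences of the four operations combine into larger procedures.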

## 3.2. Optoelectronic Interface

VCSELs emitting at a wavelength of 850 nm, fabricated by MODE Corp., are used as the light-emitting devices. General-purpose line drivers from Pericom Corp. are used to drive the VCSELs. These components are implemented on a printed circuit board (PCB), called the "VCSEL module", which is connected to the processing module by compact electrical connectors in the current demonstrator system.

An $8 \times 8$ Si photodetector (PD) array is used for the optical input channels. Each PD is connected to a transimpedance amplifier with thresholding functionality, which is monolithically integrated as shown in Fig. 2. As for the requirements on the PD and its circuitry, it has been pointed out that they must be compact and have low power dissipation in order to be applicable to smart-pixel applications.<sup>7</sup> The PD used in OCULAR-II is our first trial toward a future high-speed, compact photo sensor with integrated functionality, intended to confirm the basic circuit design and the parameters of the semiconductor fabrication. The gain of the transimpedance amplifier is 150 k$\Omega$ and the power consumption is around 20 mW per channel. The sensitivity is 0.1 A/W at the operating wavelength of the VCSEL. The PD chip is mounted on a PCB with some other electronic circuits, called the "PD module", which is connected to the PE module.
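As a rough consistency check on the receiver figures quoted above, the following sketch combines the stated sensitivity, transimpedance gain, and per-channel power. It assumes a linear amplifier response below the threshold stage; the 10 $\mu$W input level is a hypothetical value chosen only for illustration.

```python
RESPONSIVITY_A_PER_W = 0.1     # PD sensitivity at the VCSEL wavelength (from the text)
TRANSIMPEDANCE_OHM = 150e3     # amplifier gain (from the text)
POWER_PER_CHANNEL_W = 20e-3    # power consumption per channel (from the text)
CHANNELS = 8 * 8               # 8 x 8 PD array


def output_voltage(optical_power_w):
    """Amplifier output for a given optical input, assuming linear response."""
    photocurrent_a = optical_power_w * RESPONSIVITY_A_PER_W
    return photocurrent_a * TRANSIMPEDANCE_OHM


conversion_gain_v_per_w = output_voltage(1.0)      # volts per watt of light
array_power_w = CHANNELS * POWER_PER_CHANNEL_W     # whole receiver array
example_v = output_voltage(10e-6)                  # hypothetical 10 uW input

print(f"conversion gain: {conversion_gain_v_per_w / 1e3:.0f} V/mW")
print(f"array power:     {array_power_w:.2f} W")
print(f"10 uW input ->   {example_v * 1e3:.0f} mV")
```

Under these assumptions each microwatt of detected light produces about 15 mV before thresholding, and the 64-channel array dissipates about 1.28 W.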

# 4. OPTICAL INTERCONNECTION MODULE

The Optical Interconnection Module, which is one of the most important features of OCULAR-II, is an imaging optical system with a spatial light modulator (SLM) that provides reconfigurability of the interconnection topology between the Processor Modules. It is designed to be compact and modular. The PAL-SLM (Parallel-Aligned Nematic Liquid Crystal Spatial Light Modulator) is a phase-modulating spatial light modulator working in reflection. The PAL-SLM is addressed by a compact liquid crystal display (LCD) panel illuminated by a visible laser diode. In this way, computer-generated holograms (CGHs) that specify the interconnection topology between processor layers are written onto it. The block diagram and a photograph of the Optical Interconnection Module are shown in Fig. 3 and Fig. 4, respectively. Optical beams from the VCSEL array are deflected by the prism toward the Fourier transform (FT) lens. The collimated beams are diffracted at the surface of the PAL-SLM according to the CGH pattern written on it. The diffracted optical beams pass back through the FT lens, are deflected by the other side of the prism, and are imaged onto the PD array, which is placed on the opposite side from the VCSEL array.

The PAL-SLM has a 20 mm × 20 mm active area and its resolution is 20 lp/mm (at 50% MTF). The LCD is a 1.3-inch panel (LCX012BL, Sony) with 640 × 480 pixels at a pitch of 44.4 $\mu$m. A SELFOC planar micro lens array (PML) was used in the former Optical Interconnection Module of OCULAR-II<sup>8</sup> to couple the LCD to the SLM with an imaging distance of 32 mm. However, low diffraction efficiencies were measured due to the poor spatial resolution of the PML.

In the new system, the PML is replaced by a fiber optic plate (FOP). It has a thickness of 5 mm, each fiber has a diameter of 3 $\mu$m, and its numerical aperture is 1.0.<sup>9</sup> The total length between the LCD and the PAL-SLM is 15 mm, which is greatly reduced compared with the former PML-based system.

It is technologically impossible to totally eliminate the diffraction effects caused by the pixelated structure of the LCD and the FOP. These effects can be greatly reduced if a special lens system is employed<sup>10</sup> so that the spatial frequency of the pixel structure is not transferred from the LCD to the PAL-SLM. In the present system, however, because of the diameter of each fiber and the MTF property of the PAL-SLM itself, for which the diffraction efficiency at high spatial frequencies is low, the total loss due to the pixelated structure when the FOP is used is less than 10%. This means that one of the principal features of the PAL-SLM, its non-pixelated structure, is successfully exploited with the use of the FOP.

Figure 5 shows the diffraction efficiency of the PAL-SLM when the PML and the FOP are used as the coupling method, respectively.
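The spatial-frequency argument above can be made concrete with simple arithmetic on the quoted device parameters. The sketch assumes that the FOP relays the LCD image to the PAL-SLM at unit magnification and takes one cycle per pixel pitch (or per fiber diameter) as the fundamental frequency of each periodic structure; both are simplifying assumptions made here for illustration.

```python
LCD_PITCH_UM = 44.4          # LCD pixel pitch (from the text)
LCD_PIXELS = (640, 480)      # LCD resolution (from the text)
FIBER_DIAMETER_UM = 3.0      # FOP fiber diameter (from the text)
SLM_MTF50_LP_PER_MM = 20.0   # PAL-SLM resolution at 50% MTF (from the text)
SLM_ACTIVE_MM = 20.0         # PAL-SLM active area side (from the text)

# Fundamental spatial frequency of each periodic structure, in cycles/mm.
lcd_pixel_freq = 1000.0 / LCD_PITCH_UM
fiber_freq = 1000.0 / FIBER_DIAMETER_UM

# LCD extent and the number of LCD pixels spanning the SLM active area,
# assuming 1:1 image transfer through the FOP.
lcd_size_mm = tuple(n * LCD_PITCH_UM / 1000.0 for n in LCD_PIXELS)
pixels_across_slm = int(SLM_ACTIVE_MM * 1000.0 // LCD_PITCH_UM)

print(f"LCD pixel grid: {lcd_pixel_freq:.1f} cycles/mm")
print(f"FOP fiber grid: {fiber_freq:.0f} cycles/mm")
print(f"SLM 50% MTF:    {SLM_MTF50_LP_PER_MM:.0f} lp/mm")
print(f"LCD pixels across SLM active area: {pixels_across_slm}")
```

Under these assumptions both periodic structures lie at or above the spatial frequency where the PAL-SLM response has already fallen to 50%, which is consistent with the statement that the pixel structure contributes only a small loss.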

The focal length of the FT lens can be varied between 360 mm and 440 mm to tolerate a 10% disturbance of the system parameters due to temperature fluctuation or other factors. The FT lens consists of four elements arranged in a telescope configuration, which shortens the working distance of the lens to between 155 mm and 200 mm. The reflection efficiency at the surface of the PAL-SLM is 99.9%; the surface was optimized for the operating wavelength of the VCSEL.

The optical path is folded several times by mirrors and a prism to make the optical module compact. The FT lens is shared by the input beam traveling to the surface of the PAL-SLM and the output beam reflected from the PAL-SLM.

In addition, instead of using a half-mirror or polarized beam splitter for the readout of the PAL-SLM, a prism has been used. The loss of the light should be greatly reduced in principle since the diffraction efficiency of the PAL-SLM



Figure 3. Block diagram of the optical interconnection module.




Since the VCSEL currently used in OCULAR-II has a full divergence angle of 10 degrees, most of the input beam was lost at the aperture of the FT lens: only 3.6% of the input light was detected at the surface of the SLM. The transmission of the optical beam after reflection at the surface of the SLM was 60%. To improve the efficiency, a micro lens array is planned to be added in front of the VCSEL array to reduce the beam divergence angle and thus increase the transmission of the optical system.
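The two measured transmission figures can be combined into an end-to-end estimate. The sketch assumes that the two stages are sequential and independent, so that their transmissions multiply; this is an assumption made for illustration, not a measurement reported for OCULAR-II.

```python
import math

# Measured stage transmissions quoted in the text.
STAGES = {
    "VCSEL to SLM surface": 0.036,
    "after reflection at SLM": 0.60,
}


def total_transmission(stages):
    """Product of stage transmissions, assuming independent sequential stages."""
    total = 1.0
    for value in stages.values():
        total *= value
    return total


end_to_end = total_transmission(STAGES)
loss_db = -10.0 * math.log10(end_to_end)

print(f"end-to-end transmission: {end_to_end * 100:.2f} %")
print(f"equivalent loss:         {loss_db:.1f} dB")
```

Because the first stage dominates the loss, reducing the beam divergence with a micro lens array addresses the largest term in this budget.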

## 5.2. Algorithms for Optically Interconnected Pipelined Systems

Algorithm design should be reconsidered for optoelectronic systems so that the aggregate bandwidth given by free-space optical interconnects and the processing versatility given by electronics are fully exploited. The pipeline processing of the system is shown schematically in Fig. 9. Each node in the network represents an intermediate step of the pipeline processing, and each link between nodes indicates the latency required between the two states. The total procedure is represented as a combination of the processing time at each processor module and the communication latency between modules. The optimum algorithm is the one that gives the minimum latency between the starting point, depicted as "Start" in Fig. 9, and the final step, labeled "Results" in the same figure.
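Finding the minimum-latency route through such a network is a shortest-path problem. The sketch below uses Dijkstra's algorithm, which is one standard way to solve it and is not claimed to be the method used by the authors; the node names and latency values are hypothetical and serve only to illustrate the idea.

```python
import heapq


def min_latency(graph, start, goal):
    """Return (total latency, path) of the cheapest route, or None if unreachable."""
    queue = [(0, start, [start])]
    settled = {}
    while queue:
        latency, node, path = heapq.heappop(queue)
        if node == goal:
            return latency, path
        if node in settled and settled[node] <= latency:
            continue
        settled[node] = latency
        for nxt, cost in graph[node].items():
            heapq.heappush(queue, (latency + cost, nxt, path + [nxt]))
    return None


# Hypothetical pipeline: either finish everything on layer 1, or pass the data
# optically to layer 2 and finish there. Weights are latencies in arbitrary units.
pipeline = {
    "Start": {"layer1_compute": 4, "layer1_pass": 1},
    "layer1_compute": {"Results": 2},
    "layer1_pass": {"layer2_compute": 3},
    "layer2_compute": {"Results": 1},
    "Results": {},
}

latency, path = min_latency(pipeline, "Start", "Results")
print(latency, path)
```

In this toy network, forwarding the data to the second layer is faster than computing everything on the first, which is the kind of trade-off between processing time and communication latency that the pipeline graph is meant to capture.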

### 5.2.1. Example (a): Matrix Vector Products

Let $\mathbf{A} = \{a_{ij}\}$ be an $N \times N$ matrix and $\mathbf{x}$ be an $N$-dimensional vector. The matrix-vector product $\mathbf{y} = \mathbf{A}\mathbf{x}$ can be performed within a two-layer system if the matrix $\mathbf{A}$ is located on the second PE layer and the vector $\mathbf{x}$ is transferred from the first layer, where uploading and initial computation are performed. This provides efficient processing performance since the communication required to provide $x_i$ to the second layer is performed simultaneously by using the inter-layer multi-cast optical interconnection, as shown in Fig. 10(a). Figure 10(b) shows the stream of instructions given to every processor layer. If more layers were used, the processing time would decrease because the computations originally assigned to the second layer could be distributed over the successive layers.

Figure 9. Pipelined algorithm for multi-layer system.

Figure 10. Matrix computation algorithm.
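The two-layer matrix-vector product described above can be sketched as follows. The data layout is an assumption made for illustration: element $x_j$ is multicast to every PE in column $j$ of the second layer, PE $(i, j)$ holds $a_{ij}$, and row sums are accumulated through nearest-neighbor reads. Integer multiplication stands in for what the bit-serial PEs would perform over many instructions.

```python
def multicast_columns(x):
    """Optical multicast (assumed layout): x[j] reaches every PE in column j."""
    n = len(x)
    return [[x[j] for j in range(n)] for _ in range(n)]


def layer2_matvec(a, received):
    """Second-layer computation: local products, then row sums via neighbour shifts."""
    n = len(a)
    products = [[a[i][j] * received[i][j] for j in range(n)] for i in range(n)]
    acc = [row[:] for row in products]
    shifted = [row[:] for row in products]
    for _ in range(n - 1):
        # Every PE reads the value held by its RIGHT neighbour (edge reads 0).
        shifted = [row[1:] + [0] for row in shifted]
        acc = [[acc[i][j] + shifted[i][j] for j in range(n)] for i in range(n)]
    return [acc[i][0] for i in range(n)]  # column 0 now holds y_i


A = [[1, 2], [3, 4]]
x = [5, 6]
y = layer2_matvec(A, multicast_columns(x))
print(y)
```

The multicast delivers all $N^2$ operands in a single optical transfer, so the remaining cost on the second layer is one local product and $N - 1$ neighbor accumulation steps.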

### 5.2.2. Example (b): Quad-tree network

A quad-tree structure is a type of network in which four components in a layer are connected to one element in the upper layer, so that all processing nodes in the system are connected in a tree structure. Figure 11(a) shows the idea of embedding a quad-tree structure into a multi-layer processing system. If the array size of the optoelectronic VLSI at each layer is $N \times N$, the quad-tree is realized by using $\log_2 N$ layers. The two kinds of interconnection pattern required are 1-to-4 fanout interconnections, which are shown schematically in Fig. 11(a). Figure 11(b) shows the CGH pattern to be imaged on the PAL-SLM in the Optical Interconnection Module and the corresponding interconnection that was obtained experimentally.

A quad-tree structure gives an efficient computation procedure for obtaining information that depends on the whole area of the processing layer, owing to the global connections between processing nodes in different layers. In addition, although the bandwidth between the PE layers and the external host computer is limited, this limitation can be compensated for by distributing a given application over several layers of processing modules, so that the total amount of data that must be transferred from a layer to the host computer matches the capability of the physical layers.
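The layer count of the quad-tree can be illustrated with a small reduction sketch. It is a simplification: each upper layer is modeled as a smaller array holding only the active tree nodes, whereas every physical layer in the system is $N \times N$, and the function names and the choice of summation as the combining operation are assumptions made for illustration.

```python
def quadtree_layer(values, combine):
    """Combine each 2 x 2 block of a layer into one node of the next layer."""
    half = len(values) // 2
    return [
        [
            combine(
                values[2 * i][2 * j],
                values[2 * i][2 * j + 1],
                values[2 * i + 1][2 * j],
                values[2 * i + 1][2 * j + 1],
            )
            for j in range(half)
        ]
        for i in range(half)
    ]


def quadtree_reduce(values, combine):
    """Apply quad-tree layers until one node remains; return (result, layers used)."""
    layers = 0
    while len(values) > 1:
        values = quadtree_layer(values, combine)
        layers += 1
    return values[0][0], layers


N = 8
ones = [[1] * N for _ in range(N)]
total, layers_used = quadtree_reduce(ones, lambda a, b, c, d: a + b + c + d)
print(total, layers_used)
```

For $N = 8$ the global result is available after three layers, in agreement with $\log_2 N$, and only a single value has to be transferred to the host instead of the full array.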

# 6. CONCLUSION

An optically interconnected parallel processing system, OCULAR-II, has been described. Its architectural issues, components, a two-layer demonstrator system, and algorithms were discussed. Although the processing layers are separate components in OCULAR-II, they have been designed to be modular. This is one of the core features for realizing pipelined, expandable architectures, and the components can potentially be fabricated as a single chip by recent integration technologies.

OCULAR-II was investigated in detail. Several new issues should be considered for its further development, such as alignment<sup>11</sup> between modules and the control architecture of a multi-layered PE array. Several killer applications and algorithms, such as database management,<sup>12</sup> are under investigation.

Figure 11. Quad-tree network in a pipelined architecture: (a) quad-tree embedded in a pipelined system. (b) Optical interconnection network between layers.

# REFERENCES

1. J. W. Goodman, F. J. Leonberger, S.-Y. Kung, and R. A. Athale, "Optical interconnections for VLSI systems", Proc. IEEE, Vol. 72, pp. 850–865, 1984.
2. M. Ishikawa and N. McArdle, "Optically Interconnected Parallel Computing Systems", IEEE Computer, Vol. 31, pp. 61–68, 1998.
3. M. Ishikawa, K. Ogawa, T. Komuro, and I. Ishii, "A CMOS Vision Chip with SIMD Processing Element Array for 1ms Image Processing", Dig. Tech. Papers of 1999 IEEE Int. Solid-State Circuits Conf. (ISSCC'99), pp. 206–207, 1999.
4. A. V. Krishnamoorthy, L. M. F. Chirovsky, W. S. Hobson, R. E. Leibenguth, S. P. Hui, G. J. Zydzik, K. W. Goossen, J. D. Wynn, B. J. Tseng, J. Lopata, J. A. Walker, J. E. Cunningham, and L. A. D'Asaro, "Vertical-cavity surface-emitting lasers flip-chip bonded to gigabit-per-second CMOS circuits", IEEE Photonics Technology Letters, Vol. 1, pp. 128–130, 1999.
5. H. Wada, T. Takamori, and T. Kamijoh, "Room-temperature photo-pumped operation of 1.58 $\mu$m vertical-cavity lasers fabricated on Si substrates using wafer bonding", IEEE Photonics Technology Letters, Vol. 11, pp. 1426–1428, 1996.
6. S. Matsuo, K. Tateno, T. Nakahara, H. Tsuda, and T. Kurokawa, "Use of polyimide bonding for hybrid integration of a vertical cavity surface emitting laser on a silicon substrate", Electronics Letters, 13, pp. 1148–1149, 1997.
7. T. K. Woodward, "VLSI-Compatible Smart-Pixel Interface Circuits and Technology", IEEE/LEOS 1996 Summer Topical Meetings, pp. 65, 1996.
8. H. Toyoda, Y. Kobayashi, N. Yoshida, Y. Igasaki, T. Hara, N. McArdle, M. Naruse, and M. Ishikawa, "Compact optical interconnection module for OCULAR-II: a pipelined parallel processor", Technical Digest of Optics in Computing '99, 1999.
9. Y. Kobayashi, Y. Igasaki, N. Yoshida, N. Fukuchi, H. Toyoda, T. Hara, and M. H. Wu, "Compact High-efficiency Electrically-addressable Phase-only Spatial Light Modulator", Proc. SPIE, Vol. 3951, pp. 158–165, 2000.
10. Y. Igasaki, F. Li, N. Yoshida, H. Toyoda, T. Inoue, N. Mukohzaka, Y. Kobayashi, and T. Hara, "High Efficiency Electrically-Addressable Phase-Only Spatial Light Modulator", Optical Review, Vol. 6, No. 4, pp. 339–344, 1999.
11. M. Naruse and M. Ishikawa, "Analysis and Characterization of Alignment for Free-Space Optical Interconnects Based on Singular-Value Decomposition", Applied Optics, Vol. 39, No. 2, pp. 293–301, 2000.
12. D. Kawamata, M. Naruse, I. Ishii, and M. Ishikawa, "Image Database Construction and Search Algorithm for Smart Pixel Optoelectronic Systems", to appear in Optics in Computing 2000.