Dedicated Optoelectronic Stochastic Parallel Processor for Real-Time Image Processing: Motion-Detection Demonstration and Design of a Hybrid Complementary-Metal-Oxide Semiconductor-Self-Electro-Optic-Device-Based Prototype

Alvaro Cassinelli, Pierre Chavel, Marc P. Y. Desmulliez

HAL Id: hal-00878984
https://hal-iogs.archives-ouvertes.fr/hal-00878984
Submitted on 31 Oct 2013


We report experimental results and performance analysis of a dedicated optoelectronic processor that implements stochastic optimization-based image-processing tasks in real time. We first show experimental results using a proof-of-principle-prototype demonstrator based on standard silicon–complementary-metal-oxide-semiconductor (CMOS) technology and liquid-crystal spatial light modulators. We then elaborate on the advantages of using a hybrid CMOS–self-electro-optic-device-based smart-pixel array to monolithically integrate photodetectors and modulators on the same chip, providing compact, high-bandwidth intrachip optoelectronic interconnects. We have modeled the operation of the monolithic processor, clearly showing system-performance improvement. © 2001 Optical Society of America

OCIS codes: 200.0200, 200.4650, 250.3140, 030.6140, 100.0100, 330.4150.

1. Introduction

Many early- and middle-vision tasks such as restoration, image segmentation, halftoning, or even motion detection can be formulated as optimization problems consisting of finding the ground states of an energy (or cost) function. Stochastic optimization by simulated annealing yields some of the best results because it is not generally constrained to quadratic or even convex cost functions. However, when implemented on conventional sequential workstations, the computational loads are too extensive for practical purposes, so that most of the proposed artificial vision systems rely—one way or another—on deterministic optimization algorithms that are at least 100 or 200 times faster than stochastic algorithms but often lead to results further from optimality.

One can point out two different sources for such a prohibitive computational load. One obvious source is the sequential processing of pixels in a two-dimensional (2-D) image; one way of alleviating this burden is to process pixels in parallel, with dedicated hardware based on optical scale parallelism [by this expression we refer to parallel image processors consisting of one processing element (PE) per pixel]. This is possible, provided that some rules are observed. Briefly stated, two pixels that interact with each other should not evolve at the same time; this rule has led us to the concept of semishift invariance and coloring in optoelectronic parallel processors (see Fig. 1). In the case of low-level image-processing tasks, the state of each PE of the array evolves as a function of a short-range shift-invariant interconnection pattern; therefore stochastic (as well as deterministic) algorithms can advantageously be implemented on single-instruction-multiple-data machines with a high degree of parallelism. In this context, optical convolution represents an interesting alternative to intrachip electronic shift-invariant interconnects because it avoids electronic layout congestion and allows easy reconfiguration of the interconnect topology. The principles of such a mainly digital optical processor architecture, combining electronic nonlinear processing and optical linear convolution, have been studied extensively in the literature under various names, including symbolic substitution, mathematical morphology, and cellular automata.

Whereas sequential updating of pixels is a common source of computational load shared with any deterministic algorithm, stochastic optimization algorithms suffer from the second burden of having to generate good quality, uncorrelated random numbers for every PE in the array; moreover, proper operation of the simulated annealing needs the generation to be repeated thousands of times per image under processing and even more for complex tasks. In the parallel version of the algorithm, huge amounts of (time and space uncorrelated) random numbers need to be generated to feed all the PEs of the array at the same time. Electronic (digital) random-number generators are not only time and chip-area consuming: They also suffer from the inherent pseudorandom nature of the generated sequence.  

On the basis of extensive previous experience, we selected differential detection of speckle as a reliable, quick, parallel, and easy way to control the generation of random numbers.

We have therefore been studying optoelectronic single-instruction-multiple-data machines that are able to minimize low-level image-energy functions at a video rate by use of the well-known simulated annealing algorithm. These devices, which we will refer to from now on as optoelectronic stochastic parallel processors (OSPPs) rely on three ingredients: (a) an optical convolution setup in charge of the reconfigurable, shift-invariant interconnection pattern; (b) an optical random-number array generator; and (c) an optoelectronic smart-pixel array (SPA) to implement the required nonlinear operation. So far, only inputs, not outputs, to our SPA have been optical, and convolution has been achieved either electronically (thanks to a hard-wired four-nearest-neighbor interconnect) or with the help of an optical convolution setup composed of an interchangeable Dammann grating and a spatial-light-modulator (SLM) chip interfaced to the SPA through a—relatively—slow electronic bus; indeed, as we will show, monolithic convolution of the data using a SPA having both optical input and optical output ports would account for an enormous increase in the overall performance of the OSPP system.

The paper is organized as follows. In Section 2 we review our present setup, which is, to our knowledge, the first complete reconfigurable (nonmonolithic) OSPP prototype dedicated to low-level image processing. This proof-of-principle prototype is based on a silicon complementary-metal-oxide-semiconductor (CMOS) SPA chip. Motion detection at a video rate in a sequence of gray-level images is demonstrated. In Section 3 the flip-chip bonding technique is considered for building a hybrid Si/GaAs SPA, and figures of merit of this novel OSPP are derived as a function of the optoelectronic transceiver characteristics.

2. Present Demonstrator

The underlying principles of the OSPP are described extensively elsewhere. We report here on our latest experimental illustration, namely, the demonstration of video-rate simulated annealing for motion detection in a sequence of noisy gray-level images. The purposes of this demonstrator based on a Si-CMOS SPA are twofold: First, it provides concrete optomechanical hardware that should help us draw realistic figures of merit while we test novel technologies for the SPA processor; second, it constitutes a proof of the good prospect of using stochastic-based algorithms at the core of more complex image-processing systems without compromising the possibility of real-time processing, at least up to the video rate.

A. Motion-Detection Algorithm

From the point of view adopted in this study, motion-detection algorithms and systems provide an output consisting of a binary map e, called the label field, indicating motion or absence of it at every pixel of the input (noise-corrupted) gray-level image. There is evidence that motion detection can be achieved with a statistical regularization approach based on spatio-temporal Markov random fields. Far from being of only academic interest, motion detection is therefore an interesting demonstration candidate for our video-rate processor.

Specifically, we started from a class of motion-detection algorithms first proposed by Lalande and Bouthemy. The original algorithm performs an elementary preprocessing of the sequence to prepare two observed (input) data fields. First, a time gradient $o(s, t)$ is obtained by differentiating a gray-level image in time, where $s$ is a 2-D pixel coordinate and $t$ is the discrete time. Then a binary version $\hat{o}(s, t)$ of the time gradient is calculated simply by comparison with a predefined threshold. The energy function to be minimized [as a function of the 2-D label field $e(s, t)$ under estimation] incorporates both input fields as well as the previous state of the label field $e(s, t - dt)$.

In our setup the host computer is in charge of all the preprocessing. Although we started from the Lalande and Bouthemy algorithm, we found that it was necessary to introduce some simplifications to it before parallel optoelectronic implementation could be envisaged. Specifically, the need for two time-gradient input data fields $o(s, t)$ and $\hat{o}(s, t)$ was a problem. However, we concluded from simulations that it was worthwhile to investigate a model in which only the binary (thresholded) version $\hat{o}$ of the image gradient is used as input to the algorithm. The whole idea relies on the assumption that the state of motion is likely to change only if temporal changes are notable—this is the case for a pixel located on the edge of a moving object. Otherwise, we would prefer the label to retain the value held in the near past. Spatial regularization potentials are also defined in such a way as to favor homogeneous regions of motion. A schematic representation of the underlying principles of the simplified algorithm is shown in Fig. 2. It is appropriate here to mention the difference between motion detection and the mere subtraction of successive frames in a time sequence: A nonzero difference between two consecutive images may indicate motion, but it may as well be due to time-variable noise or a change in illumination conditions; in contrast, a null difference may be the result of subtracting consecutive pixel values at two different locations in the body of a uniformly lit moving object (see Fig. 3). The purpose of motion detection is to give a complete estimate of all pixels in the scene that are moving at a given time and follow their motion in terms of compact, but deformable, moving objects.
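The host-side preprocessing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `binary_time_gradient`, the use of the absolute frame difference as the time gradient, and the threshold of 20 gray levels are hypothetical choices that do not come from the original setup.

```python
def binary_time_gradient(prev_frame, curr_frame, threshold):
    """Return (o, o_hat) for two gray-level frames given as lists of rows.

    o     : absolute difference between consecutive frames (assumed form
            of the time gradient)
    o_hat : binary map, 1 where o exceeds the threshold, 0 elsewhere
    """
    o = [[abs(c - p) for p, c in zip(prev_row, curr_row)]
         for prev_row, curr_row in zip(prev_frame, curr_frame)]
    o_hat = [[1 if value > threshold else 0 for value in row] for row in o]
    return o, o_hat


# Toy 3 x 3 example: a bright pixel (220) moves one step to the right
# over a uniform background (180).
prev_frame = [[180, 180, 180],
              [180, 220, 180],
              [180, 180, 180]]
curr_frame = [[180, 180, 180],
              [180, 180, 220],
              [180, 180, 180]]

o, o_hat = binary_time_gradient(prev_frame, curr_frame, threshold=20)
for row in o_hat:
    print(row)
```

In this toy case only the pixel that the object leaves and the pixel that it enters are flagged, which illustrates why the binary field marks the edges of a moving object rather than its whole body.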

As expected, estimation of the current label field involves only neighborhood computations: The local energy gradient with respect to the label variable depends only on a short-range neighborhood around the pixel under consideration. Various strategies have been proposed for the order of site visiting (at random, by sequential scanning, or with a confidence ordering list to accelerate the relaxation process as in Ref. 12). However, as explained above, the parallel updating of sites happens to be the fastest strategy because the neighborhood size reduces to the four or eight nearest neighbors. Also, whereas many deterministic optimization algorithms require that much attention be paid to the label-field initialization procedure (the ultimate quality of the result strongly depends on it$^{13}$), stochastic-based algorithms completely eliminate such a constraint. In short, all a PE has to do is

(a) Calculate the local force (i.e., the local energy gradient of a particular site $s$ and time $t$ with respect to the label variable), which combines one regularization term $F_S$ and a second (conditional) constraint-to-the-data term $F_C$:
\[ F(s, t) = \underbrace{\beta_S \sum_{r \in N(s)} [2\mathbf{e}(r, t) - 1]}_{F_S} + \underbrace{\beta_C [2\mathbf{e}(s, t - 1) - 1][2\mathbf{d}(s, t) - 1]}_{F_C}. \] (1)

![Fig. 2. Underlying principles of the simplified motion-detection algorithm. The heavy circles represent two consecutive locations of a moving object. Motion information is exclusively provided to the OSPP through the observed binary field $\hat{o}(t)$.](image2.png)

![Fig. 3. Test of the simplified motion-detection algorithm implemented on the (nonmonolithic) optically interconnected OSPP prototype. A Dammann grating is in charge of an eight-nearest-neighbor optical-interconnection pattern. This first prototype relies on a low-resolution, 24 × 24-pixel Si-CMOS chip (SPIE600). Gray levels, on a scale of 0 through 255: background = 180; object = 220. Gaussian noise, $\sigma = 10$. Object speed, (1, ±2) pixel/frame. Size: image, 24 × 24 pixels; moving object, 6 × 5 pixels.](image3.png)

In Eq. 1, $N(s)$ represents the neighborhood of pixel $s$ and $\beta_S$ and $\beta_C$ are empirically determined constants. The term $F_S(t)$ can be seen as a regularization field, resulting from a convolution between the current field $\mathbf{e}(t)$ and a short-range (equally weighted) convolution kernel. The term $F_C(s, t)$ behaves as a conditional constraint to the label-field data held in the near past (memory effect), provided that no significant change has been detected in the particular site [i.e., the variable $\mathbf{d}(s, t)$ is zero]; otherwise, it tends to flip the label state, as explained above (Fig. 2). Our simulations have shown proper operation of the algorithm, provided that the ratio $\lambda = \beta_C/\beta_S$ lies between 0.4 and 1.2.

(b) Update its label $\mathbf{e}(s, t)$ according to the sigmoid probability law:
\[ \text{Pr}[\mathbf{e}(s, t) = 1] = \frac{1}{1 + \exp[-F(s, t)/T]}. \] (2)

Updating is performed by simply thresholding on the local force $F(s, t)$ by a standard electronic comparator; the nature of such a deterministic operation is turned probabilistic thanks to a laser speckle generator that projects a time-varying speckle light over two photodetectors attached to both positive and negative inputs of the comparator (the decision threshold is therefore randomly shaken). Characterization of these stochastic comparators (or random-number generators) is given in detail in Ref. 14.
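Operations (a) and (b) can be illustrated with a short software sketch. It is not the hardware implementation: a pseudorandom generator stands in for the speckle-driven comparator, the values `beta_s = 1.0` and `beta_c = 0.8` are hypothetical (chosen so that their ratio falls inside the 0.4 to 1.2 range quoted above), neighbors falling outside the array are simply ignored, and the previous label is read at the site itself. The constraint term is coded exactly as written in Eq. (1).

```python
import math
import random

# Offsets of the four nearest neighbors (an eight-neighbor list would add diagonals).
FOUR_NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def local_force(e, e_prev, d, site, beta_s=1.0, beta_c=0.8,
                neighborhood=FOUR_NEIGHBORS):
    """Local force of Eq. (1) at one site; fields are lists of rows of 0/1."""
    i, j = site
    rows, cols = len(e), len(e[0])
    # Regularization term F_S: equally weighted sum over the neighborhood.
    f_s = 0.0
    for di, dj in neighborhood:
        r, c = i + di, j + dj
        if 0 <= r < rows and 0 <= c < cols:  # assumption: ignore off-array sites
            f_s += 2 * e[r][c] - 1
    f_s *= beta_s
    # Constraint-to-the-data term F_C, written as in Eq. (1).
    f_c = beta_c * (2 * e_prev[i][j] - 1) * (2 * d[i][j] - 1)
    return f_s + f_c


def update_probability(force, temperature):
    """Sigmoid law of Eq. (2), evaluated in a numerically stable way (T > 0)."""
    x = force / temperature
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def stochastic_update(force, temperature, rng=random):
    """Draw the new label: 1 with the probability of Eq. (2), else 0."""
    return 1 if rng.random() < update_probability(force, temperature) else 0


e = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
e_prev = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
d = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
force = local_force(e, e_prev, d, (1, 1))
print(force, update_probability(force, temperature=1.0))
```

At high temperature the update probability stays close to 1/2 whatever the force, whereas at low temperature the rule tends toward a deterministic threshold on the sign of $F(s, t)$.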

Thanks to operations (a) and (b), the updating of a particular PE state is achieved. Updating can be done in parallel on pixels that do not directly interact with each other. Maximal sets of pixels that do not interact are said to belong to the same color domain, and color domains are to be considered sequentially. As a result, the entire field $\mathbf{e}(t)$ is eventually updated. It should be recalled here that two neighboring (i.e., interconnected) processor cells never have the same color; therefore as the size of the neighborhood increases, so does the number of color domains (see Fig. 1). The process iterates following the annealing schedule by decreasing the speckle laser power that represents the (algorithmic) temperature ($T$) of the annealing. With this simple algorithm, one serious simulated annealing operation typically consists of 1000 updating cycles of the whole field $\mathbf{e}(t)$.
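The coloring rule can be checked with a small sketch. The two tilings below (a checkerboard for two colors and a repeated 2 × 2 tile for four colors) are assumed patterns chosen for illustration; the function names are hypothetical.

```python
FOUR_NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT_NEIGHBORS = FOUR_NEIGHBORS + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def color_of(i, j, n_colors):
    """Color index of site (i, j) for a 2- or 4-color tiling (assumed patterns)."""
    if n_colors == 2:
        return (i + j) % 2            # checkerboard
    if n_colors == 4:
        return 2 * (i % 2) + (j % 2)  # 2 x 2 tile repeated over the array
    raise ValueError("only 2 or 4 colors are handled in this sketch")


def is_valid_coloring(rows, cols, n_colors, neighborhood):
    """True if no site shares its color with any of its neighbors."""
    for i in range(rows):
        for j in range(cols):
            for di, dj in neighborhood:
                r, c = i + di, j + dj
                if 0 <= r < rows and 0 <= c < cols:
                    if color_of(i, j, n_colors) == color_of(r, c, n_colors):
                        return False
    return True


print(is_valid_coloring(24, 24, 2, FOUR_NEIGHBORS))   # two colors, 4 neighbors
print(is_valid_coloring(24, 24, 4, EIGHT_NEIGHBORS))  # four colors, 8 neighbors
print(is_valid_coloring(24, 24, 2, EIGHT_NEIGHBORS))  # two colors, 8 neighbors
```

The first two checks succeed, whereas the third fails because diagonal neighbors share a checkerboard color; this is the sense in which a wider neighborhood requires more color domains.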

B. Smart-Pixel Array Hardware

Because of the general availability and maturity of silicon technology, in an earlier collaborative project with Institut d’Électronique Fondamentale, Centre National de la Recherche Scientifique, Université de Paris-Sud, we developed a CMOS SPA, named SPIE600, for the optoelectronic simulation of an Ising spin array. It consists of $24 \times 24$ identical processor cells, each with two phototransistors serving as optical inputs for the speckle light, which is used to compute the updating probability of Eq. (2) by differential detection of speckle. Of course, because light emission by silicon is far from being well mastered, there is no optical output. The optical inputs, however, are sufficient for the Ising spin problem. Although the optical inputs may indeed be used for the neighborhood convolution as explained in the Introduction, a provision for electronic bipolar interaction with the four nearest neighbors has been implemented. That study, published earlier, demonstrated a video-rate, massively parallel implementation of an Ising spin array.

It appeared that the same chip could be used for demonstrating some low-level image tasks. For instance, the chip could be used for noise cleaning on binary images—a problem of only academic interest though—by simply setting all interaction coefficients of the Ising chip to unity and then optically inputting the binary image to be processed (dual rail encoded as explained in Ref. 4) onto the photodetectors on the chip. This is possible because each PE is provided with two optical inputs. This was actually done as the first full experimental demonstration of an OSPP for image processing (noise cleaning in binary images) at the video rate, exploiting the concept of semi-shift invariance by coloring. In the present study, optical convolution and electronic feedback have been added to the OSPP to broaden the interconnection neighborhood, even though the task demonstrated is motion detection. This will be explained in some detail below.

C. System Architecture

There is a fundamental interest in studying and evaluating the performances of an optically interconnected prototype, and the noise-cleaning OSPP demonstrator is naturally suitable for this kind of experiment. Indeed, optical inputs to the PEs can be used as channels for communicating between each other, provided that their outputs are made somehow optically available. As explained in Section 1, optical convolution is an interesting solution for the interconnection hardware because all contemplated applications of the OSPP (noise cleaning, halftoning, motion detection, etc.) need a shift-invariant interconnection pattern. Moreover, as the interconnection topology depends on each particular application, an easy way of reconfiguring it is also needed. Optical convolution also provides extension to larger convolution kernels, compromising neither the electronic complexity of each PE nor the density of the array.

With respect to the noise-cleaning demonstrator mentioned above, two major improvements were required. (1) Two SLMs were added to the system: one for optically generating the constraint-to-the-data (precalculated) binary field $F_C(t)$ of Eq. (1), and

A four-nearest-neighbor-interconnected array is divided into two different color domains (see Fig. 1). Because our images are 24 × 24 pixels, the level of parallelism is 24 × 24/2 = 288 in the case of the four-nearest-neighbor-interconnecting hologram and 24 × 24/4 = 144 in the case of the wider eight-nearest-neighbor-interconnection pattern. For a fixed interconnection pattern, the level of parallelism scales with the number of pixels and is restricted only by the cost of dedicated chip design and fabrication. The SPIE600 chip was designed in 1994; with presently available technology, it would be possible to design a new version of the chip with many more pixels, perhaps several hundred on a side, and the processor operation would be exactly as fast.
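The scaling of the level of parallelism with array size can be made explicit with a one-line calculation. In this minimal sketch the function name is hypothetical, and the 256-pixel side is only an assumed example of a larger chip, not a figure from the text.

```python
def level_of_parallelism(side, n_colors):
    """Number of PEs updated simultaneously in a side x side array."""
    return (side * side) // n_colors


for side in (24, 256):          # 256 is a hypothetical larger array
    for n_colors in (2, 4):     # four- and eight-nearest-neighbor cases
        print(side, n_colors, level_of_parallelism(side, n_colors))
```

Because all PEs of one color domain switch at once, the time per updating cycle depends on the number of color domains and not on the array size.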

Fig. 5. View of the complete setup that uses the SPIE600 chip. It includes a CCD camera for alignment purposes and continuous monitoring of system operation. Approximate dimensions are 35 cm × 21 cm × 14 cm (the random-number-generator hardware is not shown in the picture).

3. Design of a Hybrid CMOS-SEED-Based OSPP Prototype

If video-rate stochastic processing is to be achieved along with good results in terms of quality of the optimization (that is, at least 1000 updating cycles per annealing), then our optically interconnected SPA has to be provided with optoelectronic feedback that works at rates higher than 25 × 1000 × $N_C$ loops/s (where $N_C$ represents the number of color domains in the processor). Our current OSPP demonstrator uses a binary ferroelectric liquid-crystal SLM chip for feedback and convolution; even if we neglect the time needed for reading and properly formatting the data flowing through the electronic loop (controlled by a 100-MHz Pentium host computer), the SLM full-frame rate is less than 2 kHz. This is more than 20 times below the minimum required rate in the simplest case of two-color operation of the processor array, so that barely 50 updating cycles could in principle be achieved at the video rate, a less than satisfactory situation. As a matter of fact, placing sources and detectors in different chips is an unsatisfactory solution because it leads to an electrical bottleneck when a full 2-D binary data field is routed from one chip to another. This is indeed what limits the SLM full-frame rate. We thus contemplated the design of a new version of the OSPP, relying on a single SPA in which all PEs would be electronically isolated from each other (except for the clock and probably some additional control signals) and would include in situ high-bandwidth optoelectronic transducers.
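The feedback-rate requirement stated above is simple arithmetic, sketched here for the two-color case. The function names are hypothetical, and the SLM rate is set to the 2-kHz upper bound quoted in the text, so the result is an optimistic estimate.

```python
def required_feedback_rate(n_colors, frame_rate=25, cycles_per_annealing=1000):
    """Minimum feedback-loop rate (loops/s) for one annealing per video frame."""
    return frame_rate * cycles_per_annealing * n_colors


def cycles_per_frame(feedback_rate, n_colors, frame_rate=25):
    """Updating cycles of the whole field that fit into one video frame."""
    return feedback_rate / (frame_rate * n_colors)


slm_rate = 2_000                       # Hz, upper bound quoted for the SLM
needed = required_feedback_rate(2)     # two color domains
shortfall = needed / slm_rate
cycles = cycles_per_frame(slm_rate, 2)
print(needed, shortfall, cycles)
```

With these inputs the loop would need 50,000 loops/s, the SLM falls short by a factor of 25, and about 40 updating cycles fit in a frame, which is the same order of magnitude as the few tens of cycles mentioned above.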

A. Optoelectronic-VLSI Chip Architecture

It is appropriate at this stage to review the operation of an optically interconnected version of the OSPP. Each PE of the SPA would require the following features:

1. At present, a pair of photodetectors to implement the probabilistic updating by differential detection of speckle. These inputs are also used to calculate the local force $F_S(s, t)$ by analog (intensity) addition of the neighborhood output optical signals. It is interesting to note that differential encoding allows robust operation under low-contrast input data conditions, which is generally the case when the signal is provided by some kind of modulator-based source;

2. A comparator (point nonlinearity) in charge of the thresholding operation that sets every PE into its new state;

3. A one-bit memory, defining the current state of the PE;

4. A dual-rail optoelectronic output to which the current state of the PE is directed so that this state can be read by the neighborhood PEs; these outputs can be either sources [light-emitting diodes (LEDs) or vertical-cavity surface-emitting lasers] or modulators such as p-i-n diodes with a multiple-quantum-well (MQW) intrinsic region.

The basic PE design then corresponds to a differential receiver enhanced with an optoelectronic transmitter device (known in the literature as a differential transceiver or two-beam transceiver).\(^9\) Recently there has been significant progress in the development of these receivers for data-communication and information-processing purposes, and many publications describe well-characterized and functional optoelectronic transceivers.\(^{18,19}\) The efforts in designing our elementary processor are then considerably reduced.

To make the coloring of the SPA easily reconfigurable, we can add additional features:

5. A three-bit memory to encode the corresponding PE color (five colors are enough for the twelve-nearest-neighbor-interconnected SPA) along with a binary counter, so that only one common clock signal (ticking the binary counter) would be needed to control the sequential updating of color domains (this signal may be provided optically as well). Also, the whole optical architecture of the OSPP demonstrator can be greatly simplified (saving one SLM and its corresponding optical arm) with a final improvement;

6. An additional one-bit memory for storing the constraint-to-the-data term $F_C(s, t)$ during the whole simulated annealing process.

Sequential loading of color numbers as well as the precalculated binary constraint-to-the-data field $F_C(t)$ can be done in parallel over the rows of the array before the simulated annealing iterations are started. The technique has been tested on the present chip SPIE600: A full loading of the chip registers can be done in less than 3 $\mu$s, which would hardly affect the 40 ms needed to perform simulated annealing at the video rate. The schematic of a single PE with one dual-rail optical input and one dual-rail optical output is shown in Fig. 6. The optical architecture of the optically interconnected OSPP is shown in Fig. 7. A holographic array illuminator creates an array of reading beams to be projected onto the PE modulators. Thanks to a cube beam splitter and a computer-generated hologram, the resulting reflected signals (from the modulators in the on state) are convolved and instantaneously back projected to the photodiodes of the SPA (a process that we referred to earlier as monolithic feedback). Relative weighting of the contributions $F_S$ and $F_C$ to the local force of Eq. (1) is done by adjustment of the power of the optical source with respect to a fixed electrical input to the chip (the “constraint term $\beta_C$” input line of Fig. 6).
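The color register and shared counter of feature 5 can be mimicked in software as follows. This is a sketch under assumptions: the twelve-neighbor pattern is taken to be the four nearest, the four diagonal, and the four distance-two sites, and the coloring rule `(i + 2 * j) % 5` is a hypothetical choice that happens to satisfy the no-conflict rule; neither is specified in the text.

```python
# Assumed 12-neighbor pattern: 4 nearest, 4 diagonal, and 4 at distance two.
TWELVE_NEIGHBORS = [(-1, 0), (1, 0), (0, -1), (0, 1),
                    (-1, -1), (-1, 1), (1, -1), (1, 1),
                    (-2, 0), (2, 0), (0, -2), (0, 2)]
N_COLORS = 5


def pe_color(i, j):
    """Hypothetical content of the three-bit color register of PE (i, j)."""
    return (i + 2 * j) % N_COLORS


def active_sites(rows, cols, tick):
    """Sites whose color matches the shared counter at a given clock tick."""
    counter = tick % N_COLORS
    return {(i, j) for i in range(rows) for j in range(cols)
            if pe_color(i, j) == counter}


def neighbors_never_update_together(rows, cols):
    """Check that no two interconnected PEs are active on the same tick."""
    for tick in range(N_COLORS):
        active = active_sites(rows, cols, tick)
        for i, j in active:
            for di, dj in TWELVE_NEIGHBORS:
                if (i + di, j + dj) in active:
                    return False
    return True


print(neighbors_never_update_together(24, 24))
# Every color index must fit in a three-bit register.
print(max(pe_color(i, j) for i in range(24) for j in range(24)) < 2 ** 3)
```

Under these assumptions five colors indeed suffice: each PE compares its stored color with the common counter, so a single clock line sequences all color domains.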

B. Optoelectronic Transducers
In comparing different input–output devices, one has to consider several competing parameters such as power consumption and dissipation, sensitivity, response time, and driver layout area. Better optoelectronic transducers not only lead to improved performance of the whole system but also potentially broaden the range of applications for a given optical power (higher sensitivity allows wider convolution kernels without additional optical power requirement). Liquid-crystal modulator-based smart pixels are limited to response times in the microsecond regime; as far as response time is concerned, there is a gap between these devices and semiconductor receiver–transmitter-based smart pixels (typically hundreds of megahertz). Although the performance of the latter may seem to exceed our needs (response times of approximately 0.1 $\mu$s would just be sufficient), we are obliged to consider semiconductor devices because liquid-crystal microsecond regimes are unacceptable. Table 1 summarizes other possible technologies, along with the corresponding maximum (estimated) simulated annealing rates, strictly on the basis of response time; Subsection 3.C gives a much more detailed investigation of one of the technologies. Our selection of the technology was based on data directly available to us on two types of SPA:

- GaAs photothyristor array chips;
- hybrid CMOS–self-electro-optic-effect-device (SEED) chips, i.e., GaAs MQW-SEED$^{20}$ devices flip-chip bonded to standard Si-CMOS circuits.

1. Photothyristor Pairs
At first sight, an array of GaAs differential photothyristor pairs$^{21}$ seems to fulfill all the requirements (i.e., differential detection and one-bit memory, thanks to optical bistability) without the help of any additional electronics.$^{22}$ High sensitivities of 2.6 fJ at 720 nm have been reported, and optical cascaded transmission between arrays has been demonstrated at rates as fast as 50 MHz.$^{23}$ We were fortunate to be able to test one integrated circuit composed of 32 strongly positively doped–weakly negatively doped–weakly positively doped–strongly negatively doped photothyristor pairs made available to us by IMEC.$^{24}$ The matrix was laid out in a checkerboard fashion, with one separate voltage supply for each interlaced matrix (bipartitioning of the processor is hard wired, as in SPIE600, and is therefore automatically ready for two-color operation). Unfortunately, our measurements on this particular device indicated poor overall performance: The average detection threshold turned out to be close to 1.2 pJ at 783 nm (far from the femtojoule regime), and the average total power emitted per site (at 860 nm) was approximately 31 nW. Our photothyristor array has three major drawbacks: (a) a significant mismatch between pairs (four elements were useless in our circuit); (b) low sensitivity (approximately 100 times lower) at its own emission wavelength, severely compromising the all-optical interconnection issue; and (c) large divergence (LED-Lambertian source), which leads to cross talk and to insertion loss and imposes the use of additional imaging optics to concentrate the emitted energy (a microlens array, for instance). It is possible to switch a photothyristor by use of the light emitted by its neighbors, but, even considering null optical losses, the feedback loop would be very slow: $31 \text{ nW}/(100 \times 1.2 \text{ pJ}) = 26 \text{ ms}$ with the devices that we tested. Hence no more than one serious simulated annealing process (composed of 1000 updating cycles) would be possible every 26 s, which does not meet the requirement for a video real-time OSPP demonstrator. We understand from the latest publications on this device (Ref. 22) that better performance was achieved with later devices, but unfortunately, to our knowledge, such devices are no longer being developed.

## Table 1. OSPP Performance As a Function of the SPA Technology

<table>
<thead>
<tr>
<th>Technology</th>
<th>Assumed Response Time</th>
<th>Number of Simulated Annealing cycles per Second ($N_C = 4$)</th>
</tr>
</thead>
<tbody>
<tr>
<td>The photothyristor array that we tested</td>
<td>26 ms</td>
<td>0.01</td>
</tr>
<tr>
<td>SPIE600 (nonmonolithic demonstrator: CMOS chip and liquid-crystal SLM)</td>
<td>400 $\mu$s</td>
<td>0.6</td>
</tr>
<tr>
<td>S-SEED (MQW)</td>
<td>50 $\mu$s</td>
<td>5</td>
</tr>
<tr>
<td>FET-SEED (MQW)</td>
<td>10 ns</td>
<td>25,000</td>
</tr>
<tr>
<td>LED(MQW)-MSM</td>
<td>4 ns</td>
<td>62,500</td>
</tr>
<tr>
<td>CMOS-SEED (MQW)</td>
<td>1.6 ns</td>
<td>150,000</td>
</tr>
<tr>
<td>VCSEL(MQW)/MSM</td>
<td>1.6 ns</td>
<td>150,000</td>
</tr>
</tbody>
</table>

Performance in terms of simulated annealing cycles per second by use of different technological approaches (a single simulated annealing operation represents 1000 updating cycles per color domain; $N_C = 4$ is the number of color domains). There is a gap between electronically enhanced detectors [hybrid CMOS-SEED or field-effect-transistor (FET)-SEED devices] and simple symmetric-SEED$^{20}$ (S-SEED) devices. MSM: metal–semiconductor–metal detector; VCSEL: vertical-cavity surface-emitting laser. The photothyristor device that we tested was a preliminary test device and is believed to be far from the limit of that technology.

2. Hybrid CMOS-SEED Technology

Rather than compete with CMOS technology, hybrid CMOS-SEED technology\(^{25}\) seeks to complement it in the interconnection domain by providing additional high-density, high-bandwidth optical inputs and outputs to the existing electronic circuitry. The performance of the flip-chip-bonded GaAs-MQW modulator–detectors depends strongly on the design of their driving (CMOS) circuit. Operation of a two-beam, differential receiver–transmitter circuit at a rate of 620 Mbit/s and a sensitivity of 30 fJ per beam by use of transimpedance front-end amplifiers has been reported.\(^{18}\) An almost 100% device yield for large arrays (68 × 68) has also been demonstrated.\(^{19}\) Operating such devices at normal incidence with collimated light simplifies the imaging system (divergence is due only to diffraction). In short, the CMOS-SEED technology currently available offers fairly good performance while considerably reducing the effort in the design of our PE because well-characterized and tested standard transceiver cells are available from many university research groups.
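The annealing rates listed in Table 1 follow from the response times alone, as the following sketch shows. It assumes, as in the table caption, 1000 updating cycles per annealing and $N_C = 4$, and it takes one response time per color-domain update; the function and dictionary names are hypothetical. The last lines compare the 1.6-ns table entry with the bit period of a 620-Mbit/s link, a correspondence that is only inferred here.

```python
def annealings_per_second(response_time_s, n_colors=4, cycles_per_annealing=1000):
    """Simulated annealing operations per second for a given response time."""
    return 1.0 / (response_time_s * n_colors * cycles_per_annealing)


# Response times (in seconds) copied from Table 1.
RESPONSE_TIMES = {
    "photothyristor array (tested)": 26e-3,
    "SPIE600 + liquid-crystal SLM": 400e-6,
    "S-SEED": 50e-6,
    "FET-SEED": 10e-9,
    "LED-MSM": 4e-9,
    "CMOS-SEED": 1.6e-9,
    "VCSEL/MSM": 1.6e-9,
}
VIDEO_RATE = 25.0  # annealings per second needed for video-rate operation

for name, t in RESPONSE_TIMES.items():
    rate = annealings_per_second(t)
    print(f"{name:32s} {rate:12.3g} /s  video rate: {rate >= VIDEO_RATE}")

# Bit period of a 620-Mbit/s transceiver, for comparison with 1.6 ns.
bit_period_ns = 1e9 / 620e6
print(round(bit_period_ns, 2), "ns")
```

The computed values agree with the table to within its rounding (for instance, 156,250 against the listed 150,000 for CMOS-SEED) and show that only the electronically enhanced devices exceed the 25 annealings per second needed at the video rate.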

Subsection 3.C is devoted to establishing a theoretical upper bound for the frequency of operation of an OSPP demonstrator that uses a CMOS-SEED SPA device. Two different receiver front-end amplifier circuits will be considered: transimpedance front-end amplifiers and charge-sense amplifiers. The methodology used in the present study is strongly inspired by earlier works of one of the authors.\(^{26}\)

C. Optical and Electronic Power Budget

The purpose of this section is to derive an upper bound for the operation frequency of the monolithic OSPP demonstrator (i.e., its feedback rate), depending on the total optical and electrical power available on the system and the maximum thermal dissipation of the SPA.

1. Optical Power Limitations

We consider a quasi-cw mode of operation of the external optical laser source (the reading laser) and assume a maximum power of 1 W. The photodetectors used to convert the optical signal into current or voltage variations are p-i(MQW)-n photodiodes with an average responsivity of at least \(S \approx 0.5\) A/W. The voltage output from the electronic circuitry drives the modulator into a high-reflectivity level \(R_{ON}\) (typically 70%) or a low-reflectivity level \(R_{OFF} = R_{ON}/C\), where \(C\) is the contrast ratio of the modulator (typically \(C = 2:1\)).

To estimate the overall system performance (i.e., the maximum frequency of operation of the feedback loop), we have to determine the lossiest optical path from the source to the detectors of the SPA. Optical losses have been extrapolated on the realistic basis of our previous prototype demonstrator described in Section 2. Maximal losses occur on the route going from the laser source of total power \(P_{tot}\) to a particular modulator, e.g., the plus modulator of PE(\(j\)),

\[
p_m(j) = \eta_{S-PE} P_{tot},
\]

and then from there to the plus photodiode of PE(\(i\)) (a neighboring PE, see Fig. 8), leading, if PE(\(j\)) was in the ON state, to the optical power

\[
p_{d,ON}(i) = \eta_{PE-PE} R_{ON}\, p_m(j),
\]

and, if PE(\(j\)) was in the OFF state, to

\[
p_{d,OFF}(i) = \frac{p_{d,ON}(i)}{C} = \eta_{PE-PE} \frac{R_{ON}}{C}\, p_m(j).
\]

The efficiency of the path from the laser source to a particular PE modulator, \(\eta_{S-PE}\), depends on the efficiency and the fan-out of the array illuminator. We make a safe assumption of an efficiency of 40% for the array illuminator in the case of a Dammann grating (and perhaps as much as 70% with Talbot holograms\(^{35}\)). We then have \(\eta_{S-PE} = 40\%/(2 \times N_{PE})\), where \(N_{PE}\) is the number of PEs in the SPA and the factor 2 accounts for the two photodiodes per PE. The fan-out efficiency of the interconnecting hologram, \(\eta_{PE-PE}\), depends on the neighborhood under consideration. The two-level-phase Dammann grating used in our previous demonstrator has shown 16%, 9%, and 7% efficiency per beam in the case of a 4-, 8-, and 12-nearest-neighbor-interconnection pattern, respectively (leading to a total efficiency of 64%, 72%, and 84%, respectively). The minimum optical power that has to be detected by PE(\(i\)) (e.g., when a single neighboring PE flips its state) is given by

\[
p_{min} = 2(p_{d,ON} - p_{d,OFF}) = 2p_{d,ON} \left(1 - \frac{1}{C}\right)
\]

\[
= 2\eta_{S-PE}\,\eta_{PE-PE} R_{ON} \left(1 - \frac{1}{C}\right) P_{tot}.
\]

For a total optical power \(P_{tot} = 1\) W and a state-of-the-art 32 × 16-pixel array, \(p_{min}\) ranges from 52 µW in the case of a simple mesh-interconnected (4-nearest-neighbor) array, through 29 µW in the case of an 8-nearest-neighbor interconnection, to 24 µW in the case of a wider 12-nearest-neighbor-interconnect topology.
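As a rough numerical check of this link budget, the following Python sketch evaluates \(p_{min}\) from the quantities defined in the text. The function and parameter names are illustrative only, and the defaults (40% array-illuminator efficiency, \(R_{ON} = 0.7\), \(C = 2\)) are the typical figures mentioned above, not necessarily the exact values behind the quoted powers. With these defaults the sketch gives about 44, 25, and 19 µW for the 4-, 8-, and 12-neighbor patterns, which is the same order as, but somewhat below, the values quoted in the text.

```python
def p_min(p_tot, n_pe, eta_fanout, eta_illum=0.40, r_on=0.70, contrast=2.0):
    """Minimum optical power change (W) to be detected by one PE.

    p_tot      -- total laser power in watts
    n_pe       -- number of PEs in the SPA
    eta_fanout -- per-beam efficiency of the interconnecting hologram
    eta_illum, r_on, contrast -- typical values from the text (assumed defaults)
    """
    # Source-to-modulator efficiency: two photodiodes per PE share the light.
    eta_s_pe = eta_illum / (2 * n_pe)
    # Factor 2: differential detection sees the ON->OFF change on both beams.
    return 2 * eta_s_pe * eta_fanout * r_on * (1 - 1 / contrast) * p_tot


n_pe = 32 * 16
for neighbors, eta in ((4, 0.16), (8, 0.09), (12, 0.07)):
    print(f"{neighbors:2d} neighbors: p_min = {p_min(1.0, n_pe, eta) * 1e6:.1f} uW")
```

Because \(p_{min}\) scales linearly with \(P_{tot}\) and inversely with \(N_{PE}\), doubling the array size at fixed laser power halves the power available per detection event.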

The optoelectronic conversion time \( T_{IN} \) (from optical signal to logic-level voltage) depends on the front-end amplifier in use. We have considered here a diode-clamped receiver front-end amplifier. In this case, the energy carried by the optical signal of power \( p_{min} \) has to be sufficient to produce a small voltage swing \( \Delta V_{in} \) about the threshold voltage of the input transistor of the diode-clamped receiver (see Fig. 9).

We consider here a diode-clamped single-stage voltage amplifier with zero output conductance as in Ref. 28. To be detected, a pulse of duration \( \tau_o \) and power \( p_{min} \) must satisfy the following condition,

\[
p_{min} \tau_o \geq \Delta E_{min} = \frac{1}{S} C_{in} \Delta V_{in},
\]

where \( C_{in} \) is the total input capacitance formed by the detectors, the bonding pad, and the input transistor gate \((C_{in} \approx 300 \text{ fF})\). If relation (7) is satisfied, then the sum of the conversion and amplification times is given by\(^{29}\)

\[
T_{IN} = T_{conv} + T_{amp} = \frac{3}{4} \frac{C_{in} \Delta V_{in}}{S\, p_{min}} + 2 C_{out} \frac{\Delta V_{log}}{g_m \Delta V_{in}},
\]

where \( C_{out} \) is the output capacitance of the amplification stage, \( g_m \) is its (constant) transconductance (\(\approx 10^{-3}\) siemens), and \( \Delta V_{log} \) is the typical logic-level voltage swing \((\approx 1 \text{ V})\).
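To get a feel for the orders of magnitude, the sketch below evaluates the detection energy and the two contributions to \( T_{IN} \), reading the expression above as a conversion term \(\tfrac{3}{4} C_{in}\Delta V_{in}/(S\,p_{min})\) plus an amplification term \(2 C_{out}\Delta V_{log}/(g_m \Delta V_{in})\). The values of \( C_{in} \), \( S \), \( g_m \), and \( \Delta V_{log} \) come from the text; the input swing `dv_in` (0.19 V) and output capacitance `c_out` (50 fF) are hypothetical placeholders chosen only for illustration, and the function name is likewise an assumption.

```python
def receiver_times(p_min, c_in=300e-15, dv_in=0.19, s=0.5,
                   c_out=50e-15, dv_log=1.0, g_m=1e-3):
    """Return (delta_e_min [J], t_conv [s], t_amp [s]) for a diode-clamped receiver.

    dv_in and c_out are hypothetical illustration values, not from the source.
    """
    # Minimum optical energy needed to swing the input node by dv_in.
    delta_e_min = c_in * dv_in / s
    # Optical-to-voltage conversion time (first term of T_IN).
    t_conv = 0.75 * delta_e_min / p_min
    # Amplification time up to a logic-level swing (second term of T_IN).
    t_amp = 2 * c_out * dv_log / (g_m * dv_in)
    return delta_e_min, t_conv, t_amp


de, t_conv, t_amp = receiver_times(52e-6)
print(f"dE_min = {de * 1e15:.0f} fJ, T_conv = {t_conv * 1e9:.2f} ns, "
      f"T_amp = {t_amp * 1e9:.2f} ns")
```

Only the conversion term depends on the optical power; the amplification term is fixed by the circuit, which is why the clock frequency saturates at high laser power.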

In the quasi-cw mode of operation (the duration of the reading pulse is \( \tau_o = T_{conv} \)) and during the updating cycle of 1 PE, the output modulators are driven to a high- or low-reflectivity state prior to the illumination by the reading beams (see the chronogram in Fig. 10), so that the output switching time is simply given by

\[
T_{OUT} = 2 \frac{C_{ex} V_o}{\Delta I_{trans}},
\]

where \( V_o \) is the voltage applied to the modulator (8V), \( C_{ex} \) is the output capacitance of the smart pixel, and \( \Delta I_{trans} \) is the current flowing to the output capacitance. Typically, \( T_{OUT} \approx 1 \). The total on-chip processing time \( T_{PE} \) is given by

\[
T_{PE} = T_{IN} + T_{elec} + T_{OUT},
\]

where \( T_{elec} \) is the electronic processing time determined by a simulation by Simulation Packages for Integrated Circuits in Electronics (SPICE) (typically tens of nanoseconds for CMOS technology). The clock period must exceed the total system processing time,

\[
T_{clock} = \frac{1}{F_{clock}} \geq T_{PE} + T_{flight},
\]

where \( T_{flight} \) is the time of flight of the optical signals (of the order of 1 ns in our architecture). The updating rate of each color domain (i.e., the feedback rate) is simply \( F_{C} = F_{clock} / N_{C} \). An upper bound for the frequency of operation is then given by the following relation,

\[
P_{opt} = \frac{3}{8} \frac{\Delta E_{min}}{\eta_{S-PE}\, \eta_{PE-PE}\, R_{ON} \left( 1 - \frac{1}{C} \right)} \left( \frac{1}{N_{C} F_{C}} - \frac{1}{F_{MAX}} \right)^{-1} \leq P_{tot},
\]

where

\[
F_{MAX} = \frac{1}{T_{amp} + T_{elec} + T_{OUT} + T_{flight}} \approx 83 \text{ MHz}
\]

is an upper bound for the clock frequency, independent of the total optical power available on the system and depending exclusively on the electronic circuit and the optical architecture characteristics.
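The bound can be explored numerically by noting that the clock period splits into a power-dependent conversion time and a fixed remainder, \(1/F_{clock} = T_{conv} + 1/F_{MAX}\) with \(T_{conv} = \tfrac{3}{4}\Delta E_{min}/p_{min}\). The Python sketch below is a minimal illustration of that relation and of its inverse; the function names are hypothetical, and the value \(\Delta E_{min} \approx 0.114\) pJ is an assumption (it corresponds to \(\Delta V_{in} \approx 0.19\) V with \(C_{in} \approx 300\) fF and \(S \approx 0.5\) A/W), not a figure given in the text. With the quoted \(p_{min}\) values of 52, 29, and 24 µW it yields roughly 73, 67, and 64 MHz.

```python
def clock_frequency(p_min, delta_e_min, f_max=83e6):
    """Clock frequency (Hz) reachable for a given minimum detected power."""
    t_conv = 0.75 * delta_e_min / p_min      # power-dependent conversion time
    return 1.0 / (t_conv + 1.0 / f_max)      # fixed delays are lumped in 1/f_max


def required_optical_power(f_clock, delta_e_min, link_efficiency, f_max=83e6):
    """Total laser power (W) needed to run at f_clock.

    link_efficiency = eta_S-PE * eta_PE-PE * R_ON * (1 - 1/C), so that
    p_min = 2 * link_efficiency * P_tot.
    """
    if f_clock >= f_max:
        raise ValueError("f_clock must stay below the electronic limit f_max")
    return 0.375 * delta_e_min / link_efficiency / (1.0 / f_clock - 1.0 / f_max)


DELTA_E_MIN = 1.14e-13  # J, assumed value (see lead-in)
for p in (52e-6, 29e-6, 24e-6):
    print(f"p_min = {p * 1e6:.0f} uW -> F_clock = "
          f"{clock_frequency(p, DELTA_E_MIN) / 1e6:.1f} MHz")
```

The required power diverges as the clock frequency approaches \(F_{MAX}\), which is the behavior expected of the curves in Fig. 11.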

2. Electrical Power Budget and Thermal Limitations

We calculated the electrical and thermal budget and found that, for a reasonable heat-removal capability \( Q \) of the array of the order of 10 W/cm\(^2\) and a usual electrical power supply of 5 W, the 32 × 16 SPA under study is limited only by optical power.

D. System Performance

Figure 11 shows the laser power consumption for a 32 × 16 array-based monolithic OSPP as a function of the clock frequency of operation, for three different interconnection patterns. Table 2 summarizes the overall performance of the proposed compact prototype in terms of simulated annealing operations per second. Although the estimate of 150,000 annealing cycles/s (Table 1) was too optimistic, the performance of the new OSPP remains at least 3 orders of magnitude above the video-rate regime. Note that our current speckle-generator configuration (with a rotating diffuser) can generate as many as 100,000 spatiotemporally uncorrelated speckle fields per second, which is enough for serious video-rate simulated annealing but too few if 1000 simulated annealings are required per second. However, it has been shown that with use of an alternative speckle-generator hardware module as described in Ref. 39. We assumed a 5-SPA system with 2 × 32 pixels. These systems are able to process standard-resolution (256 × 256 pixels) images instead of hyper-low-resolution (24 × 24 or 32 × 16 pixels) images, but, as explained before, this is just a different issue, concerning the technologically achievable SPA density. The OSPP is a parallel processor; hence there is no trade-off between SPA density and frequency of operation, which depends solely on the PE optical input–output bandwidth. Figure 12 shows the expected aggregate optical bandwidth (or aggregate capacity) achieved with a 32 × 16 hybrid CMOS-SEED-based OSPP prototype (the aggregate optical bandwidth of the system is defined as the average input–output bits per second flowing within the free-space 2-D optical bus during the annealing process). Figure 12 also draws a comparison between our OSPP prototype and other recent optoelectronic system demonstrators (photonic switching systems, free-space optical backplanes, etc.) that use one or more SPAs. Indeed, aggregate capacity can be seen as a general system-performance indicator, resulting, in the case of the OSPP, from the (feedback rate) × (size of the image) product, divided by the number of color regions of the semishift-invariant single-instruction-multiple-data processor. The feedback rate (or clock frequency $F_{\text{clock}}$ divided by the number of colors $N_C$) corresponds to the individual optical bandwidth of the PE input–output differential channel (which strongly depends on the transducer technology). Note that the connectivity of the system refers to the number of optical input–output channels present on the system.

![Figure 11](image1.png)

**Fig. 11.** Optical power requirement versus clock frequency. Solid curve, 12-nearest-neighbor-optically-interconnected array; dashed curve, 8-nearest-neighbor-optically-interconnected array; dotted curve, 4-nearest-neighbor-optically-interconnected array.

![Figure 12](image2.png)

**Fig. 12.** Optical bandwidth improvement of the OSPP demonstrator and comparison with other SPA-based prototypes (CITR/95,\(^{32}\) HER/98,\(^{33}\) SPARCL/95,\(^{34}\) BS/98,\(^{35}\) AT&T/91,\(^{36}\) and AT&T/93\(^{37}\)). Optical hardware module 2000/2007 represents perspectives for the CMOS-SEED systems, on the basis of foreseeable improvements in SPA hybrid CMOS-SEED technology\(^{38}\) and by use of a compact, rugged optical hardware module as described in Ref. 39.

### Table 2. Hybrid 32 × 16 OSPP Expected Performances

<table>
<thead>
<tr>
<th>Neighborhood</th>
<th>$F_{\text{clock}}$ (MHz, $P_{\text{tot}}$ = 1 W)</th>
<th>CDR (Mbits/s)</th>
<th>S.A./s</th>
</tr>
</thead>
<tbody>
<tr>
<td>4 (OSPP$_4$)</td>
<td>73.3</td>
<td>36.7</td>
<td>36,700</td>
</tr>
<tr>
<td>8 (OSPP$_8$)</td>
<td>66.8</td>
<td>16.7</td>
<td>16,700</td>
</tr>
<tr>
<td>12 (OSPP$_{12}$)</td>
<td>64.1</td>
<td>12.8</td>
<td>12,800</td>
</tr>
</tbody>
</table>

*Expected performances for the CMOS-SEED OSPP with 32 × 16 pixels (CDR, smart-pixel optical channel data rate; S.A., simulated annealing).*
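The columns of Table 2 are related by simple arithmetic: the channel data rate equals the feedback rate \(F_{clock}/N_C\), and one simulated annealing takes 1000 updating cycles per color domain (Table 1 caption). The Python sketch below reproduces the table under these relations. The number of color domains per neighborhood (\(N_C = 2\), 4, 5) is not stated alongside the table; it is inferred here from the ratios of the tabulated values, and the function names are illustrative.

```python
CYCLES_PER_ANNEALING = 1000  # updating cycles per color domain (Table 1 caption)


def feedback_rate_mhz(f_clock_mhz, n_colors):
    """Per-PE optical channel data rate (Mbit/s) = clock rate / color domains."""
    return f_clock_mhz / n_colors


def annealings_per_second(f_clock_mhz, n_colors):
    """Simulated annealing operations per second at the given clock rate."""
    return feedback_rate_mhz(f_clock_mhz, n_colors) * 1e6 / CYCLES_PER_ANNEALING


# (neighborhood, F_clock in MHz from Table 2, inferred N_C)
for neighbors, f_clock, n_c in ((4, 73.3, 2), (8, 66.8, 4), (12, 64.1, 5)):
    print(f"OSPP_{neighbors}: CDR = {feedback_rate_mhz(f_clock, n_c):.1f} Mbit/s, "
          f"S.A./s = {annealings_per_second(f_clock, n_c):,.0f}")
```

Larger neighborhoods are penalized twice: the clock is slower because less light reaches each detector, and more color domains must be updated in sequence.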

Figure 13 shows the improvement in prototype compactness. The system interconnection density refers to the total optical interconnection length (i.e., the collected length, taking into account every optical path within the system) divided by the total volume of the demonstrator. For an optoelectronic system based on free-space 2-D interconnected arrays, there is a theoretical upper bound for the interconnection density, given by $D = L/V = 2\pi/\lambda^2$, where $\lambda$ is the wavelength of light used within the system. For instance, if we consider the former standard 850-nm wavelength for optical communications, the interconnection density is approximately 8700 km/cm$^3$, at least 7 orders of magnitude above that of present demonstrators. Nevertheless, the goal is that the final size of the whole OSPP module be compatible with standard multichip electronic boards, which is likely to be a realistic aim considering that the random-number rotating diffuser need not be bulkier than a conventional motorized heat-removal fan.
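The quoted figure follows directly from $D = 2\pi/\lambda^2$ once the units are converted. A minimal Python check is sketched below; the function name is an illustrative choice.

```python
import math


def max_interconnection_density_km_per_cm3(wavelength_m):
    """Upper bound D = 2*pi / lambda^2, converted from m/m^3 to km/cm^3."""
    d_m_per_m3 = 2 * math.pi / wavelength_m ** 2
    # 1 m = 1e-3 km of length; 1 m^3 = 1e6 cm^3 of volume.
    return d_m_per_m3 * 1e-3 / 1e6


d = max_interconnection_density_km_per_cm3(850e-9)
print(f"D = {d:.0f} km/cm^3")
```

Because the bound scales as $1/\lambda^2$, halving the wavelength would quadruple the achievable interconnection density.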

4. Conclusion
We have described the first reconfigurable optoelectronic stochastic parallel processor (OSPP) dedicated to real-time (low-level) image-processing tasks. Design and testing of a nonmonolithic free-space interconnected prototype (an SLM is required in addition to the silicon SPA) have been completed; we successfully demonstrated application of this system to motion detection in gray-level (synthetic) image sequences. Hardware reconfiguration of the OSPP involves rearrangement of the shift-invariant intra-SPA interconnection topology; this was achieved in our first prototype by simple replacement of the optical filter used within the nonmonolithic convolution setup. In fact, this (bulky) optomechanical hardware simulates intrachip interconnects only through a relatively slow electronic feedback loop, which is responsible for the relatively poor overall system performance. Theoretical modeling shows that the feedback loop can be notably improved by the use of a hybrid CMOS-SEED-based SPA (capable of both optical input and optical output of signals) and a monolithic convolution setup. Indeed, even for the contemplated slower configuration ($N_C = 5$), as many as 12,800 images can be processed per second, outperforming most dedicated electronic image-processing systems, even when these rely on fast deterministic optimization algorithms. We also pointed out that the integration of a complete OSPP module in a conventional electronic board can be considered a realistic goal.

With this first complete OSPP demonstrator we hope to have drawn, within the field of dedicated hardware for real-time video image processing, some of the attention that we think this approach deserves. Theoretical modeling of a hybrid CMOS-SEED-based OSPP should also pave the way for the implementation of more compact and powerful prototypes to come. This study represents one of the possible pathways for the introduction of novel optoelectronic devices into dedicated image-processing systems; nevertheless, application of the OSPP to neural-network-like signal-processing systems could be contemplated if non-shift-invariant and reconfigurable intrachip interconnects were somehow made available for use within the SPA, an issue far beyond the scope of this paper but one that is worth investigating.

Early stages of this study were supported in part by the European Commission under contract ERBCI1*CT93-0004. The main part of the presented study was achieved during the thesis research of A. Cassinelli at the Institut d’Optique. Philippe Lalanne and Donald Prévost actively participated in early stages of this study. The essential role of Eric Belhaire, Francis Devos, Antoine Dupret, Patrick Garda, and Jean-Claude Rodier in designing and testing the SPIE600 chip is gratefully acknowledged. We also thank André Villing for his contribution to the design of the demonstrator electronic driving circuits.

References