Using FPGAS to accelerate the training process of a Gaussian mixture model based spike sorting
system
by Yongming Zhu

A thesis submitted in partial fulfillment of the requirement for the degree of Master of Science in
Electrical Engineering
Montana State University
© Copyright by Yongming Zhu (2003)

Abstract:
The detection and classification of the neural spike activity is an indispensable step in the analysis of
extracellular signal recording. We introduce a spike sorting system based on the Gaussian Mixture
Model (GMM) and show that it can be implemented in Field Programmable Gate Arrays (FPGA). The
Expectation Maximization (EM) algorithm is used to estimate the parameters of the GMM. In order to
handle the high dimensional inputs in our application, a log version of the EM algorithm was
implemented. Since the training process of the EM algorithm is very computationally intensive and
runs slowly on Personal Computers (PC) and even on parallel DSPs, we mapped the EM algorithm to a
Field Programmable Gate Array (FPGA). It trained the models without a significant degradation of the
model integrity when using 18 bit and 30 bit fixed point data. The FPGA implementation using a
Xilinx Virtex II 3000 was 16.4 times faster than a 3.2 GHz Pentium 4. It was 42.5 times faster than a
parallel floating point DSP implementation using 4 SHARC 21160 DSPs. 


USING FPGAS TO ACCELERATE THE TRAINING PROCESS OF A GAUSSIAN 

MIXTURE MODEL BASED SPIKE SORTING SYSTEM

by

Yongming Zhu

A thesis submitted in partial fulfillment 
of the requirement for the degree

of

Master of Science 

in

Electrical Engineering

MONTANA STATE UNIVERSITY 
Bozeman, Montana

November 2003


©COPYRIGHT
by

Yongming Zhu 
2003

All Rights Reserved


ii

APPROVAL 

of a thesis submitted by

Yongming Zhu

This thesis has been read by each member of the thesis committee and has been 
found to be satisfactory regarding content, English usage, format, citations, 
bibliographic style, and consistency, and is ready for submission to the College of 
Graduate Studies.

Dr. Ross K. Snider
(Signature)

/ 2 - / - O J *  
Date

Approved for the Department of Electrical and Computer Engineering

Dr. James Peterson a A-V

(Signature) Date

Approved for the College of Graduate Studies

Dr. Bruce McLedd^ A -A ^ A ^  o Z l  , Z ' / Z  
(Signature) Date


iii

STATEMENT OF PERMISSION TO USE

In presenting this thesis in partial fulfillment of the requirements for a master’s 

degree at Montana State University, I agree that the Library Shall make it available to 

borrowers under rules of the Library.

If I have indicated my intention to copyright this thesis by including a 

copyright notice page, copying is allowable only for scholarly purposes, consistent 

with “fair use” as prescribed in the U.S. Copyright Law. Requests for permission for / 

extended quotation from or reproduction of this thesis in whole or in parts may be 

granted only by the copyright holder.

Signature _ _ _ _ _ _ _
T 7  V


TABLE OF CONTENTS

1. INTRODUCTION....................................................................................................... I

What is Neural Spike Sorting?..................................................................................... I
Spike Sorting System................................................................................................... 3
Gaussian Mixture Model..............................................................................................6
Overview of the Thesis................................................................................................ 7

2. THE GAUSSIAN MIXTURE MODEL AND THE EM ALGORITHM..................9

Introduction.............................................
Mixture Model for Spike Sorting...........

The Gaussian Mixture Model (GMM)
GMMs for spike sorting......................................................................................... 11
Spike sorting using the GMM................................................................................13
The advantage and limitation of the GMM spike sorting system........................15

Maximum Likelihood and the Expectation Maximization Algorithm..................... 16
Maximum Likelihood Estimation...........................................................................17
The Expectation-Maximization (EM) algorithm................................................... 18
Expectation Maximization for the Gaussian Mixture Model............................... 20
Implementation of the EM algorithm.................................................................... 23

3. IMPLEMENTATION IN MATLAB........................................................................ 25

Introduction................................................................................................................ 25
Neural Signal Data from a Cricket............................................................................ 26
Practical Problems Related to the EM algorithm...................................................... 27

Model Number Selection........................................................................................27
InitiaUzation of the EM Algorithm........................................................................ 28
Evaluation of the Algorithm...................................................................................30

Numeric Considerations in Matlab............................................................................30
The Log EM Algorithm............................................. .......;................................... 30
Using A Diagonal Covariance Matrix................................................................... 34

Implementation Result in Matlab.............................................................................. 36
Matlab Performance...................................................................................................38

4. PARALLEL DSP IMPLEMENTATION................................................................. 40

Introduction................................................................................................................ 40
Hardware.................................................................................................................... 40

SHARC ADSP-21160M DSP Processor............................................................... 40
Bittware Hammerhead PCI board and VisualDSP++...........................................41

Analysis of the EM Algorithm...................................................................................43
Parallel Transformation of the EM Algorithm.......................................................... 44
Performance on the Parallel DSP Platform............................................................... 49

iv

\o
 \

o 
\o


V

TABLE OF CONTENTS - CONTINUED

5. FPGAIMPLEMENATION.......................................................................................51

Introduction................................................................................................................ 51
The FPGA and Its Structure.......................................................................................52

Field Programmable Gate Arrays.... ..........................  52
Structure of Xilinx's Virtex IIFPGA.................................................................... 53

AccelFPGA................................................................................................................ 56
The Fixed Point Representation of the EM Algorithm.............................................58
Using Lookup Tables................................................................................................. 59
Parallelism needed for FPGA implementation.......................................................... 65
Synthesis and Simulation...........................................................................................67
FPGA Result Evaluation............................................................................................69
FPGA Performance.............................................................................................   72

6. CONCLUSION AND FUTURE WORK.................................   73

Conclusion.................................................................................................................. 73
Future work................................................................................................................ 74

BIBLIOGRAPHY.......................................................................................................... 76

APPENDICES............................................................................................................... 80

APPENDIX A: SIMPLYFYING THE EXPRESSION OF 6 ( 0 ,0 s ) .................... 81
APPENDIX B: FLOATING POINT MATLAB CODE...........................................85
APPENDIX C: C CODE ON THE PARALLEL DSP PLATFORM...................... 89
APPENDIX D: FIXED-POINT MATLAB CODE USING ACCELFPGA......... 114


LIST OF FIGURES

1-1 This extracellular recording waveform shows different action 
potentials from an unknown number of local neurons. The data 
were recorded from an adult female cricket’s cereal sensory 
system. (Data courtesy of Alex [7])........... .....................................................2

1- 2 A typical neural signal recording, detecting and spike sorting
system................................................................................................................ 3

2- 1 Non-Gaussian density estimation using a GMM........................................ 10

2-2 Illustration of the spike generating process..................................................... 12

2-3 A multi-variant Gaussian model of the spike waveforms generated 
from a particular neuron. The solid line shows the mean vector of 
this model while the dashed line shows the three $  boundary 
according to the covariance matrix. All the waveforms in this 
model are shown in the background...............................................................13

2-4 Block Diagram of the Bayesian’s decision rule used for spike
sorting system. The log probability for a spike being in a cluster is
computed for each individual cluster and the highest scoring
cluster is the identified neuron...:................................................................... 15

2- 5 Diagram of EM algorithm for GMM parameter estimation....................... 23

3- 1 Plot of all the 720 neural spike waveforms..................................................26

3-2 The increase in likelihood due to the increase of cluster numbers.
The amount of the log likelihood increase is shown in the bar plot.
The dashed line shows the threshold value................................................... 28

3-3 Initialization of the EM algorithm................................................................... 29

3-4 Diagram of the revised EM algorithm. The E-step implemented in
the log domain is shown inside the dashed rectangle................................... 33

3-5 Results of three different approach..................................................................35

3-6 Clustering result from Matlab..........................................................................37

3-7 Clustering result for the training data. Five clusters are labeled
using different color....................................................................................... 37

vi

FIGURE PAGE


vii

LIST OF FIGURES - CONTINUED

4-1 Block diagram of Hammerhead PCI system.................................................. 42

FIGURE PAGE

4-2 The diagram shows most computational intensive parts in the EM 
algorithm and their execution percentage in the whole process, (a)

Calculation of 1 , probability of each data point in each 
cluster based on the current parameters, (b) Update of means and

covariance matrices, (c) Calculation of ^ ° l I x^  and ̂ x' I ^  , 
probability of being in each cluster given individual input data 
point and likelihood of each data point in the current Gaussian
Mixture Model, (d) Rest of the algorithm..................................................... 43

4- 3 Diagram of multiple DSP implementation of EM algorithm........................48

5- 1 Virtex II architecture overview [31]............................................................ 54

5-2 Slice Structure of Virtex IIFPGA [31]........................................................... 54

5-3 Top-down design process using AccelFPGA................................................. 57

5-4 Histogram of the input and output data range for the LUT table, (a)
The input, (b) The output........... ................................................................... 62

5-5 Input and output curve of the exponential operation...................................... 63

5-6 Diagram of implementing the LUT in block memory.......... ........................64

5-7 Diagram of the EM algorithm implementation in the FPGA.........................67

5-8 Floorplan of the EM algorithm implementation on a Virtex IIFPGA..........69

5-9 The mean vectors of the GMMs. (a) is the floating point version, (b)
is the output of the FPGA implementation.................................................. 70

5-10 The clustering results of both floating point and fixed point
implementations  ......................................................................................71


LIST OF TABLES

TABLE PAGE

3- 1 Performance of the spike sorting system on several PCs............................38

4- 1 List of semaphores and their functions........................................................ 46

4-2 Execution time for 8 iterations of EM algorithm on single and 4
DSP system.....................................................................................................49

4- 3 Performance of the EM algorithm on DSP system and PCs...................... 49

5- 1 Supported configuration of the block SelectRAM........................................ 55

5-2 The specifications of the Virtex I I XC2V3000 FPGA.............................. .....55

5-3 The error between the original floating point implementation and
the LUT simulations.......................................................................................63

5-4 List of directives can be used in AccelFPGA................................................. 65

5-5 The toolchain used in the FPGA implementation.......................................... 65

5-6 Device utilization summary of the Xilinx Virtex IIFPGA............................ 68

5-7 The post-synthesis timing report......................................................................68

5-8 The difference between the floating point output and FPGA
output..............................................................................................................70

5-9 Confusion matrix of fixed point spike sorting result for 720 nerual 
spikes from 5 neurons. Overall correct rate is 97.25% comparing 
to the floating point result.....................   71

5-10 Performance comparison between all the platforms.....................................72

viii


ABSTRACT

The detection and classification of the neural spike activity is an indispensable 

step in the analysis of extracellular signal recording. We introduce a spike sorting 

system based on the Gaussian Mixture Model (GMM) and show that it can be 

implemented in Field Programmable Gate Arrays (FPGA). The Expectation 

Maximization (EM) algorithm is used to estimate the parameters of the GMM. In 

order to handle the high dimensional inputs in our application, a log version of the 

EM algorithm was implemented. Since the training process of the EM algorithm is 

very computationally intensive and runs slowly on Personal Computers (PC) and even 

on parallel DSPs, we mapped the EM algorithm to a Field Programmable Gate Array 

(FPGA). It trained the models without a significant degradation of the model integrity 

when using 18 bit and 30 bit fixed point data. The FPGA implementation using a 

Xilinx Virtex II 3000 was 16.4 times faster than a 3.2 GHz Pentium 4. It was 42.5 

times faster than a parallel floating point DSP implementation using 4 SHARC 21160 

DSPs.

ix

Z


I

CHAPTER I 

INTRODUCTION 

What is Neural Spike Sorting?

Most neurons communicate with each other by means of short local 

perturbations in the electrical potential across the cell membrane, called action 

potentials or spikes [I]. By using extracellular glass pipettes, single etched (sharp) 

electrodes, or multiple-site probes, scientists have been able to record the electrical 

activity of neurons as they transmit and process information in the nervous system. It 

is widely accepted that information is coded in the firing frequency or firing time of 

action potentials [2,3]. It is further believed that information is also coded in the joint 

activities of large neural ensembles [4]. Therefore, to understand how a neural system 

transmits and processes information, it is critical to simultaneously record from a 

population of neuronal activity as well as to efficiently isolate action potentials arising 

from individual neurons.

Single nerve cell recording is possible by using intracellular electrodes. Glass 

pipettes and high-impedance sharp electrodes are used to penetrate a particular nerve 

cell and monitor the electrical activities directly. This has been used to understand the 

mechanism of action potentials and many other neuronal phenomena [5,6]. However, 

these electrodes are not practical in multi-neuron recording. For intact free-moving 

animals, a small movement of the intracellular electrodes will easily damage the nerve 

cell tissue. Furthermore, it is very difficult to.isolate a large number of neurons in a 

small local region. In awake animals, the isolation sometimes lasts for a very short


2

period of time. Fortunately, since all that we need is the timing of action potential, it 

is possible to use extracellular electrodes to acquire this information. With larger tips 

than intracellular electrodes, extracellular electrodes can simultaneously record 

signals of a small number (3 to 5) of neurons from a local region. Figure I shows the 

waveform comprised of action potentials recorded by one extracellular microelectrode. 

Each voltage spike in the waveform is the result of an action potential from one or 

more neurons near the electrode. The process of identifying and distinguishing these 

spikes that arise from different neurons is called “spike sorting”.

Figure 1-1 This extracellular recording waveform shows different action potentials 
from an unknown number of local neurons. The data were recorded from an adult 
female cricket’s cereal sensory system. (Data courtesy of Alex [7])

Spike sorting provides an alternative to physical isolation for multi-neuron 

recording. In this approach, no effort is made to isolate a single cell; rather the spikes 

due to several cells are recorded simultaneously and sorted into groups according to 

their waveforms. Each group is presumed to represent a single cell since their 

waveforms change as a function of position relative to the electrode. The closer an


3

electrode is to a particular neuron, the greater the amplitude of the signal will be 

compared with other waveforms.

Spike Sorting System

A typical system that measures and analyses extracellular neural signals is shown 

in Figure 1-2. There are four stages between the electrode and the identification of 

spike trains from individual neurons.

electrode

ClassifierFeature
Extractor

Amplifier
and
Filters

Spike
Detector

Figure 1-2 A typical neural signal recording, detecting and spike sorting system.

In the first stage neural signals are picked up by extracellular electrodes and 

amplified and then filtered. Low-pass filters are used to smooth the spike waveforms 

and provide an anti-aliasing filter before sampling. The frequency content of the 

waveform is typically less than 10 KHz. Other methods, such as wavelet denoising [8] 

can also be used to remove the recording noise. The second stage is spike detection. 

This is usually achieved by a simple threshold method, in which spikes are picked up 

when the maximum amplitude is bigger than a manually set threshold value. However, 

other methods including non-linear energy operators [9], wavelet-based detectors [10] 

and slope-shape detectors [8] have been used to improve the detection performance,


4

especially in the low SNR situations. A new approach, in which a noise model is first 

generated and then the distance from this noise model is computed to identify spikes

[7] is used to obtain the spikes for this paper. However, even in this approach a 

threshold must be chosen.

Once the data has been collected, features must be extracted to be used in 

classification: The features can be simple user-defined features such as peak spike 

amplitude, spike width, slope of initial rise, etc. Another widely used method for 

feature extraction is Principle Components Analysis (PCA) [11]. PCA is used to find 

an ordered set of orthogonal basis vectors that can capture the directions in the data of 

largest variation. The first K principal components will describe the most variation of 

the data and can be used as features to classify the spikes. Wavelet decomposition has 

also been used to define spike features in a combined time-frequency domain. Under 

high background noise, another linear transforming method based on entropy 

maximization has been reported to generate good features for classification [12].

People use extracted features instead of the full sampled waveform because of 

the “Curse of Dimensionality”, which means the computational costs and the amount 

of training data needed grow exponentially with the dimension of the problem. In high 

dimensional space, more data will be needed to estimate the model densities. Hence, 

with limited computational ability and training data, high dimensional problems are 

practically impossible to solve, especially in regards to statistical modeling. A 

person’s inability to view high dimensional data is another reason to decrease the data 

dimension to two or three for manual clustering.


5

In this paper, we present a log version of the Expectation Maximization (EM) 

algorithm which solves the numerical problem arising from the high dimensional 

input based on the Gaussian Mixture Model. In this case, we can use the full sampled 

spike waveforms which preserve all the information of the neural spikes, as the 

classifier input.

The last stage is the clustering of spikes. In this stage, spikes with similar 

features are grouped into clusters. Each cluster represents the spikes that arise from a 

single neuron. In most laboratories, this stage is done manually with the help of 

visualization software. All the spikes are plotted in some feature space and the user 

simply draws ellipses or polygons around sets of data which assigns data to each 

cluster. This process is extremely time-consuming and is affected by human bias and 

inconsistencies.

The development of microtechnology enables multi-tip electrode recording, such 

as strereotrode [13] or tetrode [14] recording . By adding spatial information of action 

potentials, multi-electrode recording can improve the accuracy of spike sorting [15]. 

Obviously, the amount of data taken from multi-channel recording can be too 

overwhelming to deal with manually. To solve this problem, some type of automated 

clustering method is required. During the past three decades, various unsupervised 

classification methods have been applied to the spike sorting issue. The applications 

of general unsupervised clustering methods such as K-means, fuzzy C-means, and 

neural network based classification schemes have achieved some success. Methods 

such as Template Matching and Independent Component Analysis have also been 

used. A complete review can be seen in [16].


6

In most cases, the classification is done off-line. There are two reasons: I. Most 

clustering methods need user involvement. 2. Most automated clustering methods are 

very computational intensive and general hardware, such as personal computers or 

work stations, can’t meet the real-time requirement of the data streams. However, 

online spike sorting is needed for many applications of computational neuroscience

[8], Implementing a classification method on high-speed hardware will reduce the 

training time needed to develop sophisticated models, which will allow more time to 

be devoted to data collection. This is important in electrophysiology where limited 

recording times are the norm.

We implement a Gaussian Mixture Model based spike sorting system on both a 

DSP system and a Field Programmable Gate Array (FPGA) platform. The FPGA 

implementation speeds up the system dramatically which will allow an online spike 

sorting .system to be implemenated.

Gaussian Mixture Model

If we view the measured spike signal as the combination of the original action 

potentials with the addition of random background noise, we can use a statistical 

process to model the spike signal generating process. Clustering can then be viewed 

as a model of the statistical distribution of the observed data. The whole data set can 

then be modeled with a mixture model with each cluster being a multivariate 

Gaussian distribution.

Gaussian mixture models have been found very useful in many signal processing 

applications, such as image processing, speech signal processing and pattern 

recognition [17,18]. For the spike sorting problem, a number of studies, based on the


7

Gaussian Mixture Model, have also been tried to provide statistically plausible, 

complete solutions. These studies have shown that better performance can be obtained 

by using a GMM than other general clustering methods, such as K-means [19], fuzzy 

c-means [20] and neural-network based unsupervised classification schemes [21]. 

With some modifications, the GMM based approaches also show promise in solving 

the overlap and bursting problems of spike sorting [22].

The Expectation Mmdmization (EM) algorithm is normally used to estimate the 

parameters of a Gaussian Mixture Model. EM is an iterative algorithm which updates 

mean vectors and covariance matrices of each cluster on each stage. The algorithm is 

very computational intensive and runs slowly on PCs or workstations.

In this paper, we present a log version of the EM algorithm which can handle 

high dimensional inputs. We also implement this revised EM algorithm on three 

different platforms, which include the PC, Parallel DSP and FPGA platforms. By 

comparing the different implementations, we found that, in terms of speed, the FPGA 

implementation gives the best performance.

Overview of the Thesis

The thesis will mainly focus on the following three points: the Gaussian Mixture 

Model, the Expectation Maximum algorithm, and the hardware implementation of the 

EM algorithm. Chapter 2 introduces the Gaussian Mixture Model and how to use the 

EM algorithm to estimate the mixture parameters. In Chapter 3, a log version of the 

EM algorithm and its performance on a PC is presented. The details of the 

implementation of the EM algorithm on a parallel DSP system are described in 

Chapter 4. Chapter 5 describes the parallel implementation of the EM algorithm on a


8

Xilinx Virtex II FPGA and compares the performance of FPGA with the previous 

implementations. Finally, Chapter 6 summarizes the work of this thesis and suggests 

possible future research directions.


9

CHAPTER 2

THE GAUSSIAN MIXTURE MODEL AND THE EM ALGORITHM

Introduction

This chapter introduces the Gaussian Mixture Model (GMM) and a maximum 

likelihood procedure, called the Expectation Maximization (EM) algorithm, which is 

used to estimate the GMM parameters. The discussion in this chapter will focus on 

the motivation, advantage and limitation of using this statistical approach. Several 

considerations for implementing the EM algorithm on actual hardware will be 

discussed at the end of this chapter.

The chapter is organized as follows: Section 2.2 serves to describe the form of 

the GMM and the motivations to use the GMM in clustering the action potentials 

from different neurons. In section 2.2, first a description of the Gaussian Mixture 

Model is presented. Then several assumptions and limitations of using the GMM in 

spike sorting are discussed. Section 2.3 introduces the clustering of the action 

potentials based on Bayesian decision boundaries. Finially, the EM algorithm and 

derivation of the GMM parameters estimation procedure are presented.

Mixture Model for Spike Sorting .

The Gaussian Mixture Model (GMM)

A Gaussian mixture density is a weighted sum of a number of component 

densities. The Gaussian Mixture Model has the ability to form smooth approximations 

to arbitrarily shaped densities. Figure 2-1 shows a one-dimensional example of the


10

GMM modeling capabilities. An arbitrary non-Gaussian distribution (shown in solid 

line) is approximated by a sum of three individual Gaussian components (shown by 

dashed lines).

Q ---------1— I i ^  I ___  ^
0 10 20 30 40 50 60 70 80 90

Figure 2-1 Non-Gaussian density estimation using a GMM

The complete Gaussian mixture density is parameterized by the mean vectors 

( //) , covariance matrices (Z )  and weights ( a  ) from all component densities which 

are represented by the notations O = {df,//,Z}. The model is given by Eq. 2-1,

M

p{x \0 )  = Y JOClp{x\ol) Eq. 2-1
i=i

where p(x\0) is the probability of x given model O, Jt is a D-dimensional random 

vector, p(x \o t) (i = 1,...,M) are the component densities and a, (i = 1,...,M) are the


11

mixture weights. Each component density is a D-variant Gaussian function given by 

Eq. 2-2,

p ( x \ ° l) I f  X;1 (x-Mi)
(27t)r/2 i Z ; r e

Eq. 2-2

where /ii is the mean vector and £ /  is the covariance matrix of each individual 

Gaussian model O1.

M

The mixture weights «/ satisfy the constraint that ^ a l =I ( cc, > 0), which
/=1

ensures the mixture is a true probability density function. The Oil ’s can be viewed as 

the probability of each component.

In spike sorting, action potentials generated by each neuron are represented by 

each component in a Gaussian Mixture Model. Details about using GMMs to model 

the spike sorting process will be described in the next section.

GMMs for spike sorting

Since action potentials are observed with no labeling from neurons they arise 

from, spike generating can be considered to be a hidden process. To see how the 

hidden process leads itself to modeling the spike waveforms, consider the generative 

process in Figure 2-2.


12

The characteristic White Noise
spike shapes

No I neuron
White Noise

Spike Sorting 
System

No 2 neuron White Noise
Spike Train

No 3 neuron White Noise

No 4 neuron

Figure 2-2 Illustration of the spike generating process.

The neural spike train is considered to be generated from a hidden stochastic 

process. Each spike is generated from one of the neurons according to an a priori 

discrete probability distribution {«,} for selecting neuron i. If we assume that the 

action potentials from each neuron has its own characteristic waveform shape (mean 

value) and addition of white noise with different variance (covariance), observations 

from each neuron can be modeled by a multi-variable Gaussian model as seen in 

Figure 2-3. The observation sequence thus consists of feature vectors drawn from 

different statistical populations (each neuron) in a hidden manner.


13

Figure 2-3 A multi-variant Gaussian model of the spike waveforms generated from a 
particular neuron. The solid line shows the mean vector of this model while the 
dashed line shows the three ^ boundary according to the covariance matrix. All the 
waveforms in this model are shown in the background.

Furthermore, if we assume that action potentials from different neurons are 

independent and identically distributed [16], the observation vector distribution can be 

given by a Gaussian Mixture Model (Eq. 2-1). Thus, the GMM results directly from a 

probabilistic modeling of the underlying spike generating process.

Spike sorting using the GMM

In the last section, we describe how to use the GMM to model the spike 

generating process. From the perspective of spike sorting, the goal is to estimate the 

parameters of the GMM and classify each action potential with respect to each neuron


14

class in which the action potential is generated from. To do this, the first step is to 

estimate all the parameters, including all the priors’ probabilities {«,} together with 

the means and covariance of each Gaussian component. Maximum likelihood (ML) 

parameter estimation is the method we will use to find these parameters with a set of 

training data. It is based on finding the model parameters that are most likely to 

produce the set of observed samples. The maximum likelihood GMM parameter 

estimates can be found via an iterative parameter estimation procedure, namely the 

Expectation Maximization (EM) algorithm. EM algorithms have been widely used in 

estimating the parameters of a stochastic process from a set of observed samples. 

Details about ML and EM algorithm will be discussed in the next section, 

i After estimation of the model parameters, classification is performed by the 

Bayesian decision rule. The probability that each observation ( x ) belongs to a 

particular classe ( Oj) is calculated by using Bayes’ rule (Eq. 2-3),

Y uP ( A 0k)P(Ok)
k=\

Each individual density p{ot | x) is determined by the means and covariance of 

each component density using Eq. 2-2. The prior probability P(Oi ) is identical to the 

weights o.i in Eq. 2-1.

Since the cluster membership is probabilistic, the cluster boundaries can be 

computed as a function of a confidence level. The spikes that don’t fall into any 

clusters will be considered as noise, or overlapping spikes that can be dealt with later.


15

Mathematically, it can be shown that if the model is accurate, the Bayesian decision 

boundaries will be optimal, resulting in the fewest number of misclassifications [23].

X

log P(O1Ix)

log p(o2|x)

Cluster M
0 M  =  G M - I j M - a M )

Max
selection

Cluster
Result

Figure 2-4 Block Diagram of the Bayesian’s decision rule used for spike sorting 
system. The log probability for a spike being in a cluster is computed for each 
individual cluster and the highest scoring cluster is the identified neuron.

Figure 2-4 shows a block diagram of the Bayesian’s decision rule applied to 

spike sorting using GMM. The input to the system is a vector representing an action 

potential. The probability that an action potential belongs to a class is calculated using 

Eq. 2-3 over all the classes. The vector is classified into a certain class whenever the 

vector falls into the class with a high confidence level. If the waveform doesn’t 

belong to any cluster with high probability, the vector will be considered as noise and 

will be thrown away. It is also possible that the waveform results from overlapping 

spikes and could be stored for further inverstigation.

The advantage and limitation of the GMM spike sorting system

GMM is a statistical approach for the spike sorting problem. It defines a 

probabilistic model of the spike waveforms. A big advantage of this framework is that


16

it is possible to quantify the certainty of the classification. By using Bayesian 

probability theory, the probability of both the spike form and number of spike shapes 

can be quantified. This is used in spike sorting because it allows experimenters to 

make decisions about the isolation of spikes. Also by monitoring the classification 

probabilities, experimenters will be able to tell possible changes in experiment 

conditions. For example, a drop in probability may indicate electrode drift or a rise in 

noise levels.

In spite of the advantages of the GMM approach, there are several inherent 

limitations of this statistical model for spike sorting. The Gaussian mixture model 

makes the assumption that the noise process is Gaussian and invariant. These two 

cases are not necessarily true in spike sorting. For example, with the presence of 

burst-firing and overlap spikes, the noise may show larger variation at the tails. 

Another assumption made is that all neurons fire independently and all the noise is 

uncorrelated. It is likely that in a local area, neurons have some correlation to each 

other. We could use a more complete statistic approach which includes the correlation 

between neurons to model the spike generating process [24]. However, this is outside 

of the scope of this thesis.

Maximum likelihood and the Expectation Maximization Algorithm

This section describes an unsupervised maximum likelihood procedure for 

estimating the parameters of a Gaussian mixture density from a set of unlabeled 

observations. A maximum likelihood parameter estimation procedure is developed for 

the GMM. First, a brief discussion of maximum likelihood estimation and the use of 

the FM algorithm for GMM parameter estimation is given. This is followed by the


17

derivation of the mixture weight, mean, and variance estimation equations. Finally, a 

brief discussion of the complete iterative estimation procedure is given.

Maximum Likelihood Estimation

Maximum likelihood parameter estimation is a general and powerful method for 

estimating the parameters of a stochastic process from a set of observed samples. We 

have a density function p{x 10 ) that is governed by the set of parameters 

O = {«,//, X} . We also have a data set of size N, supposedly drawn from this 

distribution, i.e. X  ={xl,x1,xi ,...,xN} . We assume that these data vectors are 

independent and identically distributed (i.i.d.) with distribution p(x  10 ) . Therefore, 

the resulting probability for all the samples is

p{X  10) = f t  Pixl 10 ) = (T(C) I X) Eq. 2-4

This function £ ( 0 \ X )  is called the likelihood of the parameters given the data, 

or the likelihood function. The likelihood is treated as a function of the parameters O 

where the data X is fixed. For maximum likelihood parameters estimation, our goal is 

to find the O that maximizes ^(O | X ) . That is, we wish to find 0*  where

0* = argmax ( 0 1 X) Eq. 2-5
o

If p (x |0 )  is simply a single Gaussian distribution and 0  = (X<72) . Then we 

can set the derivative of the £  function to zero and solve the likelihood Eq. 2-6 

directly for p, and cr2.

EC. 2-6
3 0


18

However, in our case using the GMM p(x  10 ) is a weighted sum of several 

single Gaussian distributions. Attempting to solve Eq. 2-6 directly for the GMM 

parameters, O = {«,//,E } , i = 1,...,M, does not yield a closed form solution [25]. 

Thus we must resort to more elaborate techniques.

The Expectation-Maximization ('EM) algorithm

The maximum likelihood GMM parameters estimates can be found via an 

iterative parameter estimation procedure, which is called the Expectation- 

Maximization (EM) algorithm [25]. The EM algorithm is a general method of finding 

the maximum-likelihood estimate of the parameters of an underlying distribution from 

a given data set when the data is incomplete or has missing values. There are two 

main applications of the EM algorithm. The first use of the EM algorithm occurs 

when the data indeed has missing values. The second application occurs when 

optimizing the likelihood function becomes analytically intractable. This latter 

application is used for the GMM case where the analytical solution for the likelihood 

equation can not be found.

The idea behind the second application of the EM algorithm is to simplify the 

likelihood function £  by assuming the existence of additional but hidden parameters. 

We assume that data X  is observed and is generated by some distribution. We also 

assume that there is another hidden data set Y, which will be defined in section 2.3.3. 

We call Z = {X,Y} the complete data set. The density function for the complete data 

set is:

p{Z 10) = p{X,Y\  0) = p(Y I X, 0 ) p ( X  10) Eq. 2-7


19

With this new density function, we can define a new likelihood function,

C(0\Z)  = C (0 \X ,Y )  = p (X ,Y \  0) Eq. 2-8

which is called the complete-data likelihood. Note that since 7  is a hidden variable 

that is unknown and presumably governed by an underlying distribution, the 

likelihood function is also a random variable. The EM algorithm first finds the 

expected value of the complete-data log-likelihood with respect to the unknown data 

Y given the observed data X  and the current parameter estimates. That is, we define:

6 ( 0 10 s) = £[logp(X , 7 10 ) I X ,0 s ] Eq. 2-9

where Q(0]0g) is the expected value of the log-likelihood with respect to the 

unknown data 7, 0 s are the current parameters estimates and O are the new 

parameters that we optimize to increase Q. Note that in this equation, X  and O8 are 

constants, and O is a normal variable that we wish to adjust. The evaluation of this 

expectation is called the E-step of the EM algorithm. The second step is the M step. 

This step is to maximize the expectation we computed in the first step. That is we find:

6 s = arg max £ (0 ,0 s) Eq. 2-10
o

These two steps are repeated as often as necessary. The maximum likelihood 

parameters estimates have some desirable properties such as asymptotic consistency 

and efficiency [26]. This means that, given a large enough sample of training data, the 

model estimates will converge to the true model parameters with probability one. So each 

step of iteration in EM algorithm is guaranteed to increase the log-likelihood and the 

algorithm is guaranteed to converge to a local maximum of the likelihood function [26].


20

Expectation Maximization for the Gaussian Mixture Model

In the spike sorting case, the GMM model consists of M  classes, each class 

represents a neuron that has been recorded. Let X  = {xl,x2,x3,...,xN} be the sequence of

observation vectors, i.e. the measured waveforms. In this case, we assume the following 

probabilistic model:

M

p{x \0 )  = ^ i P ( A o l) Eq. 2-11
NI

M

where the parameters are O = I a ^ a 2,...aM,Ov O1,...,oM] such that =1 and each
Z=I

O1 is a single Gaussian density function. Then the log-likelihood expression for this 

density from the data set X  = [Xv X2,X3,...,Xi) is given by:

N N M

iog(f( 0 1 x ) )  = log J I p N  IO) = J i o g Q T ^ p N  | O1)) Eq. 2-12
z = I  N I  N I

which is difficult to optimize because it contains the log summation. Now, we introduce a 

new set of data T. F is a sequence of data Y = {yv y2, y3,..., yN} whose values inform us

which neuron generated the observation vector Xi . This variable Y is considered to be a 

hidden variable. If we assume that we know the values of F* the likelihood becomes

log(f(0 | Z,F)) = i] lo g ( fN  I O j f  w )  =  f  l o g ( %  N  IN ) )  % 2-1%
N I  N I

Since the values of Y are unknown to us, we don’t actually know which neuron 

generated each spike. However, if we assume Y is a random vector which follows a 

known distribution, we can still proceed.

The problem now is to find the distribution of the hidden variable Y. First we guess 

that parameters O8 are appropriate for the likelihood £ ( 0 s). Given O8, we can easily


21

compute Py (Xi 10s) for each data vector Xi for each component density y . . In addition, 

the mixing parameters, CKy can be thought of as prior probabilities of each mixture 

component, that is CKy = P(Oj ) . Therefore, using Bayes’s rule, we can compute:

p(y,.|x,.,Os) atTlP y M 0D
M

Z X f a (a, l < )
yj=l

Eq. 2-14

p (F |Z ,O s) = ]qp(y,. |x,.,Og) Eq. 2-15
I = I

Here we assume that each neuron fires independently from each other. Now we 

have obtained the desired marginal density by assuming the existence of the hidden 

variables and making a guess at the initial parameters of their distribution. We can 

now start the E step in the EM algorithm and using this marginal density function to 

calculate the expected value of the log-likelihood of the whole data. The expected 

value is found to be (The derivation can found in Appendix A):

M N  M N
<2(0,0 8) = Z 1 Z loSia I W  I-X,-,O*) + Z Z <X-1 °i ))p(l IX1-,0 s ) Eq. 2-16

Z = I  Z = I  Z = I  Z = I

With the expression for the expected value of the likelihood and we can go to the 

next M step to maximize this expected likelihood with respect to the 

parameters O = {CK,, , Z , } . We can maximize the term containing CKz and the term

containing O1 independently since they are not related.

To find the expression for CKz, we have to solve Eq. 2-17 with the constraint of Eq 2-

18:


22

d C C Z loSttiV )P(/ 14 ,0 " ) )
Z=I Z=I_________________

d(a ,)
Eq. 2-17

Etif/ = 1
Z = I

We introduce the Lagrange multiplier / I , and solve the following equation:

Eq. 2-18

M N  M

dC C E loSCtifZ )p(l \xi,Og) + AC^al - I ) )
Z = I  Z = I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Z = I _ _ _ _ _ _ _ _ _ _

d(Of,)

and we get the expression

Eq. 2-19

I N
ti '̂ = T r E x Z I  4 , 0 * )  Eq. 2-20

-W j=i

To get the expression of //, and E z, we have to go through a similar procedure, but

the computation is much more complicated. To calculate these two parameters for the 

Gaussian distribution function, we’ll need to take derivatives of a function of matrices. 

The detailed procedure can be found in [25], and here we just give the results.

EXfCZ 14 ,0 * )
Pt = - 4 ---------------------  Eq. 2-21

E X Z 14 , 0 ' )
Z = I

E x z i 4 , o ' X 4 - / / r ) ( ^ - ^ r ) '
E z= - ----------- jf----------------------------- Eq. 2-22

E X Z 14 , O ')
Z = I

Collectively, Eq. (2-20), (2-21) and (2-22) form the basis of the EM algorithm 

for iteratively estimating the parameters of a GMM. The detailed steps to implement 

these equations will be discussed in the next section.


23

Implementation of the EM algorithm

As shown in Figure 2-5, the procedure to implement the iterative EM algorithm 

consists of the following steps:

1) Initialization: Initialize the model parameters O(O)

2) E-step: Calculate the expected likelihood of the given data based on the 

current parameters 0(i). Calculate the increase of likelihood. If the 

difference is below a certain threshold, stop the iterative process.

3) M-step: Estimate new model parameters 0(i+l) using current model 

parameters O(i) via estimation equations. Replace current model parameters 

with new model parameters, which are the new parameters for the next E- 

step.

/  Is Likelihood x  
L(i)-L(i-1)>thresholi

STOP

Initialization
Q = Q ( O ) Estimate Q('+1> 

using Q(')

M-Step

Calculate likelihood 
of the data using QM

Figure 2-5 Diagram of EM algorithm for GMM parameter estimation.

Since the new model parameters obtained from the estimation equations 

guarantee a monotonically increasing likelihood function, the algorithm is guaranteed 

to converge to a stationary point of the likelihood function. It has been argued that 

that the convergence of the EM algorithm to the optimal parameters of the mixture is


24

slow if the component distributions overlap considerably [25]. But if the clusters are 

reasonably well separated, which is typically the case in spike sorting, the 

convergence of the likelihood of the mixture model is rapid and the mixture density 

approximates the true density.


■ 25

CHAPTER 3

IMPLEMENTATION IN MATLAB 

Introduction

We have shown how to use the EM algorithm to estimate the parameters of 

Gaussian Mixture Model. In this chapter, we will find that a number of practical 

issues have to be examined closely before we can achieve robust parameter estimates. 

Moreover, since we will use the full-sampled waveforms as input, the high 

dimensional training data will cause numeric problems such as underflow due to 

limitations in precision. We will present a revised version of EM algorithm to solve 

these problems which we implement in software and hardware.

The chapter is organized as follows: First, the training data we used in this paper 

will be introduced. The practical problems related to the EM and applications on 

GMM spike sorting system are discussed in the next section. These problems include 

initialization of the GMM model, cluster number selection and algorithm evaluation. 

In the next section, the numeric underflow problem caused by high dimensional input 

data will be addressed. We compare three different approaches to handle this problem 

and present a log version of the EM algorithm with diagonal covariance matrices. 

This new EM algorithm gave the best result among the three methods. At the end of 

this chapter, we will show the resulting model trained by this log EM algorithm and 

its performance on a PC.


26

Neural Signal Data from a Cricket

The signals we use for this thesis were recorded from the cricket cereal sensory 

system [27]. Data was recorded using an NPI SEC-05L amplifier and sampled at 50 

KHz rate with a digital acquisition system running on a Windows 2000 platform. The 

stimuli for these experiments were produced by a specially-designed device which 

generated laminar air currents across the specimen’s bodies. Details about the 

experiments can be found in [27]. After raw data was collected, neural spikes were 

detected by first identifying the segments of having no spikes. A noise model was 

generated by assuming that the noise was Gaussian. Then, the spikes were detected if 

the distance between the recorded data and the noise model was bigger than a 

manually set threshold [7], After the threshold was set, we got rid of any waveforms 

that were obviously not action potentials and left 720 spikes for training of the EM 

algorithm. The 720 spikes are show in Figure 3-1. Each spike lasts for 3.2ms and has 

160 sampling points. This set of 720 160-dimensional vectors was used in the EM 

algorithm.

Figure 3-1 Plot of all the 720 neural spike waveforms.


27

Practical Problems Related to the EM algorithm 

Model Number Selection

The number of Gaussians for the mixture model must be selected before using 

the EM method. However, in spike sorting, we do not have a priori information of 

how many neurons are producing action potentials. So we have to estimate the model 

number using the whole data set.

Based on the probabilistic foundation of GMM, we can view the choice of the 

optimal cluster number as a problem of fitting the data by the best model. This 

problem can be solved by applying the maximum likelihood estimate. However, the 

likelihood is a monotonically increasing function of the number of parameters. In 

other words, the likelihood will increase as the model number becomes larger. A 

number of methods have been proposed to determine the model number using 

maximum likelihood criteria by adding some penalty functions which can compensate 

for the monotonically increasing of the log-likelihood function of the model number 

[24]. However, it is hard to define the suitable penalty term and not very efficient to 

implement.

To determine the model number in this paper, we use the method presented in 

[12]. We calculate the likelihood of the data set from clusters ranging in size from I to

7. As shown in Figure 3-3, the initial increment of likelihood is large but then 

becomes small. This is because as the number of cluster approaches the true number 

of the model, the increment of likelihood becomes small. So we picked the point 

when the increase ( J )  in likelihood fell below a certain threshold for 2 continuous 

points. Eq. 3-1 shows the actual calculation. The threshold S  was set to 4.


28

IogCfO + 2)) -  IogiC(t + !))<<? and log(f (? + 1)) -  log(f (/)) < S  Eq. 3-1 

From Figure 3-2, we can see that 5 models is a reasonable choice for the model 

number. Also, keep in mind that you can always merge or split clusters using more 

sophisticated methods [12,28],

Number selected

Number of clusters

Figure 3-2 The increase in likelihood due to the increase of cluster numbers. The 
amount of the log likelihood increase is shown in the bar plot. The dashed line shows 
the threshold value.

Initialization of the FM Algorithm

The selection of initial model parameters will affect the final results. This is 

because the likelihood surface associated with GMM tends to exhibit multiple local 

maxima. Thus, different initialization points will lead to different local maximum 

results. A generally used method is to apply the K-means clustering algorithm first 

and then use its output as the initial value of the mixture model’s parameters [16].


29

However, the K-means method itself is also sensitive to initialization and does not 

guarantee to get the right initialization for the mixture model. Furthermore, K-means 

algorithm is computationally expensive to implement.

Another more straight forward method is to randomly select initialization from 

the data set. It has been shown that this method gives the same performance as the K- 

means method [24]. In this paper, we initialize the means to randomly chosen data 

points and pick a single covariance matrix for all the clusters. We set values of the 

initialized covariance matrix to be reasonably large so that each cluster can see all the 

data at the beginning of the training. We randomly picked 100 starting points from the 

data set and chose the result that provided the largest likelihood. Figure 3-3 shows the 

means for the 5 initialization clusters. The covariance matrix { X, } and prior

probabilities {a, } of all the clusters are set to be equal. All the FM training processes 

in this paper start from this same initialization so that all the results are comparable.

Figure 3-3 Initialization of the EM algorithm


30

Evaluation of the Algorithm

One hard problem of spike sorting is that it is hard to evaluate the clustering 

results. Signals from different cells may be very similar and difficult to distinguish 

even for trained neurophysiologists. It is nearly impossible to create reliable expert- 

based validation data sets. Hence, simulated neural data are often used for evaluating 

the spike sorting algorithm. Recently, simultaneously intro- and extra-cellular 

recording technology has provided the possibility to evaluate the algorithm on real 

data.

In this paper, we focus on how to implement the GMM based spike sorting 

algorithm on hardware. Since evaluations for both simulated and real data show that 

the GMM approach gives very good results [16,29], we just concentrated on the 

software and hardware implementations. When targeting FPGAs, the use of fixed 

point data types is important because of the increase in speed that can be attained. 

Since using fixed point data will lose a certain amount of precision as compared to 

floating point data, we evaluated the FPGA implementation with respect to the 

floating point implementations.

Numeric Considerations in Matlab

The Log EM Algorithm

To reserve all the original information in the spike waveforms, we used the full- 

sampled spikes as the clustering input. The implementation of the training process of 

EM algorithm should be able to handle the 160 dimensional inputs. However, if we 

look at the EM algorithm closely, we will find many places where a number of small 

numbers (between 0 and I) are multiplied together. This easily leads to a numerical


31

zero in Matlab, which is defined to be 2.225le-308 or less for double precision data. 

The calculation of the likelihood of the whole data set and the determination of 

covariance matrix are two places where a numeric zero problem occurs most often. To 

compensate for this, we have to transform these operations into the log domain so that 

the multiplications turn into additions that prevent the underflow problem. Here we 

present a log version of the EM algorithm which solves this problem.

Given the current model, we calculate the probability of each data point in each 

cluster. The original equations are listed as follows:

P(X^O1) I  1  (* i  " " A l )

M m ,  r e
Eq. 3-2

P(Xi IO) = ^ a 1P(Xi Iol) Eq. 3-3

C(OlX) = Y lp (X i IO) Eq. 3-4
i = i

Notice that in Eq. 3-2 and Eq. 3-4, there are product operations that will cause 

underflow. By changing them into the log domain, the equations are:

ln(p(Xi [O1) ) - - - (xt - /J1)7 Tj11 (Xi - / / , ) — — ln ( 2 ^ ) - —ln ( |E , |) Eq. 3-5

M

\n(p(xi IO)) = lnQ^ia leln(p<'x̂ °l)))
Z=I

Eq. 3-6

ln (^ ( 0 J X)) = f jH p ( x i IO)) Eq. 3-7
i=i

After changing into the log domain, the products in Eq. 3-2 and Eq. 3-4 are 

turned into additions and subtractions. Exponential operations are also eliminated


32

except in Eq. 3-6. Since all the eln(p(x'l°')) terms are very small, this can still cause 

underflow problems Qn(O)). To handle this, we need to add another normalization 

process to calculate Eq. 3-5. We do a normalization across all the clusters by 

subtracting all the ln(p(x, | o,)) terms by the maximum value across the clusters. We

then store the max(\n(p(xi Iol) ) ) for future use. After subtraction, the exponential

value e]n(p<-xM  will be guaranteed between [0, I]. We then performed the 

multiplications and additions. The log version of the middle results can be obtained by 

applying In to it. The last step is to add the max( Inipixi | o,))) back to the middle

result. This way, without changing the final result, we go around the numerical zero 

problems and get the same results.

M

IniPixi IO)) = ln ( £  + j3 Eq. 3-8
i=i

P  = max Pixi IO1) Eq. 3-9
0 I

After getting the likelihood of the whole data and before updating the parameters, 

the probability of being in one cluster given input Xi has to be calculated. This 

probability can be calculated by Baye’s rule in Eq. 3-10.

PloAx,) = % XAO‘)Pi0,) = P M O ') m )  Eq. 3-10
Z a lP M o J
k~\

However, this equation can also exhibit underflow problems if all the pix  | Oji) ’s

are very small, which will occur if a spike doesn’t belong to any clusters. So we still 

need to convert Eq. 3-10 into the log domain as follows:


33

1° P(°i Ixi) = In P^xi \ O1) + \n P(O1)-Xn p(xi 10 ) Eq. 3-11

Since we have already completed In p(x, 10 ) from Eq. 3-8, it is fairly easy to get 

the result of Eq. 3-11. We can use exponentiation to get p(o, | Xi) back into the 

normal domain and begin the parameter updates. All the parameter updates of are 

done in the normal domain since no underflow problems will occur in this process. 

The new parameters will be used to calculate the likelihood for the whole data set 

again in the log domain to determine whether the iterative process should stop. The 

whole process can be seen in the Figure 3-4.

Log Domain

STOP

/ I s  log Iikelihooo'' 
L(i)-L(i-1)>thresholi

Initialization
Q = Q ( O ) Calculate log likelihood 

of the data using QW

E-Step

Estimate Q(i+1) 
using QW

M-Step

Figure 3-4 Diagram of the revised EM algorithm. The E-step implemented in the log 
domain is shown inside the dashed rectangle.


34

Using A Diagonal Covariance Matrix

While computing Eq. 3-2, we need the determinant of the covariance matrices 

{ Z,™1}. However, it is difficult to get the determinant of the full covariance matrix

since the data dimension is large. All the values of covariance matrices are 

approximately IO"2, so during the training process the matrices will easily become 

singular because of limited precision.

We tried three different approaches to solve this problem. The first method was 

to recondition the covariance matrices whenever a singular matrix appears. Every 

time when there was a numeric zero in the covariance matrix, we added a very small 

number (IO'4) to prevent it from being zero. This way, the algorithm could go on to 

the next step until it converged. In the second approach, we reduced the data 

dimension by using Principal Component Analysis. As discussed in Chapter I, PCA 

can find the directions that represent the most variation of the original data. We used 

the first 10 components that could represent 98% variation of the data set. By 

reducing the dimension to 10, there was much less chance that full covariance matrix 

couldl become singular. The last method was to use a diagonal matrix instead of the 

full covariance matrix. Using a diagonal matrix can enabled log calculations of the 

determination and prevented numerical zeros. By using the diagonal covariance 

matrix, we made the assumption that there was no correlation between each 

dimension in the 160 dimensional space. Making this assumption can cause a loss of 

information in the original data set. However the first two methods also lost 

information by recondition and reducing the vector dimensions. We tested all these 

methods and the best method was the diagonal matrix as shown in the next section.


35

lull cowrisnc# matrix with record lion r « 0.001 PCA before fol cowriance matrix with I n t  6 components

diagonal covariance matrix

Figure 3-5 Results of three different approach. The diagonal covariance approach was 
the only method that captured the small cluster.

As we mentioned in the previous section, it was hard to validate the cluster result 

since we didn’t actually know the typical spike shapes that came from individual 

neurons. However, by looking at the mean vectors of each cluster, we had a good idea 

about the ability of each approach to capture the clusters among the data. Figure 3-5 

shows all the cluster means of the three methods. All the spikes are in yellow and the 

mean vector of each cluster is shown in a different color. While all of the three 

approaches caught the main clusters, the diagonal matrix approach did the best in 

catching the smallest cluster (red line). The other two methods failed to catch this 

smallest cluster. Notice that in the recondition approach, the prior probabilities for the 

small clusters tend to be big. This is because adding a number to the covariance 

prevents the cluster to be small. Also notice that the biggest cluster in the


36

reconditioned and PCA method is different from the diagonal covariance matrix 

method. This is due to the inability of these two methods to distinguish small variance 

in the low-amplitude waveforms. They actually include more waveforms in this 

cluster where some spike shapes are noticeably different. Using the diagonal 

covariance matrix approach, the small variance was captured and a small cluster was 

generated to represent this specific spike shape. For our data set, the diagonal 

covariance approach lost the least amount of information in the original spike 

waveforms. We chose this approach for all hardware implementations.

Implementation Result in Matlab

Using the initialization condition mentioned in section 3.3.1, the training process 

for 720 160-dimensional inputs took 8 iterations to converge. The mean waveform 

and priors of each cluster are show in Figure 3-6. Figure 3-7 shows the cluster results 

of the 720 training data using the GMM.

As you can see in Figure 3-7, all the main clusters have been recognized. Cluster 

I and Cluster 2 are the clusters with the most data. Cluster 3 and Cluster 4 can be 

regarded as the noisy versions of Cluster I and Cluster 2. Their mean vectors are very 

close to Cluster I and Cluster 2. But their waveforms are noisier, thus the covariance 

of these two clusters are bigger. Cluster 5 has its own specific spike shape. It has the 

least number of spikes among the five clusters so it has the smallest prior probability. 

Overall, this 5-component Gaussian Mixture Model can pretty well model the whole 

data set from visual inspection.


37

Figure 3-6 Clustering result from Matlab.

-1.5 -

160

Figure 3-7 Clustering result for the training data. Five clusters are labeled using 
different color.


38

Matlab Performance

Since we want to implement a high-speed spike sorting system where the models 

are trained as fast as possible, we have to know how fast this algorithm can run on a 

PC. As described in Chapter 2, the whole spike sorting process can be divided into 

two phases: the model training phase and the spike classification phase. We tested the 

algorithm using the newest version (Version 6.5 Release 13) of Matlab on different 

PCs. The test results are shown in Table 3-1.

Table 3-1 Performance of the spike sorting system on several PCs

Pentium 
IE 1.2G1

AMD
1.37G1

Execution Time 
AMD Pentium

1.2G Duo2 IV3.3G3 Average Best

Training (s) 8.33 3.42 1.45 1.31 3.22 1.31
Classification (720 

data points) (s) 0.42 0.14 0.05 0.04 0.16 0.04

Classification (per 
data point) (ms) 0.58 0.19 0.06 0.05 0.21 0.05

1 The Pentium IE 1.2G and AMD 1.37G systems both have 512 MB DDRAM.

2 The AMD 1.2G Duo system has 4 GB DDRAM.

3 The Pentium IV 3.SG system has IGB DDRAM.

The performance of making classifications on a PC is pretty good. With the 

existing models, a spike waveform can be classified within less than 0.2 ms. This 

training process can meet the real time requirement since the typical neural spike 

period is around 2-3 ms. However, the training process ran much slower than the 

classification process. The best performance we got was 1.31 seconds while the 

average speed was around 3 seconds. This is quite slow comparing to the average 

neural spike period. The slow speed is due to the intensive computation needed for the


39

training of the Gaussian Mixture Model based on the EM algorithm. We needed the 

training process to be as fast as possible so that minimum time is spent in the training 

phase. As a result, PCs are not as fast as we would like in order to do this.

In the rest of the thesis, we will focus on how to speed up the training process of 

the EM algorithm. In next chapter, the parallel DSP implementation will be 

introduced. Later an FPGA implementation will also be presented.


40

CHAPTER 4

PARALLEL DSP IMPLEMENTATION 

Introduction

To speed up the GMM based spike sorting system, the first approach we used 

was to implement the EM algorithm on a parallel DSP system with 4 DSPs. Digital 

Signal Processors have become more powerful with the increase in speed and size of 

the on-chip memory. Furthermore, some modem DSPs are optimized for 

multiprocessing. The Analog ADSP-21160M DSP we used for our project has two 

types of integrated multiprocessing support, which are the link ports and a cluster bus. 

We used Bittware's Hammerhead PCI board to test our implementation. The board 

contains four Analog ADSP-21160 floating-point DSPs. We wrote a parallel version 

of the EM algorithm to take advantage of the multiprocessing capability of the board.

In the next section, the structure and components on the Hammerhead board will 

be described. The software we use to develop the system will also be introduced. In 

the rest of Chapter 4, we focus on how to parallelize the EM algorithm and how to 

effectively implement it on a parallel DSP system. In the end of this Chapter, the 

performance of the EM implementation on this DSP system will be presented.

Hardware

SHARC ADSP-21160M DSP Processor

On the Hammerhead PCI board, there are four high performance 32 bit floating 

point DSP processors, namely the Analog SHARC ADSP-21160s. This DSP features


41

4 Mbits on-chip dual-ported SRAM and the Super Harvard Architecture with the 

Single Instruction Multiple Data (SIMD) computational engine. Running at 80MHz, 

ADSP-21160 can perform 480 million math operations (peak) per second.

The ADSP 21160 has two independent, parallel computation units. Each unit 

consists of an ALU, multiplier, and shifter. In SIMD mode, the parallel ALU and 

multiplier operations occur in both processing elements. The ADSP 21160 has two 

independent memories - one for data and the other for program instructions. Two 

independent address generators (DAG) and a program sequencer supply address for 

memory access. ADSP 21160 can access both data memory and program memory 

while fetching an instruction. In addition, it has a zero overhead loop facility with a 

single cycle setup and exit. The on-chip DMA controller allows zero-overhead data 

transfers without processor intervention.

The ADSP 21160 offers powerful features to implement multiprocessing DSP 

systems. The external bus supports a unified address space that allows direct 

interprocessor accesses of each of the ADSP 21160’s internal memory. Six link ports 

provide a second method of multiprocessing. Based on the link ports, a large 

multiprocessor system can be constructed in a 2D or 3D fashion.

Bittware Hammerhead PCI board and VisualDSP++

The structure of Bittware Hammerhead PCI board is shown in Figure 4-1. This 

board has four Analog Device’s ADSP-21160 processors, up to 512 MB of SDRAM, 

2 MB of Flash Memory, and two PMC mezzanine sites. Bittware’s FIN ASIC is used 

to interface the ADSP-21160 DSPs to the 64-bit, 66 MHz PCI bus, the Flash memory, 

and a peripheral bus. The Hammerhead-PCF s four DSPs can communicate with the


42

host PC through the PCI bus. Four of the link ports are connected to other DSPs, 

which allow the DSPs to communicate through the link ports. Each processor is also 

connected to a common 50MHz, 64-bit ADSP-21160 cluster bus, which gives it 

access to the other three processors and to up to 512 MB of SDRAM.

In our application, we used all four ADSP-21160 processors on the board as well 

as the 256 MB SDRAM. We used the cluster bus to communicate between four DSPs 

and SDRAM. The cluster bus approach is relatively easy to implement. It was also 

faster than using the link ports.

We used VisualDSP++ 3.0 to develop the program for the Hammerhead PCI 

board. Most of the code was written in C. Small sections, such as the DMA transfer 

routine were written in assembly language. All the C functions we used were 

optimized for the ADSP-21160’s SIMD mode.

ijt-tiH 66 M ;; / ’C llo f.il flu

ti-brt PcnpitctaI Btn

PMC+ 
Mezzanine 

Site =,

Front Paid

Figure 4-1 Block diagram of Hammerhead PCI system


43

Analysis of the EM Algorithm

To parallelize the EM algorithm, we needed to find out where the computational 

intensive parts of the EM algorithm were. These parts should be mapped into multiple 

DSPs so that each DSP can share part of the computational load. Thus, by handling 

these parts simultaneously, we can speed up the whole system and enhance the 

performance.

We used the profile command in Matlab to analyze the EM algorithm. The profile 

function records execution time for each line in the code as well as for each function 

call in a Matlab (.m) file. By comparing the execution time, we can know which part 

of the algorithm Matlab has spent the most time on. The most computational intensive 

parts then should be parallelized on multiple DSPs.

Figure 4-2 The diagram shows most computational intensive parts in the EM 
algorithm and their execution percentage in the whole process, (a) Calculation of

Pixi I oz) (b) Update of means and covariance matrices, (c) Calculation of ^ 0' Ix'^

and ^ x' I ^  . (d) Rest of the algorithm.


44

Figure 4-2 shows the most computational part in the EM algorithm. The 

calculation of p(x, | O1) and the update of means and covariance matrices take about 

70% of all the execution time. These two parts should be implemented on multiple 

DSPs. The calculation of p(ot \ Xi ) and £ (X110 ) and rest the algorithm is relatively 

less computationally intensive. A close look at parts (a) and (b) shows that most of the 

computational operations are multiply and multiply-accumulation type of instructions. 

Thus the floating point MAC inside the ADSP-21160 will be able to speed up the 

operations.

Parallel Transformation of the EM Algorithm

To fully use the computational power of the four DSPs on the Bittware board, 

the EM algorithm was transformed to a parallel version. The most straightforward 

way is to map the computation of each cluster onto each single DSP. However, there 

were several drawbacks of this approach.

First, there are five clusters in our model while there are only 4 DSPs on the 

Hammerhead board. Hence we have to put the computation of two clusters into one 

DSP creating an uneven computational load. This will reduce the speed of the whole 

system since three DSPs have to wait while the other DSP deals with the extra 

computation for the second cluster. Secondly, the update of parameters is not 

independent among all the clusters. Before updating the parameters of each cluster, 

the probability of each data in the current model has to be calculated. In Eq. 4-1 the 

probability of each data in each cluster has to be added together across all the clusters. 

So at this point all the middle results have to be sent to one DSP. Last but not the least,


45

the limited on-chip memory requires the DSP to read data from external memory. If 

you look closely at the algorithm, for the calculation of p(xt |oz) , as well as the

update of the means and covariance, the whole data set is needed. In other words, 

during these calculations, the DSPs have to get the value of the whole data and put 

them into ALU for computing. However, with the limitation of the internal memory 

size, it is impossible to store all the data set inside a single DSP. The only option is to 

read the data set from external RAM, i.e. the on-board SDRAM. These operations 

will require extra overhead and will slow down the process. They can also disable the 

SIMD capability of DSP. Since all the DSPs and SDRAM share the same 50MHz 

cluster bus, multiple simultaneous uses of the bus will cause extra delays and even 

bus contention.

Due to these drawbacks, we decided to choose another approach. We divide the 

whole data set into small blocks and each processor handles one block of data. This 

way, we eliminated the overhead of transferring data from external memory. With all 

the data inside the DSPs, we fully used the SIMD mode of the DSPs to accelerate the 

process. This also allowed us to divide the computation evenly for each DSP 

regardless of the cluster number. By being independent on the cluster number, this 

approach will allow scaling to more DSPs. We still need to transfer data between 

DSPs due to the dependence of updating the parameters. But the transfer is limited to 

middles results of the parameters which is much smaller in size comparing to the 

whole data set. This transmission can be monitored by one master DSP. DMA mode 

was used to accelerate the parameter transfers.


46

We divided the whole data set into four blocks. Each block contained 180 data 

points. These blocks were fitted into a single DSP’s internal memory. One of the four 

DSPs was considered to be the master while other three were slaves. The master did 

the same computation tasks as the slaves except for two additional tasks: cross-data 

calculation and the management of DMA transfers. Whenever cross-data calculation 

was needed, all the middle results were sent to the master from the slaves through the 

cluster bus. When the master finished the calculation, the results were sent back to the 

slaves. If the DMA mode was used, the master was the only DSP that had the control 

of all the DSP’s DMA controllers. Synchronization between the DSPs was handled 

via semaphores. Semaphores were also used to prevent DMA bus contention. Table 4- 

1 shows all the semaphores that were used in the program. The master semaphore was 

a global semaphore that all slave processors read to know the status of the Master 

processor. Each slave processor had its own local semaphore which told the Master 

DSP their status.

Table 4-1 List of semaphores and their functions.

Function of each semaphore
______________________________SET (I)_________________CLEAR (0)
Master Semaphore Master Processor is running Master task finished
Local Semaphore 1-3 Slave Processors is running Slave tasks finished
DMA Semaphore DMA bus available DMA bus occupied

The system diagram is shown in Figure 4-3. The algorithm was divided into 6 

phases. The phases in shaded blocks were implemented on multiple DSPs. Notice that 

the master processor acts also like a slave while doing the parallel computation. Most 

of the computational intensive parts, which have been discussed in section 4-2, were


47

implemented on multiple DSPs. The parts implemented solely on the master processor 

were just basically additions. So, during the whole training process of the EM 

algorithm, most of the time, all four processors were running at full speed.

Another significant amount of time was spent on transferring data between DSPs. 

The cluster bus ran at 50MHz and was shared by the all four DSPs as well as the on­

board SDRAM. We first let the DSPs resolve the bus contention automatically. The 

on-chip arbitration logic of ADSP- 21160 processors determined which DSP was the 

current bus master based on the Bus Request (BR) pins of the DSPs. Multiple DSP 

implementations can be done in this mode. However the approach was about 700 

times slower than the other approaches because of the bus contention issue.

We then set a semaphore to exclude the use of the cluster bus. By reading this 

semaphore, only one DSP could access the bus at any time. This gave a much better 

performance than the first approach.

Finally, we used the DMA transfer mode to transfer parameters between the 

DSPs. The master processor was in charge of the DMA controllers of all the 

processors. Whenever a data transfer was needed, the master first made sure that no 

DSP was using the bus. The master then wrote into the registers of both the 

transceiver and receiver’s DMA controller. After the DMA transfer is initialized, no 

other DSP could access the cluster bus until this DMA transfer finished. This sped up 

the process and gave the best performance. The performance comparison will be

shown in the next section.


48

Phase 1:

I  Calculate p(xj|l).
2. Calculate log likelihood 

of each  data point.

Phase 2:

1. Calculate log likelihood 
of the w hole data set.

2. T est whether to stop the 
interation or not

Phase 3:

Calculate middle results 
for updating the priors and 
m eans.

Phase 4:

Update the priors and 
m ean s for each  cluster.

Phase 5:

Calculate middle results for 
updating the ovariance 
m atrices.

Phase 6:

Update the covariance  
m atrices for each  cluster.

Data Transferred:
LogLlke(Xi|L) Size: 180 per Node

Data Transferred:
LogPro(L|Xi) Size: 180 per Node

Data Transferred:
Mid_priors(L) Size: 5 pe r Node 
Mid_centers(L) Size: 800 per Node

Data Transferred:
New_priors(L) Size: 5 per Node 
New_centers(L) Size: 800 per Node

Data Transferred:
Mld_covars(L) Size: 800 per Node

Data Transferred:
New_covars(L) Size: 800 per Node

Master

Master

M aster

Figure 4-3 Diagram of multiple DSP implementation of EM algorithm


49

Performance on the Parallel DSP Platform

Table 4-1 gives the execution time for 8 iterations of the EM algorithm until it 

converged. Performance on a single DSP and multiple DSPs are both shown. Also 

different data transfer methods are compared. The results show that the computational 

speed of the 4-DSP system was almost 4 times better than single DSP. Also, the DMA 

transfer sped up the whole system and gave the best performance. Notice that the bus 

contention auto resolution method in the multiple DSP system significantly slowed 

down the performance of the whole system. Using semaphores was necessary to 

accelerate the data transferring process.

Table 4-2 Execution time for :8 iterations of EM algorithm on single and 4 DSP system

Data Transfer Methods Single DSP (sec) Four DSP (sec)
Normal Read 14.5 13.5
DMA transfer 15.9 3.4
Bus Contention N/A >7000

Table 4-3 shows the performance of the same EM application between the DSP 

and Personal Computer (PC) configurations. All performances on the PC were from 

Matlab implementations. The performance of the parallel DSP implementation was at 

the same level as the average PC configuration but not as good as the high end PCs.

Table 4-3 Performance of the EM algorithm on DSP system and PCs

Single DSP Four DSP PC Average1 PC Best2 3
Execution Times 13.5 3.4 3.22 1.31

1 Average from following four systems: (I) AMD 1.37GHz, 512 MB memory; (2) Pentium III 
1.0 GHz, 512 MB memory, (3) AMD 1.2GHz Duo Processor, 4  GB memory; (4) Pentium IV 
3.2 GHz, I GB memory

2 Best performances from Pentium IV 3.2 GHz system with IGB memory.

3 VisualDSP++ 3.0 was used in the DSP implementation.


50

There are two reasons for why the parallel DSP method was similar if not slower 

than the PCs. First, the core speed of DSP is much less than the high end PC CPUs. 

The core speed of ADSP 21160 is running at 80MHz while the core clock of Pentium 

IV is running at 3.3 GHz. The high-end PC is about 40 times fast than the DSP. In 

spite of the optimal structure of DSP core for signal processing operations, the overall 

performance of floating point operations on a high end PC is much better than on 

individual DSPs. Another reason is the bus speed. Since the lack of on chip memory 

of DSPs, the data has to be transferred between DSPs through cluster bus. The speed 

of cluster bus is at 50MHz which is much slower comparing to the front bus on the 

PCs which is up to 133MHz. Although using the DMA transfer mode, the cluster bus 

is still the bottle neck of speed in the multiple DSP system.

Performance on the DSP system could probably be enhanced by more manual 

tuning on the assembly language level. Using DSPs with high core speed or clustering 

more DSPs can also improve the performance. However, the improvement will be 

very limited and too much time and effort will be required. So, to gain better 

performance for the real time spike sorting system, we turned to another solution, 

namely Field Programmable Gate Arrays (FPGA).

In the next Chapter, we will implement our EM algorithm on a Xilinx Virtex II 

FPGA system. To implement the EM algorithm on FPGAs, a fixed point version of 

the algorithm was first developed. Also, specialized math operations have to be 

transformed to facilitate the implementation of the algorithm on a FPGA. These 

problems will be addressed in the next Chapter and the implementation results will be 

presented.


51

CHAPTER 5

FPGAIMPLEMENATION 

Introduction

Field Programmable Gate Arrays (FPGAs) are reconfigurable hardware that can 

be changed in electronic structure either statically or dynamically. Because the 

implementation is on the actual hardware, FPGAs can obtain a significant 

performance improvement over the software implementations based on DSP chips or 

PCs [30].

One drawback of a FPGA implementation is that, for complex algorithms like 

the FM algorithm, it usually takes a long time to design and optimize the algorithms 

for the FPGA structures. We used a recently developed tool, namely AccelFPGA [34], 

to shorten the design time. AccelFPGA can compile Matlab code directly into VHDL 

code. By using AccelFPGA, we could focus on optimizing the algorithm in Matlab 

without dealing with hardware details inside the FPGA. The design cycle of the 

FPGA implementation using this method can be reduced from months to weeks. The 

FPGA implementation through AccelFPGA is not as optimized as directly VHDL 

coding. However, for a prototype design, AccelFPGA can provide a time-efficient 

FPGA solution with reasonably good performance. We will show that our 

implementation on a FPGA gives much better performance than on either the PC or 

the parallel DSP platforms.

The structure of Chapter 5 is as follows: In the next section, the internal structure

of Xilinx 3000 FPGA will be introduced. Then we will introduce the AccelFPGA


52

compiler for FPGAs. In section 5.4, we address some practical problems of FPGA 

implementations. A fixed point version of the FM algorithm is presented and 

compared to the floating point version. The transformation of mathematics operations 

and parallelism of the algorithm for FPGA implementation are also discussed. In the 

last section, the performance of the EM algorithm on the FPGA platform is presented 

and compared to the parallel DSP platform and PCs.

The FPGA and Its Structure 

Field Programmable Gate Arrays

A Field Programmable Gate Array is an ASIC where logic functions can be 

changed to suit a user’s needs. The circuit elements within these devices can be 

configured repeatedly because the current configuration is stored in SRAM cells [31]. 

The FPGAs can be configured to operate in a nearly unlimited number of user- 

controlled ways by downloading various configuration bit streams.

The FPGA devices are usually organized into generic logic blocks that can 

perform a limited number of logic functions. The specific function being performed is 

determined by the configuration bits preloaded into the hardware. These logic blocks 

are connected by programmable routing, which is also determined by the 

configuration bits.

The particular structures of the logic blocks and routing are different inside the 

FPGAs from different manufacturers. The FPGA we use for our application is the 

Xilinx Virtex II series FPGAs. The structure of this FPGA is described in the

following section.


53

Structure of Xilinx’s Viitex IIFPGA

The Xilinx Virtex II family is a FPGA developed for high-speed high-density 

logic designs with low power consumption. As shown in Figure 5-1, the 

programmable device is comprised of input/out blocks (IOBs) and internal 

configurable logic blocks (CLBs). In additional to the internal configurable logic 

blocks there are three major elements:

1) Block SelectRam memory provide 18 Kbit of dual-port RAMs

2) Embedded 18 x 18-bit dedicated multipliers

3) Digital Clock Manager (DCM) blocks

The functionality of these' elements will be introduced later. All of these 

elements are interconnected by routing resources. The general routing matrix is an 

array of routing switches. Each programmable element is tied to a switch matrix, 

allowing multiple connections to the general routing matrix. All programmable 

elements, including the routing resources, are controlled by values stored in static 

memory cells. These values are loaded in memory cells during configuration and can 

be reloaded to change the functions of the programmable elements.

Configurable Logic Blocks (CLB) are the basic functional units of Xilinx FPGAs. 

Theses blocks are organized in an array and are used to build combinational and 

synchronous logic designs. Each CLB element comprises 4 similar slices. The four 

slices are split in two columns with two independent carry logic chains and one 

comman shift chain. Figure 5-2 shows the details of a slice inside one CLB. Each 

slice includes two 4-input function generators, carry logic, arithmetic logic gates, 

wide function multiplexers and two storage elements. Each 4-input function generator


54

can be programmed as a 4-input LUT, 16 bits of distributed SelectRAM memory, or a 

16-bit variable-tap shift register. Each CLB is wired to the switch matrix to access the 

general routing matrix to be interconnected to other CLBs.

DCM DCM IOB

Global Clock Mux-

CLB Block SeIectRAM

Configurable Logic

P r o g r a m m a b l e  I / O s

□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□ D D i
D D i

D D i
D D i
D D i
D D i

Figure 5-1 Virtex II architecture overview [31].

SHIFTIN COUT

ORCY
- O  SOPOUT

YBMUX

GYMUX

XORG

DYMUX

Shared bohwoen

WSG
WE|ZO|
WE
CUC

WSF

SR REV

□  LATCH

MCI 5

ON

Figure 5-2 Slice Structure of Virtex IIFPGA [31].


55

The Virtex II devices incorporate large amouts of 18 Kbit block memories, 

called Block SelectRAM or BlockRAM. Each Virtex II blockRAM is an 18 Kbit true 

dual-pot RAM with two independently controlled synchronous ports that access a 

common storage area. The blockRAM supports various configurations, including 

single- and dual-port RAM and various data/address aspect ratios. Supported memory 

configurations are shown in Table 5-1.

Table 5-1 Supported configuration of the block SelectRAM

Ib K x lb i t  2K x 9 bits
8Kx 2 bits IK x  18 bits
4K x 4 bits 512x36 bits

The Virtex II devices also incorporate a number of embedded multiplier blocks. 

These multiplier blocks are 18-bit by 18-bit 2’s complement signed multipliers. Some 

of the interconnection to the Switch Matrix is shared between the BlockRAM and the 

embedded multiplier. This sharing of the interconnection is optimized for an 18-bit­

wide BlockRAM feeding the multiplier and allows a fast implementation of a 

multiplier-accumulator (MAC) function which is commonly used in DSP applications.

The Virtex II family is comprised of 11 members, ranging from 40K to SM 

system gates. The FPGA we used for the thesis was the XC2V3000, which features 

3M system gates. Table 5-2 shows the specifications of the XC2V3000 FPGA.

Table 5-2 The specifications of the Virtex I I XC2V3000 FPGA

Device System
Gates

CLB
Slices

Multiplier
Blocks

SelectRAM
Blocks DCMs Max I/O 

pads
XC2V3000 3M 14336 96 96 8 720


56

AccelFPGA

FPGA programming can be done using a Hardware Descriptive Language (HDL) 

such as VHDL or Verilog. These HDLs are low level languages which have a very 

fine control of the actual hardware implementation. However, for the complex 

applications, the HDL coding process will be tedious and error-prone and requires 

costly debugging iterations. Furthermore, a lot of low-level work is repetitive and can 

be automated. To solve this problem, many researchers have focused on how to raise 

the level of abstraction to a general purpose programming language such as C/C++ 

[32] or Java [33]. Recently, a compiler has been developed that can take Matlab code 

and generate optimized hardware for a FPGA. The AccelFPGA Compiler V1.6 was 

used for this thesis.

Figure 5-3 shows the basic steps in the process of converting an algorithm in 

MATLAB into an FPGA implementation using AccelFPGA. The first step is to 

convert the floating-point model of the algorithm into a fixed-point model that meets 

the algorithm’s signal fidelity criterion. Next, the user adds compiler directives for 

AccelFPGA. These directives can specify the usage of FPGA’s embedded hardware 

resources such as the BlockRAM and the embedded multiplier blocks. The third step 

is the creation of the RTL model in VHDL or Verilog and the associated testbenches 

which are automatically created by the AccelFPGA compiler. The fourth step is to 

synthesize the VHDL or Verilog models using a logic synthesis tool. The gate- level 

netlist is then simulated to ensure functional correctness against the system 

specification. Finally the gate-level netlist is placed and routed by the place-and-route 

appropriate for the FPGAs used in the implementation.


57

One advantage of AccelFPGA is that bit-true simulations can be done in the 

Matlab environment. Testbenches are automatically generated along with the 

simulations. You can use these testbenches later in the hardware simulation and 

compare the results with the Matlab simulation. As a result, it is very easy to know 

whether your hardware implementation is correct or not.

Since Matlab is a high level language, the compiled implementation may not 

give as good performance as direct HDL coding. However, by using proper directives, 

you can specify the parallelism of your algorithm and maximize the usage of the on- 

chip hardware resources. Hence, you can get a reasonably good performance of your

implementation.

Matlab

AcceIFPGA

Synthesys tool

Matlab Design in floating 
point representation

Convert to fixed-point 
representation

Apply AcceIFPGA compiler 
directives

Logic simulation for bit-true 
verification

AcceIFPGA gen erates RTL 
m odels and testbench

Logic synthesis to map RTL 
m odels to FPGA primitives

Figure 5-3 Top-down design process using AccelFPGA. The versions of all the tools 
used in our FPGA implementation can be found in Table 5-5.


58

The Fixed Point Representation of the EM Algorithm

To use AccelFPGA for the hardware implementation, the first step is to convert 

the algorithm from a floating point into a fixed point representation. The main 

difference between these two representations is the limited precisions. For the FM 

implementation, the problem becomes whether the algorithm will still converge to the 

right result as in the floating point domain. We used the quantizer function in Matlab 

to change the algorithm into a fixed point format and tested the whole algorithm to 

make sure it gave the correct results.

The most important part of the fixed point implementation is to determine the 

word length and binary point of each variable in the algorithm. AccelFPGA can 

support up to 50-bit wide words. Although long word length can provide better 

precisions, more bits will be needed to store and process each variable. Thus, more 

resources on the FPGA will be required and the system performance will be reduced. 

To maximum the system performance, we should use as short a word length as 

possible while still maintaining enough precision for the algorithm. We chose to use 

18 bits for normal variables and 30 bits for accumulation results. Variables with 18 bit 

word length can be fitted into the BlockRAM blocks and are easy to feed into the 

embedded multiplier in the FPGA. Since the data dimension was high for our 

algorithm, it. was likely to generate a large number after many accumulations. Using 

30 bit word length prevented overflow problems for the accumulation results.

After deciding on the word length, the next step was to choose the binary point 

for each variable. This step actually determined the precision we got from the FPGA 

implementation. Since some parts of the algorithm have been transformed into the log


59

domain, the dynamic ranges of each variable are quite different. So with the same 

word length, each variable will need its own binary point to meet the precision 

requirement. This would be quite hard to implement in a FPGA if an HDL coding 

approach was used. Whenever a math operation occurs between variables with 

different binary points, a shift is needed to align these variables. A lot of work has to 

be done in the low level HDL programming to make sure every math operation has 

the binary point aligned correctly. But in AccelFPGA, the compiler will automatically 

align the binary point. This allows the binary point of each variable to be easily 

changed and the algorithm tested to make sure the fixed point implementation is 

working correctly. The newest version of AccelFPGA (V2.0) can automatically find 

the suitable word length and binary point for your algorithm. However, at the time the 

thesis was done, this feature was not available and all the binary points were tuned 

manually in AccelFPGA Y 1.6.

Using Lookup Tables

The structure of the FPGAs is optimized for high-speed fixed-point addition, 

subtraction and multiplication. However, a number of other math operations such as 

division or exponentiation have to be customized.

Using Taylor series is an option which can transform the complicated operation 

into simple additions and multiplications. However, for the FM algorithm, the 

dynamic ranges of these operations are quite large so that it will require a very high 

order Taylor series to obtain good approximations. Another way is to use a look up 

table (LUT). Knowing the dynamic range of these math operations, we can use a one- 

to-one look up table to approximate them. By dividing the dynamic range input into


60

substantially smaller intervals, this approach will gain enough precision for the 

algorithm to converge. The BlockRAM on the Virtex II FPGA can be used to store 

the content of the look up tables and their address bus can be used as the look up table 

inputs. The look up table operations can be done within one system clock cycle based 

on the BlockRAMs.

Nonlinear mapping and output interpolation can be used to improve the LUT 

performance. Here, we just used a single linear LUT approach, in which we divided 

the input data space evenly into small spaces and mapped every point in the input 

space to the output space. Two factors, input precision and the LUT size have to be 

determined for each LUT depending on the input dynamic range. These two factors 

have the following relationship.

LUT size = DynamicRanse Eq.5-1
Precision

Ideally, we want the dynamic range to be as large as possible while the precision 

as small as possible, which leads to an infinitely large LUT size. However, in most of 

the real world applications, the input dynamic range is limited to a certain range. The 

precision requirement may vary for different applications. For the EM algorithm, all 

we needed was to estimate the parameters for the GMM and be able to classify the 

neural spikes. As long as the EM algorithm converges to the right solution, the 

precision of each variable is not extremely important. Hence, a reasonable size of the 

LUT can be used for each math operation in the EM algorithm and can be 

implemented on the Virtex IIFPGA platform.


61

We built the LUTs using the BlockRAM in the Xilinx Virtex II FPGA. As 

described in section 5.2.2, the BlockRAM are 18 Kbit block memories. Each 18 Kbit 

block memory can be configured into various address/data aspect ratios as shown in 

Table 5-1. We chose the IK 18bit configuration for our LUT implementation. The 

ISbit output bit width gave enough precision for the LUT outputs. Furthermore, the 

output width matched the embedded multipliers’ input width which maximized the 

usage of the embedded multiplier resources.

After choosing the LUT size, which was 1024 (IK) in our case, we found the 

input precision by tracking the input dynamic range. Once the input dynamic range 

was determined, the precision was found by dividing the dynamic range using the 

LUT size. Then, we ran the training process of the EM algorithm to see with if a 

precision caused the EM algorithm to converge or not.

If problems occurred and the algorithm didn’t converge, we had to improve the 

input precision by increasing the LUT size. There are two ways to do this. One 

approach was to increase the LUT size by shortening the output bit width. With a 

shorter word width, the BlockRAM can hold more than 1024 elements in a single 

bank. But the decreased output precision may not meet the output precision 

requirement. Another way was to use more than one bank of block SelectRAM to 

form one single LUT. Theoretically, you can use as many BlockRAMs as is- available 

to build a LUT. However, additional resources and control circuit will be needed to 

linearly address all these memory blocks. These additional circuits will use more area 

of the FPGA and decrease the system performance. In our application, the 1024 LUT


62

size worked properly with all the operations so neither of the above methods were 

used.

An example of