**David M. McQueen and Charles S. Peskin**
Courant Institute of Mathematical Sciences
New York University
251 Mercer Street
New York, NY 10012 USA
mcqueen@cims.nyu.edu
peskin@cims.nyu.edu

**May 12, 1997**

This paper describes the parallel implementation of the immersed boundary method on a shared-memory machine such as the Cray C-90 computer. In this implementation, outer loops are parallelized and inner loops are vectorized. The sustained computation rates achieved are 0.258 Gflops with a single processor, 1.89 Gflops with 8 processors, and 2.50 Gflops with 16 processors. An application to the computer simulation of blood flow in the heart is presented.

Journal of Supercomputing, Vol. 11, Issue 3, pp. 213-236, 1997

SHARED-MEMORY PARALLEL VECTOR IMPLEMENTATION OF THE IMMERSED BOUNDARY METHOD FOR THE COMPUTATION OF BLOOD FLOW IN THE BEATING MAMMALIAN HEART

Introduction

Problems arising in biofluid dynamics typically involve the interaction of a viscous incompressible fluid with an immersed elastic boundary. Such problems can be solved by the immersed boundary method [1-5]. Our purpose here is to discuss the parallel implementation of this method on machines like the Cray C-90 computer, which is a shared-memory machine with a moderate number (e.g., 16) of vector processors. Results will be shown for the case of blood flow in the heart, which is the problem for which the immersed boundary method was originally developed.

Summary of the Immersed Boundary Method

Consider a viscous incompressible fluid containing an immersed system (a continuum) of elastic fibers. The fibers are pure force generators. They move at the local fluid velocity and apply force locally to the fluid in which they are immersed. Otherwise, they do not alter the properties of the fluid. Thus, we assume that the fiber-fluid composite material is incompressible and has the same constant density and constant viscosity as the fluid itself.

The equations of motion of such a system are as follows:

$$\rho\left(\frac{\partial\mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u}\right) + \nabla p = \mu\,\Delta\mathbf{u} + \mathbf{F} \tag{1}$$

$$\nabla\cdot\mathbf{u} = 0 \tag{2}$$

$$\mathbf{F}(\mathbf{x},t) = \int \mathbf{f}(q,r,s,t)\,\delta(\mathbf{x}-\mathbf{X}(q,r,s,t))\,dq\,dr\,ds \tag{3}$$

$$\frac{\partial\mathbf{X}}{\partial t}(q,r,s,t) = \mathbf{u}(\mathbf{X}(q,r,s,t),t) = \int \mathbf{u}(\mathbf{x},t)\,\delta(\mathbf{x}-\mathbf{X}(q,r,s,t))\,d\mathbf{x} \tag{4}$$

$$T = \sigma\left(\left|\frac{\partial\mathbf{X}}{\partial s}\right|; q,r,s,t\right) \tag{5}$$

$$\boldsymbol{\tau} = \frac{\partial\mathbf{X}/\partial s}{\left|\partial\mathbf{X}/\partial s\right|} \tag{6}$$

$$\mathbf{f} = \frac{\partial\left(T\boldsymbol{\tau}\right)}{\partial s} \tag{7}$$

Equations 1-2 are the Navier-Stokes equations of a viscous incompressible fluid. They are written in Eulerian form, in which the independent variables are the position **x** and the time *t*. The functions of these variables that appear in Eqs. 1-2 are the fluid velocity **u**, the fluid pressure *p*, and the Eulerian fiber force density **F**.

Equations 5-7 are the fiber equations. They are written in terms of the Lagrangian independent variables *q*,*r*,*s*,*t*, where fixed values of *q*,*r*,*s* designate a material point, fixed values of *q*,*r* designate a fiber, and *t* is the time. The functions of these independent variables are the fiber position **X**, the fiber tension *T* (defined so that $T\boldsymbol{\tau}\,dq\,dr$ is the force transmitted by the bundle of fibers *dq dr*), the unit tangent to the fibers $\boldsymbol{\tau}$, and the Lagrangian fiber force density **f**. Equation 5 is the generalized Hooke's law that gives the fiber stress *T* in terms of the fiber strain, which is determined by $\left|\partial\mathbf{X}/\partial s\right|$. Note that the stress-strain relationship allows for explicit position and time dependence. It is the time dependence, in particular, that allows the heart muscle to contract and relax. Equation 6 defines the unit tangent to the fibers. Equation 7 gives the Lagrangian form of the force density applied by the fibers to the fluid. For a derivation of the relationship $\mathbf{f} = \partial(T\boldsymbol{\tau})/\partial s$, see [1,4,5].

Equations 3-4 are the interaction equations. Each of them involves the three-dimensional Dirac delta function $\delta(\mathbf{x})$, which acts as the kernel of an integral transformation. These equations convert from Lagrangian to Eulerian variables and vice versa. Equation 3 represents the force of the fibers on the fluid as a continuous sum (integral) of point forces applied at the locations of the fiber points. Equation 4 is the no-slip condition of a viscous fluid. Here it serves as an equation of motion of the fibers by asserting that the fibers move at the local fluid velocity.

In the immersed boundary method, the fluid equations are discretized on a regular cubic lattice in the physical space of the position variable **x**, and the fiber equations are discretized on a rectangular lattice in the space of the Lagrangian parameters *q*,*r*,*s*. Let the mesh widths of these two lattices be denoted *h* and $\Delta q$, $\Delta r$, $\Delta s$, respectively.

On the fluid lattice, the forward, backward, and central divided difference operators (corresponding to derivatives) will be denoted $D^{+}$, $D^{-}$, and $D^{0}$, respectively. A subscript will indicate the Cartesian direction in which the difference is to be applied. The notation $\mathbf{D}^{0}$ will be used for the vector central difference operator $(D_1^0, D_2^0, D_3^0)$, which is analogous to $\nabla$.
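
Explicitly, with $\mathbf{e}_i$ denoting the unit vector in the $i$th coordinate direction and $\phi$ a grid function, these standard operators are

$$(D_i^{+}\phi)(\mathbf{x}) = \frac{\phi(\mathbf{x}+h\mathbf{e}_i)-\phi(\mathbf{x})}{h},\qquad (D_i^{-}\phi)(\mathbf{x}) = \frac{\phi(\mathbf{x})-\phi(\mathbf{x}-h\mathbf{e}_i)}{h},\qquad (D_i^{0}\phi)(\mathbf{x}) = \frac{\phi(\mathbf{x}+h\mathbf{e}_i)-\phi(\mathbf{x}-h\mathbf{e}_i)}{2h}.$$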

On the Lagrangian fiber lattice, the only difference operator we need is a central difference in the fiber direction, which we denote $D_s$. This has the definition

$$(D_s\phi)(q,r,s) = \frac{\phi(q,r,s+\Delta s/2)-\phi(q,r,s-\Delta s/2)}{\Delta s}$$

so it connects functions defined on staggered lattices. In particular, the functions *T* and $\boldsymbol{\tau}$ are associated with the links between fiber points, whereas the functions **X** and **f** are associated with the fiber points themselves.

Finally, let time proceed in steps of duration $\Delta t$, and let superscripts denote the timestep index. The discretized equations are as follows:

$$\rho\left(\frac{\mathbf{u}^{n+1}-\mathbf{u}^{n}}{\Delta t} + \sum_{i=1}^{3} u_i^{n}\, D_i^{\pm}\,\mathbf{u}^{n}\right) + \mathbf{D}^{0} p^{n+1} = \mu \sum_{i=1}^{3} D_i^{+} D_i^{-}\,\mathbf{u}^{n+1} + \mathbf{F}^{n} \tag{8}$$

$$\mathbf{D}^{0}\cdot\mathbf{u}^{n+1} = 0 \tag{9}$$

$$\mathbf{F}^{n}(\mathbf{x}) = \sum_{(q,r,s)} \mathbf{f}^{n}(q,r,s)\,\delta_h(\mathbf{x}-\mathbf{X}^{n}(q,r,s))\,\Delta q\,\Delta r\,\Delta s \tag{10}$$

$$\frac{\mathbf{X}^{n+1}(q,r,s)-\mathbf{X}^{n}(q,r,s)}{\Delta t} = \sum_{\mathbf{x}} \mathbf{u}^{n+1}(\mathbf{x})\,\delta_h(\mathbf{x}-\mathbf{X}^{n}(q,r,s))\,h^3 \tag{11}$$

$$T^{n} = \sigma\left(\left|D_s \mathbf{X}^{n}\right|; q,r,s,n\Delta t\right) \tag{12}$$

$$\boldsymbol{\tau}^{n} = \frac{D_s \mathbf{X}^{n}}{\left|D_s \mathbf{X}^{n}\right|} \tag{13}$$

$$\mathbf{f}^{n} = D_s\left(T^{n}\boldsymbol{\tau}^{n}\right) \tag{14}$$

In these equations, the following notations require further explanation. First, in Eq. 8, the notation $D_i^{\pm}$ refers to the ``upwind'' difference: the forward difference is used if the convection velocity $u_i^{n}$ is negative, and the backward difference if the convection velocity is positive. This is done for numerical stability. Second, in Eqs. 10-11, the notations $\sum_{(q,r,s)}$ and $\sum_{\mathbf{x}}$ refer to sums over the fiber and fluid lattices, respectively. Finally, in these same equations, the notation $\delta_h$ refers to an approximation to the three-dimensional Dirac delta function, depending on the mesh width *h*. For purposes of the present paper, it is only important to note that we construct this approximate delta function in such a way that $\delta_h(\mathbf{x}-\mathbf{X})$ is nonzero only for $\mathbf{x}$ in a $4\times4\times4$ array of lattice points surrounding the lattice cell that contains $\mathbf{X}$. The specific construction of $\delta_h$ is described in [4,5].

Each timestep of the immersed boundary method proceeds as follows. At the beginning of the timestep, the fiber configuration $\mathbf{X}^{n}$ and the fluid velocity $\mathbf{u}^{n}$ are known. Equations 12-14 are used to evaluate the fiber tension $T^{n}$, the fiber direction $\boldsymbol{\tau}^{n}$, and then the Lagrangian fiber force density $\mathbf{f}^{n}$. Equation 10 is then used to transfer the fiber force density from the fiber lattice to the fluid lattice. The result of this computation is the Eulerian fiber force density $\mathbf{F}^{n}$. Next, Eqs. 8-9 are solved simultaneously for the unknowns $\mathbf{u}^{n+1}$ and $p^{n+1}$. Despite the nonlinearity of the Navier-Stokes equations, the fluid equations that we solve at each timestep are linear, constant-coefficient difference equations. This is because the nonlinearity is expressed in terms of the velocity field $\mathbf{u}^{n}$, which is known from the result of the previous timestep. Once the new velocity field $\mathbf{u}^{n+1}$ has been found, Eq. 11 is used to interpolate this velocity field to the fiber points and to move them accordingly. Once this has been done, $\mathbf{X}^{n+1}$ and $\mathbf{u}^{n+1}$ are known, and the timestep is complete.
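
The sequence just described corresponds to the four components (`FIBER`, `PUSH`, `FLUID`, `MOVE`) that are timed separately in the performance section below. The skeleton that follows is our illustrative sketch of that correspondence, not the authors' actual driver; the subroutine names are hypothetical stand-ins, and the bodies are stubs so that the skeleton compiles as written.

```fortran
c Hypothetical skeleton of the timestep loop (names are illustrative).
      program ibstep
      integer n,nsteps
      parameter (nsteps=3)
      do n=1,nsteps
        call fiberf
        call push
        call fluid
        call move
      end do
      end
      subroutine fiberf
c Eqs. 12-14: evaluate T, tau, and the Lagrangian force density f.
      end
      subroutine push
c Eq. 10: spread f from the fiber lattice to the fluid lattice (F).
      end
      subroutine fluid
c Eqs. 8-9 (via Eqs. 15-17 below): upwind step, then FFT solve for u, p.
      end
      subroutine move
c Eq. 11: interpolate u to the fiber points and move them.
      end
```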

Parallel Vector Implementation: Fiber Equations

In this section we consider the computation of the Lagrangian fiber
force density **f**. Since the different fibers do not interact
with each other during this computation, there is a lot of opportunity
for parallelization. The overall fiber structure is intrinsically
three-dimensional, as indicated by the three Lagrangian parameters
*q*,*r*,*s*. Thus, the structure of our code will involve triply nested
loops. To obtain good efficiency the innermost loop will be
vectorized, and the outermost loop will be parallelized. This
strategy proceeds most simply if the parameter *s*, which varies along
a fiber, is incremented during the middle of the three loops,
which is not how one would instinctively write the program. The
following example involves the computation of **f** once *T* and
$\boldsymbol{\tau}$ have already been determined.

```fortran
      call zero(fL1,fLsize)
      call zero(fL2,fLsize)
      call zero(fL3,fLsize)
CMIC$ DO ALL
CMIC$1 SHARED(T,tau1,tau2,tau3,fL1,fL2,fL3,nq,ns,nr)
CMIC$2 PRIVATE(iq,is,ir,T1,T2,T3)
      do iq=1,nq
        do is=1,ns-1
          do ir=1,nr
            T1=T(is,ir,iq)*tau1(is,ir,iq)
            T2=T(is,ir,iq)*tau2(is,ir,iq)
            T3=T(is,ir,iq)*tau3(is,ir,iq)
            fL1(is  ,ir,iq)=fL1(is  ,ir,iq)+T1
            fL2(is  ,ir,iq)=fL2(is  ,ir,iq)+T2
            fL3(is  ,ir,iq)=fL3(is  ,ir,iq)+T3
            fL1(is+1,ir,iq)=fL1(is+1,ir,iq)-T1
            fL2(is+1,ir,iq)=fL2(is+1,ir,iq)-T2
            fL3(is+1,ir,iq)=fL3(is+1,ir,iq)-T3
          end do
        end do
      end do
```

In the above code, the force vector `(T1,T2,T3)` associated with each fiber link is computed and then added in, with the appropriate sign, to the resultant force on the points at the two ends of the link. Note that the above code would not vectorize if the inner loop were to run along a fiber. In fact, however, the inner loop runs across the fibers, so there is no difficulty about vectorization. By not assigning indices to the temporary variables `T1,T2,T3`, we tell the compiler that these values can be held in vector registers and need not be stored to memory.

Another important aspect of vectorization is to avoid memory bank conflicts. The Cray C-90 computer has 16 memory banks, which we can think of as being numbered 0...15, and the individual memory addresses are assigned to the banks according to the rule that the bank number is the address modulo 16. Thus, successive addresses are in different banks, which means that vectorized loops with stride 1 through address space proceed without conflict. There is a danger in the above code, however. The stride through the inner loop is not 1 but is rather the size of the first dimension of the various arrays, which would ordinarily be ns, the number of points on a fiber. If this happens to be a multiple of 16, the vectorized inner loop will be drastically slowed because it will only be using one of the memory banks. This is easily fixed, however, by padding the relevant dimension, i.e., by allocating more memory than is actually used. This affects only the dimension statements; the code given above is unchanged. From now on, we shall assume without further comment that memory bank conflicts have been avoided in this way.
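
As an illustration (our own, with representative sizes): if a model had `ns=64` points per fiber, a multiple of 16, then padding the first dimension to 65 makes the inner-loop stride odd, so successive iterations touch all 16 banks.

```fortran
c Hypothetical dimension statements with padding: the arrays are
c declared with first dimension 65, but loops run over is=1..64 only;
c the extra row is allocated and never referenced.
      integer nspad,nr,nq
      parameter (nspad=65,nr=32,nq=32)
      real T(nspad,nr,nq),tau1(nspad,nr,nq)
      real tau2(nspad,nr,nq),tau3(nspad,nr,nq)
```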

Parallelization of the outer loop is accomplished by the `CMIC$` compiler directives. The directive `CMIC$ DO ALL` tells the compiler that the next `do` loop (in our case the loop over `iq`) is to be parallelized. Each pass through the loop may be done by a different processor, if there are enough processors available for that to happen. If not, when a processor is done with a particular pass through the loop it declares itself available and is given another pass through the loop to do. Note that the programmer does not need to know in advance how many processors will be available (this may depend on other jobs that are running simultaneously), and also that the programmer does not need to make a specific plan as to which passes through the loop will be done by the different processors. Load balancing in this style of parallelization is automatic: If a processor is given a pass through the loop that involves only a little work, then it returns early and is given more work to do. Only when the parallel loop is almost done can the situation arise that a processor returns and then remains idle while waiting for the other processors to finish their tasks because there are no more passes through the loop left to do. This end-effect inefficiency can be kept small by making sure that there are a lot more passes through the loop than processors. In the example given above, load balancing is not really an issue, since each pass through the loop involves the same amount of work. Later, however, we shall encounter examples where this is not the case, so it is important to realize that load balancing will be achieved, to a good approximation, even when different passes through a parallel loop involve different amounts of work.

Note that the `CMIC$` directives also tell the compiler which variables are shared and which must be private to the individual processor. Shared variables are those that all of the processors can access simultaneously. If they are to be updated, it is up to the programmer to make sure that the results are not dependent on the order in which the shared variables happen to be accessed by the different processors. For example, in the code above, the shared arrays `fL1`, `fL2`, and `fL3` are updated, but the different processors work in different parts of these arrays, so there is no danger. Private variables should be thought of as local to the processor. They have no global meaning and cannot be accessed after the parallelized loop is complete.

Parallel Vector Implementation: Fluid Equations

We now consider the fluid solver defined by Eqs. 8-9. The computation of $\mathbf{u}^{n+1}$ and $p^{n+1}$ from $\mathbf{u}^{n}$ and $\mathbf{F}^{n}$ can be broken down into two steps. The first one, which we call the ``upwind'' step, involves evaluation of the vector field

$$\mathbf{w} = \mathbf{u}^{n} + \Delta t\left(\frac{1}{\rho}\,\mathbf{F}^{n} - \sum_{i=1}^{3} u_i^{n}\, D_i^{\pm}\,\mathbf{u}^{n}\right) \tag{15}$$

Once this has been done, $\mathbf{u}^{n+1}$ and $p^{n+1}$ are determined by solving the system

$$\rho\,\frac{\mathbf{u}^{n+1}-\mathbf{w}}{\Delta t} + \mathbf{D}^{0} p^{n+1} = \mu \sum_{i=1}^{3} D_i^{+} D_i^{-}\,\mathbf{u}^{n+1} \tag{16}$$

$$\mathbf{D}^{0}\cdot\mathbf{u}^{n+1} = 0 \tag{17}$$

The main complication concerning the parallel vector implementation of the upwind step is the ``if'' statement that is implicit in the use of upwind differencing. This is handled by the library function `CVMGP(A,B,C)`, which returns `A` if `C` is positive and `B` otherwise. This function, known as ``conditional vector merge (positive)'', can be used inside a vectorized loop without impeding vectorization.
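
On compilers that lack `CVMGP`, an equivalent can be written with the standard `merge` intrinsic. The following portable stand-in is our addition, not part of the original code, and follows the semantics as stated above:

```fortran
c Portable stand-in for the Cray CVMGP intrinsic: returns a where
c c is positive and b otherwise; compiles to a simple vector select.
      real function cvmgp(a,b,c)
      real a,b,c
      cvmgp = merge(a,b,c.gt.0.0)
      end
```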

The following code assumes that the values of **u** are known not only on the domain of interest but also on the neighboring planes given by `i,j,k = 0 or ng+1`. In the case of periodic boundary conditions, this can be established by first copying the data from the plane `i=1` into the plane `i=ng+1` and the data from the plane `i=ng` into the plane `i=0`, and similarly for `j` and `k`. We assume this has already been done.

Note that the following code uses ``program units'', in which the unit of length is *h*, the unit of time is $\Delta t$, and the unit of mass is $\rho h^3$. Thus, the constants *h*, $\Delta t$, and $\rho$ do not appear, and we save quite a few unnecessary multiplications and divisions.

```fortran
CMIC$ DO ALL
CMIC$1 SHARED(u1,u2,u3,FE1,FE2,FE3,w1,w2,w3,ng)
CMIC$2 PRIVATE(i,j,k)
      do k=1,ng
        do j=1,ng
          do i=1,ng
            w1(i,j,k) = u1(i,j,k) + FE1(i,j,k)
     1        -u1(i,j,k)*CVMGP(u1(i  ,j  ,k  )-u1(i-1,j  ,k  ),
     2                         u1(i+1,j  ,k  )-u1(i  ,j  ,k  ),
     3                         u1(i  ,j  ,k  ))
     4        -u2(i,j,k)*CVMGP(u1(i  ,j  ,k  )-u1(i  ,j-1,k  ),
     5                         u1(i  ,j+1,k  )-u1(i  ,j  ,k  ),
     6                         u2(i  ,j  ,k  ))
     7        -u3(i,j,k)*CVMGP(u1(i  ,j  ,k  )-u1(i  ,j  ,k-1),
     8                         u1(i  ,j  ,k+1)-u1(i  ,j  ,k  ),
     9                         u3(i  ,j  ,k  ))
            w2(i,j,k) = u2(i,j,k) + FE2(i,j,k)
     1        -u1(i,j,k)*CVMGP(u2(i  ,j  ,k  )-u2(i-1,j  ,k  ),
     2                         u2(i+1,j  ,k  )-u2(i  ,j  ,k  ),
     3                         u1(i  ,j  ,k  ))
     4        -u2(i,j,k)*CVMGP(u2(i  ,j  ,k  )-u2(i  ,j-1,k  ),
     5                         u2(i  ,j+1,k  )-u2(i  ,j  ,k  ),
     6                         u2(i  ,j  ,k  ))
     7        -u3(i,j,k)*CVMGP(u2(i  ,j  ,k  )-u2(i  ,j  ,k-1),
     8                         u2(i  ,j  ,k+1)-u2(i  ,j  ,k  ),
     9                         u3(i  ,j  ,k  ))
            w3(i,j,k) = u3(i,j,k) + FE3(i,j,k)
     1        -u1(i,j,k)*CVMGP(u3(i  ,j  ,k  )-u3(i-1,j  ,k  ),
     2                         u3(i+1,j  ,k  )-u3(i  ,j  ,k  ),
     3                         u1(i  ,j  ,k  ))
     4        -u2(i,j,k)*CVMGP(u3(i  ,j  ,k  )-u3(i  ,j-1,k  ),
     5                         u3(i  ,j+1,k  )-u3(i  ,j  ,k  ),
     6                         u2(i  ,j  ,k  ))
     7        -u3(i,j,k)*CVMGP(u3(i  ,j  ,k  )-u3(i  ,j  ,k-1),
     8                         u3(i  ,j  ,k+1)-u3(i  ,j  ,k  ),
     9                         u3(i  ,j  ,k  ))
          end do
        end do
      end do
```

Thus, the upwind computation is vectorized over the rows of the computational lattice and it is parallelized over the planes of that lattice. When each processor is given a plane to work on, the task that it has to perform involves `ng` vectorized computations, each of size `ng`.

Once the upwind computation has been performed and $\mathbf{w}$ is known, it is time to solve the system of Eqs. 16-17 for $\mathbf{u}^{n+1}$ and $p^{n+1}$. As remarked above, these are linear difference equations with constant coefficients, and they may be conveniently solved with the help of the Fast Fourier Transform (FFT) algorithm. The three-dimensional FFT that we use was kindly provided by Koushik Ghosh [6], then of Cray Research. Like the rest of the code that we have already described, it uses vectorization of inner loops and parallelization of outer loops to obtain high speed.
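
To see why the FFT applies, note that on a periodic grid of $N$ points per direction, the discrete Fourier transform diagonalizes the constant-coefficient difference operators; this standard computation (our elaboration, not spelled out in the original) gives

$$\widehat{D_j^{0}}(\mathbf{k}) = \frac{i}{h}\sin\left(\frac{2\pi k_j}{N}\right), \qquad \widehat{D_j^{+}D_j^{-}}(\mathbf{k}) = -\frac{4}{h^2}\sin^{2}\left(\frac{\pi k_j}{N}\right)$$

so that, for each wavenumber $\mathbf{k}$, Eqs. 16-17 reduce to a small linear system for the three components of $\hat{\mathbf{u}}^{n+1}(\mathbf{k})$ and $\hat{p}^{n+1}(\mathbf{k})$, which is solved algebraically; inverse FFTs then recover $\mathbf{u}^{n+1}$ and $p^{n+1}$.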

Parallel Vector Implementation: Interaction Equations

This is the hard part! We now consider the implementation of Eqs. 10-11 in which the Lagrangian (fiber) and Eulerian (fluid) variables interact. It is important to note that this interaction is local: The fibers move at the local fluid velocity (Eq. 11) and exert forces locally on the fluid (Eq. 10). (Because the fluid is incompressible, these forces have instantaneous nonlocal effects that are mediated by the pressure field, but that occurs in the fluid equations rather than in the interaction equations per se.)

The locality of interaction between the fibers and the fluid is expressed in the immersed boundary method by the bounded support of the function $\delta_h$. This function has the property that $\delta_h(\mathbf{x}) = 0$ for $|x_i| \ge 2h$, $i = 1,2,3$. (For further details concerning $\delta_h$, see [4,5].) Thus $\delta_h(\mathbf{x}-\mathbf{X})$ is nonzero only for $\mathbf{x}$ in the interior of a cube of edge 4*h* centered on the point $\mathbf{X}$ and aligned with the coordinate axes.

Given the coordinates of a fiber point, it is therefore easy to identify the points of the fluid lattice with which it interacts. This suggests that we implement both Eq. 10 and Eq. 11 by looping over the fiber points in some arbitrary order, and for each fiber point making use of just the sub-array of the Eulerian fluid mesh that is ``within range'' of that fiber point (the range being established by the support of the function $\delta_h$).

The implementation of this strategy on a parallel vector computer faces two obstacles. The first is that the Eulerian (fluid) lattice has to be accessed in an unstructured way: there is no constant stride through memory. Note that even the fluid mesh points associated with a given fiber point are not stored in contiguous memory (recall the manner in which multidimensional arrays are mapped into the one-dimensional address space of a computer). Then there is even less structure when we move from one fiber point to another. The lack of constant stride impedes vectorization.

The second obstacle affects both parallelization and vectorization. Note that Eq. 10 gives the Eulerian force density at a given Eulerian lattice point as a sum of contributions from a collection of fiber points, and that Eq. 11 gives the displacement of a fiber point as a sum of contributions from a collection of Eulerian mesh velocities. The computation of a sum involves a loop like the following:

```fortran
      sum=0
      do i=1,nterms
        sum=sum+term(i)
      end do
```

As is well known, this loop does not produce correct results when vectorized or parallelized. What happens is this: First the value `0` that is stored in the memory location assigned to `sum` will be fetched from memory `nterms` times. Then each of the values of `term` will be fetched and added to each of these `nterms` zeros. Finally, the `nterms` results will be stored in the memory location assigned to `sum`, and the alleged answer will be whichever one was stored there last, which is wrong no matter which one it is! Thus we cannot vectorize or parallelize the computation of $\mathbf{F}^{n}$ over the fiber points, nor can we vectorize or parallelize the computation of $\mathbf{X}^{n+1}$ over the points of the fluid lattice.
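
The standard escape, used repeatedly below, is to vectorize over many independent sums at once rather than over the terms of a single sum. A minimal sketch (our illustration, with hypothetical array names):

```fortran
c Computing nsums independent sums at once. The inner loop runs over
c the different sums (index j), so each vector element accumulates a
c different sum and no two lanes update the same memory location.
      subroutine manysums(nterms,nsums,term,sums)
      integer nterms,nsums,i,j
      real term(nsums,nterms),sums(nsums)
      do j=1,nsums
        sums(j)=0.0
      end do
      do i=1,nterms
        do j=1,nsums
          sums(j)=sums(j)+term(j,i)
        end do
      end do
      end
```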

The strategy that we have adopted to overcome these difficulties is based on an organization of the data into columns. Let a cell of the Eulerian (fluid) mesh be a cube of edge *h* with mesh points at its corners. Consider the set of all fiber points within a given column of such cells. The part of the fluid lattice with which the fiber points of such a column interact is a sub-array of dimension `(4,4,ng)`. The fiber points of the column lie in the central column of this sub-array. If we consider two different columns constructed in this way, their `(4,4,ng)` sub-arrays may overlap. But they will not overlap if their locations differ by at least 4*h* in either the $x_1$ or in the $x_2$ coordinate direction. Thus, it is advantageous to partition the columns into 16 groups, defined as follows: Let `(izero,jzero)` be indices that define a group, where `izero` and `jzero` can each take on the values `1,2,3,4`. Then, within a group, let the individual columns be identified by the indices `(ii,jj)`, where `ii=izero,izero+4,izero+8,...` and `jj=jzero,jzero+4,jzero+8,...`. There is then no overlap between the `(4,4,ng)` sub-arrays of the fluid lattice associated with the different columns belonging to one particular group, see Figure 1. This makes it possible to parallelize over the different columns within a group, though the processing of one group must be completed before the processing of the next one may begin. On a $128^3$ fluid grid, which is our typical case, there are $(128/4)^2 = 1024$ columns in a group, so there is plenty of parallel work to do. (Recall that the Cray C-90 computer has 16 processors.) This is good from the standpoint of load balancing: Different columns contain different numbers of fiber points (some columns may even be empty), so some processors will return from processing a column earlier than others. Those that do so will immediately be given another column to process, so the overall load will remain balanced despite the heterogeneity in the amount of work to be done per column.
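
A point's column and group follow directly from its coordinates. The helper below is our illustrative addition (the production code computes these quantities inline); coordinates are assumed to lie in `[1,ng+1)`, as stated later in the text.

```fortran
c Column (ii,jj) containing a Lagrangian point, and the group
c (izero,jzero) to which that column belongs. Columns with
c ii=izero,izero+4,izero+8,... and jj=jzero,jzero+4,... form one group.
      subroutine colgrp(x1n,x2n,ii,jj,izero,jzero)
      real x1n,x2n
      integer ii,jj,izero,jzero
      ii=int(x1n)
      jj=int(x2n)
      izero=mod(ii-1,4)+1
      jzero=mod(jj-1,4)+1
      end
```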

Besides making parallelization possible, the strategy that we have just outlined is also advantageous from the standpoint of vectorization. The library function `gather` can be used to collect Lagrangian position and force density data from the fiber points in a column, and to organize these data in one-dimensional arrays. Similarly, `gather` can be used to collect Eulerian velocity and force density data from the associated `(4,4,ng)` sub-array of the fluid lattice, and to store those data in one-dimensional arrays as well. Once the data have been collected and organized in this way, vectorized computations can be performed on the data, and then the library function `scatter` can be used to return results to their proper locations in memory. Finally, we have to take care that when sums are being computed the vectorization is over the multiple sums that have to be done, rather than over the multiple terms within a sum.
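
For readers without access to the Cray scientific library, the semantics of `gather` and `scatter`, as inferred from how they are invoked in the code below, can be expressed in a few lines (a portable sketch, not the library source):

```fortran
c gather(n,a,b,indx):  a(i) = b(indx(i)) for i = 1..n
c (collects scattered elements of b into the contiguous array a)
      subroutine gather(n,a,b,indx)
      integer n,i,indx(n)
      real a(n),b(*)
      do i=1,n
        a(i)=b(indx(i))
      end do
      end

c scatter(n,a,indx,b):  a(indx(i)) = b(i) for i = 1..n
c (returns contiguous results in b to their home locations in a)
      subroutine scatter(n,a,indx,b)
      integer n,i,indx(n)
      real a(*),b(n)
      do i=1,n
        a(indx(i))=b(i)
      end do
      end
```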

At this point the reader may protest that the strategy outlined above will founder on the task of locating all of the fiber points in a given column. How can this be done efficiently, without searching through all of the fiber points? The answer is to use a linked-list data structure in which an array of pointers `firstn(ii,jj)` gives the index of the ``first'' (actually, the order is of no importance) fiber point in column `(ii,jj)`, and then the one-dimensional array `nextn` is used to find the index of the next point, and so on until a `0` is found, indicating that there are no more points in the column. Note that a single array `nextn` is adequate to hold the information about all of the columns. The linked-list data structure that we have just described is depicted in Figure 1.

But how can the linked-list data structure be efficiently computed? Since the fiber points move, this has to be done at every timestep, and unless we can do it in parallel, there will be a substantial serial bottleneck. Fortunately, we can impose a restriction on how far the fiber points can move in a single timestep. Let $U$ be an upper bound on any component of the fluid velocity, and make the restriction that $U\,\Delta t \le h$. This restriction (actually, a slightly stronger one than this) is needed in any case for the stability of the upwind scheme for the fluid dynamics. Then fiber points in any particular column on a given timestep can, at worst, be found in a neighboring column on the next timestep (including diagonal neighbors). This means that the linked lists for all of the columns in a group, as defined above, can be updated in parallel, since the columns in a group are far enough apart that even their neighborhoods are disjoint, where the neighborhood of a column is defined as the union of that column and its immediate neighbors (including diagonal neighbors), see Figure 1.

The code to update the linked-list data structure that defines the partition of fiber points into columns will now be given. (We omit the initialization of this data structure, since that does not need to be done efficiently, as it is only done once.)

```fortran
c Code to update linked-list data structure
c that partitions fiber points into columns
c
c Loop over the 16 groups of columns:
      do izero=1,4
      do jzero=1,4
c Now set up a parallel loop over all of the columns in a group:
CMIC$ DO ALL
CMIC$1 SHARED(firstn,nextn,number,izero,jzero,x1,x2,ng)
CMIC$2 PRIVATE(ii,ij,in,jj,jn,n,nextnold,nprev)
      do ij=0,(ng/4)**2-1
        jj=jzero+4*(ij/(ng/4))
        ii=izero+4*mod(ij,ng/4)
c Consider the column whose coordinates are (ii,jj):
c Initialize a trailing pointer:
        nprev=0
        n=firstn(ii,jj)
c n is the index of the first Lagrangian point
c that should be in the column (ii,jj).
c (If n=0, there are no such points.)
        do while(n.ne.0)
          jn=int(x2(n))
          in=int(x1(n))
c (in,jn) is the column that actually contains the Lagrangian point n.
c Test to see whether it is different from (ii,jj):
          if ((in.ne.ii).or.(jn.ne.jj)) then
c Remember the value of the pointer to the next Lagrangian point
c that should be in column (ii,jj):
            nextnold=nextn(n)
c Insert the Lagrangian point n into the linked list of column (in,jn):
            nextn(n)=firstn(in,jn)
            firstn(in,jn)=n
            number(in,jn)=number(in,jn)+1
c Delete the Lagrangian point n from the linked list of column (ii,jj):
            if (nprev.eq.0) then
              firstn(ii,jj)=nextnold
            else
              nextn(nprev)=nextnold
            end if
            number(ii,jj)=number(ii,jj)-1
c Locate the next Lagrangian point that should be in the column (ii,jj),
c but leave the trailing pointer unchanged,
c since a Lagrangian point has been deleted
c from the linked list of column (ii,jj).
            n=nextnold
          else
c Lagrangian point n belongs in column (ii,jj).
c Update the trailing pointer:
            nprev=n
c and then locate the next Lagrangian point
c that should be in the column (ii,jj):
            n=nextn(n)
          end if
        end do
      end do
c The parallel loop over a group of columns ends here.
c When the whole group is done, move on to the next group of columns.
      end do
      end do
```

In the foregoing code, all of the processors may be simultaneously updating the shared array `nextn`, which holds the linked-list data structure. Moreover, the processors skip around that array seemingly with abandon, and the whole procedure seems very dangerous. We know, however, that these parallel processes cannot conflict, since the columns involved are sufficiently well separated, as described above and depicted in Figure 1.

We now consider how the linked-list data structure is used in the computation of the Eulerian force density according to Eq. 10. The following code accomplishes this task:

```fortran
c This code uses the Lagrangian force density fL1,fL2,fL3
c to compute the Eulerian force density FE1,FE2,FE3.
c The Lagrangian point coordinates are stored in the arrays x1,x2,x3.
c These points have been sorted into columns.
c This columnar organization is stored in the shared linked list nextn,
c which is accessed via the pointers contained in firstn.
c Note that here the Lagrangian coordinates and force density components
c are stored in one-dimensional arrays.
c
c Initialize the Eulerian force density arrays to zero:
      call zero(FE1,FEsize)
      call zero(FE2,FEsize)
      call zero(FE3,FEsize)
c Set up a loop over 16 different groups of columns.
c Parallelization will be done WITHIN each of these groups.
      do izero=1,4
      do jzero=1,4
CMIC$ DO ALL
CMIC$1 SHARED(izero,jzero,FE1,FE2,FE3,firstn,nextn,number,nptmax,ng,
CMIC$2        fL1,fL2,fL3,x1,x2,x3)
CMIC$3 PRIVATE(ijcount,ii,jj,L,i,j,k,i3D,j3D,k3D,indc3D,delta,
CMIC$4         FEc1,FEc2,FEc3,numrem,npoints,npt,indx,xc1,xc2,xc3,
CMIC$5         fLc1,fLc2,fLc3,fi,fj,fk,m,mzero)
c Begin parallel loop over the columns in a group:
      do ijcount=0,(ng/4)**2-1
c Find the coordinates (ii,jj) of the current column.
c Columns within a group are 4 meshwidths apart in each coordinate direction.
        jj=jzero+4*(ijcount/(ng/4))
        ii=izero+4*mod(ijcount,ng/4)
c Locate the first Lagrangian point in column (ii,jj):
        L=firstn(ii,jj)
c Proceed with the column only if there is at least one such point:
        if(L.gt.0)then
c Consider the 4x4 column of the Eulerian mesh that is influenced
c by the Lagrangian points in the column (ii,jj).
c Let the coordinates of points in this column be denoted (i3D,j3D,k3D),
c where i3D=ii-1...ii+2, j3D=jj-1...jj+2, k3D=1...ng.
c The inline function index3D converts (i3D,j3D,k3D)
c to a one-dimensional address, which is stored in the array indc3D
c for future gather and scatter operations. The details of this
c function depend on the dimensions of the Eulerian arrays.
c The order of storage within indc3D is such that the 16 Eulerian points in a
c horizontal plane (x3=constant) are stored in contiguous memory locations.
          do i=0,3
            i3D=ii+i-1
            do j=0,3
              j3D=jj+j-1
              do k3D=1,ng
                indc3D(1+i+4*j+16*(k3D-1))=index3D(i3D,j3D,k3D)
              end do
            end do
          end do
c Now use indc3D to direct the gathering of the force data for one column:
          call gather(16*ng,FEc1,FE1,indc3D)
          call gather(16*ng,FEc2,FE2,indc3D)
          call gather(16*ng,FEc3,FE3,indc3D)
c Consider the Lagrangian points of the column
c in bunches of size at most nptmax.
c (This is done to limit the amount of storage that will be needed.)
          numrem=number(ii,jj)
          do while (numrem.gt.0)
            npoints=min(numrem,nptmax)
c Use the linked list to identify the next bunch of Lagrangian points
c to be considered. Store their indices in the array indx
c which will be used to gather the data concerning these points.
            do npt=1,npoints
              indx(npt)=L
              L=nextn(L)
            end do
c Gather the Lagrangian coordinate and force density data:
            call gather(npoints,xc1,x1,indx)
            call gather(npoints,xc2,x2,indx)
            call gather(npoints,xc3,x3,indx)
            call gather(npoints,fLc1,fL1,indx)
            call gather(npoints,fLc2,fL2,indx)
            call gather(npoints,fLc3,fL3,indx)
c For each gathered Lagrangian point,
c compute the 4x4x4=64 delta function weights
c and store these in the array delta.
c Note that deltaf is an inline function,
c and that vectorization is over the Lagrangian points in the bunch
c that is being considered. The 64 delta function weights of a given
c Lagrangian point are stored in contiguous memory.
            do j=0,3
              fj=float(jj+j-1)
              do i=0,3
                fi=float(ii+i-1)
                do k=0,3
                  m=1+i+4*j+16*k
                  do npt=1,npoints
                    fk=float(int(xc3(npt))+k-1)
                    delta(m,npt)=deltaf(xc1(npt)-fi,xc2(npt)-fj,xc3(npt)-fk)
                  end do
                end do
              end do
            end do
c Now use the delta function weights and the Lagrangian force density
c to update the Eulerian force density for the 4x4 column of Eulerian
c mesh points surrounding the column (ii,jj) of Lagrangian points.
c Here the vectorization is over the 64 Eulerian points associated
c with a given Lagrangian point. These 64 Eulerian points are
c stored in locations mzero+1...mzero+64 of the arrays FEc1,FEc2,FEc3.
            do npt=1,npoints
              mzero=16*(int(xc3(npt))-1)
              do m=1,64
                FEc1(m+mzero)=FEc1(m+mzero)+delta(m,npt)*fLc1(npt)
                FEc2(m+mzero)=FEc2(m+mzero)+delta(m,npt)*fLc2(npt)
                FEc3(m+mzero)=FEc3(m+mzero)+delta(m,npt)*fLc3(npt)
              end do
            end do
c Scatter the results for this column back to the 3D Eulerian mesh.
c Although different processors are doing this at the same time,
c their operations cannot interfere, since the 4x4 columns considered
c within this parallel loop do not overlap.
            call scatter(16*ng,FE1,indc3D,FEc1)
            call scatter(16*ng,FE2,indc3D,FEc2)
            call scatter(16*ng,FE3,indc3D,FEc3)
c Calculate the number of points that remain to be done in column (ii,jj).
c If the result is zero, the do while loop will terminate.
            numrem=numrem-npoints
          end do
        end if
      end do
c The parallel loop over columns in a group ends here, so this
c is a point at which the processors must synchronize. The two
c outer loops are over the 16 different groups of columns.
c Parallelization is over the (ng/4)**2 columns within a group.
      end do
      end do
```

The code that implements Eq. 11 for the computation of $\mathbf{X}^{n+1}$ is similar in structure to the foregoing and will not be given in detail here. The main difference appears near the end, after the delta-function weights have been computed and stored in the array `delta(m,npt)`. Then we proceed as follows to find the displacements of the fiber points:

```fortran
c Calculate separately the contribution to the displacement of
c each Lagrangian point from each of the Eulerian points with
c which it interacts:
      do npt=1,npoints
        mzero=16*(int(xc3(npt))-1)
        do m=1,64
          disp1(m,npt)=uc1(mzero+m)*delta(m,npt)
          disp2(m,npt)=uc2(mzero+m)*delta(m,npt)
          disp3(m,npt)=uc3(mzero+m)*delta(m,npt)
        end do
      end do
c Now combine the contributions computed above
c and move the Lagrangian points accordingly:
      do m=1,64
        do npt=1,npoints
          xc1(npt)=xc1(npt)+disp1(m,npt)
          xc2(npt)=xc2(npt)+disp2(m,npt)
          xc3(npt)=xc3(npt)+disp3(m,npt)
        end do
      end do
```

In the foregoing code, note the reversal of the order of the loops. The first pair of loops is vectorized over the 64 Eulerian points with which a Lagrangian point interacts, but the second pair of loops is vectorized over the different Lagrangian points. In both cases, it has to be done that way. In the first pair of loops, constant stride is achieved only because the vectorization is over `m` rather than `npt`. The second pair of loops illustrates the solution to the problem of vectorizing a sum, as discussed above. We obtain vectorization while avoiding incorrect results by vectorizing over the many sums that have to be computed, rather than over the terms of each individual sum.

For clarity, we have simplified the discussion of the interaction code by eliminating the technicalities associated with the edges of the domain. The code given above will work as written if all Lagrangian point coordinates lie within the half-open interval `[1,ng+1)`, and if the Eulerian mesh has dimensions that include `(0:ng+2,0:ng+2,0:ng+2)`. In practice, we use periodic boundary conditions with period `ng` in all three coordinate directions. This requires the following modifications and additions to the code as presented above: First, whenever any coordinate of a Lagrangian point is used, that coordinate must be replaced by its periodic image within the interval `[1,ng+1)`. Second, before the Eulerian velocity field can be used, its periodicity must be assured by copying plane `1` onto plane `ng+1`, plane `2` onto plane `ng+2`, and plane `ng` onto plane `0`, in all three coordinate directions. Third, after the Eulerian force density has been computed, it must be ``folded'' into itself by adding the data on plane `ng+1` into the data on plane `1`, the data on plane `ng+2` into the data on plane `2`, and the data on plane `0` into the data on plane `ng`, again in all three coordinate directions. These plane-copying or folding operations must be done sequentially over the three coordinate directions to get the edges and corners right, but there is plenty of opportunity for parallelization and vectorization within each copy or fold. Since each plane has two coordinate directions, we can parallelize over one and vectorize over the other.
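
As an illustration, a sketch of our own in the spirit of the text, here is the fold of one force component in the $x_1$ direction only (FE2 and FE3 would be treated identically, and analogous loops handle the $x_2$ and $x_3$ directions in sequence):

```fortran
c Fold the Eulerian force density in the x1-direction: add the
c periodic-image planes i=ng+1, ng+2, and 0 into planes 1, 2, and ng.
c The j,k ranges cover the ghost planes too, so that the subsequent
c folds in the other directions pick up the edges and corners.
c Parallelization is over k (outer loop); vectorization is over j.
CMIC$ DO ALL
CMIC$1 SHARED(FE1,ng)
CMIC$2 PRIVATE(j,k)
      do k=0,ng+2
        do j=0,ng+2
          FE1(1 ,j,k)=FE1(1 ,j,k)+FE1(ng+1,j,k)
          FE1(2 ,j,k)=FE1(2 ,j,k)+FE1(ng+2,j,k)
          FE1(ng,j,k)=FE1(ng,j,k)+FE1(0   ,j,k)
        end do
      end do
```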

Performance Results

The performance results are summarized by the following table, which was obtained by using the Cray C-90 computer at the Pittsburgh Supercomputing Center running in dedicated mode (no competition with other users). The fluid lattice in this computation was $128^3$ and the number of fiber points was roughly 500,000. The first column gives the number of processors employed by the computation. The next four columns give the elapsed time in seconds for one timestep, broken down into the four major components of a timestep. The component denoted `FLUID` refers to the integration of the Navier-Stokes equations, the component denoted `FIBER` refers to the computation of the fiber force, the component denoted `PUSH` is the interaction routine in which the Lagrangian fiber force is converted to an Eulerian force density on the fluid lattice, and the component denoted `MOVE` is the interaction routine in which the fluid velocity is interpolated and the fiber points are moved. The sixth column, labeled `GFLOPS`, is the computation rate in Gflops (billions of floating point operations per second). This was obtained by dividing the number of floating point operations in a timestep by the total elapsed time for a timestep, in seconds. Finally, the last column gives the speedup in comparison to the single-processor case.

TABLE 1: PARALLEL VECTOR PERFORMANCE (elapsed time in seconds)

| #CPUs | FLUID | FIBER  | PUSH  | MOVE  | GFLOPS | SPEEDUP |
|------:|------:|-------:|------:|------:|-------:|--------:|
| 1     | 2.71  | 0.107  | 2.64  | 2.75  | 0.258  | 1.0     |
| 2     | 1.37  | 0.0574 | 1.34  | 1.40  | 0.508  | 2.0     |
| 4     | 0.703 | 0.0325 | 0.677 | 0.704 | 1.00   | 3.9     |
| 8     | 0.383 | 0.0198 | 0.356 | 0.365 | 1.89   | 7.3     |
| 16    | 0.399 | 0.0149 | 0.234 | 0.199 | 2.50   | 9.7     |
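
The table is internally consistent; for example, the 16-processor speedup follows from the component times (our arithmetic, not stated in the original):

$$\text{speedup}_{16} = \frac{2.71 + 0.107 + 2.64 + 2.75}{0.399 + 0.0149 + 0.234 + 0.199} = \frac{8.21}{0.85} \approx 9.7$$

and any row implies roughly $0.258\ \text{Gflops} \times 8.21\ \text{s} \approx 2.1\times10^{9}$ floating point operations per timestep.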

The noteworthy features here are as follows. First, the single-processor computation rate is about $2.6\times10^{8}$ floating point operations per second, which is about 1/4 the theoretical maximum of a single Cray C-90 processor. Second, the strictly Lagrangian part of the computation, i.e., `FIBER`, takes a negligible amount of time, but the other three parts (`FLUID`, `PUSH`, and `MOVE`) all take comparable amounts of time, and this remains true at all processor numbers. Thus it appears that the parallelization is equally effective in the different parts of the code. (If one part contained a serial bottleneck, its time would have started to dominate as the number of processors was increased.) Finally, note that the speedup is near-optimal (twice the speed when twice as many processors are employed) in going from 1 to 2, from 2 to 4, and from 4 to 8 processors. In the jump from 8 to 16 processors, however, the speedup is considerably less than a factor of 2, indicating that some inefficiencies have begun to creep in. These are most extreme in the `FLUID` part of the code, which is slightly slower with 16 processors than with 8.

Based on these results, it seems reasonable to specify 8 processors as the upper limit that should be used on this particular code (though if one is in a real hurry, 16 will do the job significantly faster at the price of some inefficiency). Fortunately, the code occupies only about 50 MWords (400 MBytes) of central memory, which is much less than half of the machine, so the remaining 8 processors can still be busy servicing other users, or even another version of the same code running a different case.

Results Concerning the Heart

The methodology described above has been applied to the computation of blood flow in the beating mammalian heart. Results are shown in Figures 2-4. The parameters used to obtain these results are those of the human heart. (This includes the viscosity, which in our earlier work had to be artificially elevated to achieve numerical stability.)

The construction of the heart model is described in [5]. It includes left and right atria and ventricles; mitral, aortic, tricuspid, and pulmonic valves; ascending aorta and main pulmonary artery; superior and inferior venae cavae; and the four pulmonary veins. These simulated tissues are all composed of elastic fibers, some of which are contractile, to simulate cardiac muscle. Contraction and relaxation of the contractile fibers are simulated by appropriately timed changes in elastic parameters such as stiffness and rest length. The arteries and veins of the model have blind ends but are equipped with sources and sinks that establish appropriate pressure loads for the model heart.

In Figures 2-4, fluid flow is depicted in terms of streaklines that
show the trajectories of fluid particles over the recent past, relative
to the time of the frame that is being shown. The particles that generate
the streaklines are introduced into the computation at *t*=0 and are
moved at the local fluid velocity by the same algorithm that is used
to move the fiber points. The fluid markers differ from the fiber points
only in this: that the fluid markers do not exert any force on the fluid.

Figure 2 shows ventricular diastole (relaxation) and ventricular systole (contraction), each from two different viewpoints. One of these viewpoints is such that the interventricular septum is seen on edge. In this view both the left and right ventricles can be seen. The left ventricle contains red fluid markers to indicate oxygenated blood, and the right ventricle contains blue fluid markers to indicate partially deoxygenated blood. In the second viewpoint, the heart is rotated so that the right ventricle appears in front, obscuring the left ventricle. On the left side of the heart, the diastolic frame shows filling through the mitral valve. A vortex ring that forms in the left ventricle during this process is also visible. On the right side of the heart, the diastolic frame shows two converging streams from the superior and inferior venae cavae merging together to fill the right ventricle. Later in diastole (not shown), these streams form a prominent vortex that swirls around the right ventricular cavity. On both sides of the heart, the systolic frame shows ejection through the appropriate artery, oxygenated blood on the left side of the heart entering the aorta, and deoxygenated blood on the right side entering the main pulmonary artery.

Detail of the left-sided valves is shown in Figures 3-4. In diastole (Figure 3) the inflow (mitral) valve is open and the outflow (aortic) valve is closed. In systole (Figure 4), the state of the valves reverses, so that the inflow (mitral) valve is closed but the outflow (aortic) valve is open. The mitral valve is supported by chordae tendineae that insert into two papillary muscles, each of which is an outgrowth of the left ventricular wall. The aortic valve, by contrast, is self-supporting. The mathematical model that we use to define the fiber architecture of the aortic valve is described in [7].

The most serious problem with the version of the heart model shown here is that it suffers from substantial aortic regurgitation. During systole, while forward flow is proceeding, the fibers near the edge of the aortic leaflets twist around each other to such an extent that the leaflets become effectively smaller. During the subsequent diastole (not shown), these smaller leaflets are too small to meet properly. A hole develops through which an intense jet of blood pours backwards from the high-pressure aorta into the low-pressure (during diastole) left ventricle, thus forming a vortex with the opposite polarity to the one that forms normally during diastole. Our current work, therefore, is directed towards eliminating this malfunction of the aortic valve so that the model heart will function in a healthy manner.

Once the aortic regurgitation has been ``cured'', we plan to evaluate the model by comparison with the results of imaging methods that can now be used for non-invasive measurement of the cardiac flow pattern, e.g., color-Doppler ultrasound, and magnetic resonance imaging. We then plan to use the model heart as a computer test chamber for the design of prosthetic cardiac valves, as a tool for studying heart physiology by perturbing that physiology and observing the effects on cardiac performance, and also as a means of investigating a variety of disease states affecting the mechanical function of the heart and its valves. For examples of such applications of an earlier, two-dimensional model of the left heart only, see [8-12].

Conclusion

The work described in this paper was begun on the ``Cray-0'' (better known as the CDC6600) and has made use of the Cray-1, Cray-2, Cray-XMP, Cray-YMP, and most recently the Cray-C90 computer. It therefore seems fitting to conclude with a word about Seymour Cray, the creator of this remarkable line of hardware. The user of a Cray computer cannot escape the feeling that he knows Cray personally, and, moreover, that Cray has deeply understood his particular application and designed the machine accordingly. Cray understood that fields like fluid dynamics require highly repetitive structured computations to proceed at lightning speed. He introduced vectorization to fill this need. He understood that the scope of a physical simulation is entirely determined by the speed and size of the computer on which it is to be performed, and he struggled to build the very best hardware that could possibly be created at any given time. He understood that serious scientists would rather have a machine that required some hard work to get the job done than a machine that was easy to use but couldn't get the job done at all. And he built what was needed to predict the weather, to design aircraft, and to simulate the human heart.

Acknowledgment

The authors are indebted to the National Science Foundation for support of this work under research grants BIR-9224743 and BIR-9302545. Computation was performed at the Pittsburgh Supercomputing Center (PSC) on the Cray C-90 computer under an allocation of resources MCA93S004P from the MetaCenter Allocation Committee. We are particularly indebted to the PSC for the early access that we were given to the Cray C-90 computer when that machine was new, and to Koushik Ghosh, then of Cray Research, for providing the three-dimensional FFT code that is used in this work. Visualization of results was done at the Scientific Visualization Laboratory, Academic Computing Facility, New York University.

References

[1] Peskin CS and McQueen DM: A three-dimensional computational method for blood flow in the heart: (I) immersed elastic fibers in a viscous incompressible fluid. J. Comput. Phys. 81:372-405, 1989

[2] McQueen DM and Peskin CS: A three-dimensional computational method for blood flow in the heart: (II) contractile fibers. J. Comput. Phys. 82:289-297, 1989

[3] Peskin CS and McQueen DM: Cardiac fluid dynamics. Crit. Rev. Biomed. Eng. 20:451-459, 1992

[4] Peskin CS and McQueen DM: A general method for the computer simulation of biological systems interacting with fluids. In: Biological Fluid Dynamics (Ellington CP and Pedley TJ, eds.), The Company of Biologists Limited, Cambridge UK, 1995, pp. 265-276.

[5] Peskin CS and McQueen DM: Fluid dynamics of the heart and its valves. In: Case Studies in Mathematical Modeling: Ecology, Physiology, and Cell Biology (Othmer HG, Adler FR, Lewis MA, and Dallon JC, eds.), Prentice-Hall, Englewood Cliffs NJ, 1996, pp. 309-337.

[6] Ghosh K: Unpublished communication.

[7] Peskin CS and McQueen DM: Mechanical equilibrium determines the fractal fiber architecture of the aortic heart valve leaflets. Am. J. Physiol. 266: H319-H328, 1994

[8] McQueen DM, Peskin CS, and Yellin EL: Fluid dynamics of the mitral valve: physiological aspects of a mathematical model. Am. J. Physiol. 242:H1095-H1110, 1982

[9] Peskin CS: The fluid dynamics of heart valves: experimental, theoretical, and computational methods. Ann. Rev. Fluid Mech. 14:235-259, 1982

[10] McQueen DM and Peskin CS: Computer-assisted design of pivoting-disc prosthetic mitral valves. J. Thorac. Cardiovasc. Surg. 86:126-135, 1983

[11] McQueen DM and Peskin CS: Computer-assisted design of butterfly bileaflet valves for the mitral position. Scand. J. Thor. Cardiovasc. Surg. 19:139-148, 1985

[12] McQueen DM and Peskin CS: Curved butterfly bileaflet prosthetic cardiac valve. US Patent Number 5026391, June 25, 1991.

Figures and Figure Legends

Figure 1: Top view of the three-dimensional Eulerian mesh of the fluid computation, showing the columnar organization that is used to parallelize the interaction routines. The shaded columns (which appear as squares, since they are seen from the top) all belong to the same group of columns. They are far enough apart (at least 4 mesh widths in either coordinate direction) that their `(4,4,ng)` surrounding columns of mesh points do not overlap. This is indicated by the dotted lines in the figure. The linked-list data structure that keeps track of which Lagrangian (fiber) points belong to each column is shown schematically at the bottom of the figure. A pointer from each column locates the first Lagrangian point in that column. Then, pointers within the shared linked list indicate the next Lagrangian point in the column, the next one after that, and so on until a `0` is encountered, to indicate that there are no more Lagrangian points in the column. At each timestep, the Lagrangian points in one of the shaded columns will move, at most, into a neighboring column (including possibly a diagonal neighbor). The neighborhoods formed by the shaded columns and their neighbors do not overlap, so the shared linked-list data structure can be updated in parallel by parallelizing over the different columns within a group.

Figure 2: Computer model of the heart in diastole and systole. Flow
patterns within the heart are shown at selected times during
ventricular diastole (relaxation) in the upper part of the figure and
during ventricular systole (contraction) in the lower part of the
figure. The flow is depicted in terms of streaklines showing the
computed trajectories of fluid particles over the recent past. Both
in the diastolic and in the systolic case, the view at the left shows
the heart in such a way that the interventricular septum is seen on
edge, and both ventricles can be visualized. The left ventricle
(which is seen to the right of the septum, since the heart is viewed
from the front) contains red markers depicting oxygenated blood, and
the right ventricle contains blue markers depicting partially
deoxygenated blood. The three vessels or chambers that can be seen
connecting to the ventricles in this view are, from left to right in
the figure, the main pulmonary artery, the ascending aorta, and the
left atrium. The corresponding valves, which can also be seen, are
the pulmonic, aortic, and mitral valves. In the diastolic frame, note
vortex formation as blood flows through the mitral valve. The right
atrium is behind the pulmonary artery in this view and cannot be seen
clearly. In the diastolic and systolic frames on the right side of
the figure, the heart has been rotated so that the right ventricle and
right atrium appear in front of other structures and can be clearly
seen. In the diastolic frame (above) two streams of blood from the
superior and inferior venae cavae are seen merging in the right atrium
and flowing into the right ventricle through the tricuspid valve. In
the systolic frame (below), blood is pumped out of the right ventricle
into the main pulmonary artery.

Figure 3: Detail of the valves of the left ventricle during diastole.
The aortic valve is seen above and to the left in the figure, and the
mitral valve is seen below and to the right. The aortic valve has
three self-supporting leaflets, and the mitral valve has two major
leaflets that are supported by chordae tendineae, which insert
into papillary muscles that project from the left ventricular wall.
Since the left ventricle is relaxed at this time, the aortic valve
is closed and the mitral valve is open.

Figure 4: Detail of the valves of the left ventricle during systole. See legend of Figure 3. Since the ventricle is contracting at this time, the mitral valve is closed and the aortic valve is open.
