

David M. McQueen and Charles S. Peskin

Courant Institute of Mathematical Sciences
New York University
251 Mercer Street
New York, NY 10012 USA

May 12, 1997


This paper describes the parallel implementation of the immersed boundary method on a shared-memory machine such as the Cray C-90 computer. In this implementation, outer loops are parallelized and inner loops are vectorized. The sustained computation rates achieved are 0.258 Gflops with a single processor, 1.89 Gflops with 8 processors, and 2.50 Gflops with 16 processors. An application to the computer simulation of blood flow in the heart is presented.

Journal of Supercomputing, Vol. 11, Issue 3, pages 213-236, 1997



Problems arising in biofluid dynamics typically involve the interaction of a viscous incompressible fluid with an immersed elastic boundary. Such problems can be solved by the immersed boundary method [1-5]. Our purpose here is to discuss the parallel implementation of this method on machines like the Cray C-90 computer, which is a shared-memory machine with a moderate number (e.g., 16) of vector processors. Results will be shown for the case of blood flow in the heart, which is the problem for which the immersed boundary method was originally developed.

Summary of the Immersed Boundary Method

Consider a viscous incompressible fluid containing an immersed system (a continuum) of elastic fibers. The fibers are pure force generators. They move at the local fluid velocity and apply force locally to the fluid in which they are immersed. Otherwise, they do not alter the properties of the fluid. Thus, we assume that the fiber-fluid composite material is incompressible and has the same constant density $\rho$ and constant viscosity $\mu$ as the fluid itself.

The equations of motion of such a system are as follows:

$$\rho\left(\frac{\partial\mathbf{u}}{\partial t}+\mathbf{u}\cdot\nabla\mathbf{u}\right)=-\nabla p+\mu\,\nabla^{2}\mathbf{u}+\mathbf{F} \eqno(1)$$

$$\nabla\cdot\mathbf{u}=0 \eqno(2)$$

$$\mathbf{F}(\mathbf{x},t)=\int\mathbf{f}(q,r,s,t)\,\delta(\mathbf{x}-\mathbf{X}(q,r,s,t))\,dq\,dr\,ds \eqno(3)$$

$$\frac{\partial\mathbf{X}}{\partial t}(q,r,s,t)=\mathbf{u}(\mathbf{X}(q,r,s,t),t)=\int\mathbf{u}(\mathbf{x},t)\,\delta(\mathbf{x}-\mathbf{X}(q,r,s,t))\,d\mathbf{x} \eqno(4)$$

$$T=\sigma\!\left(\left|\frac{\partial\mathbf{X}}{\partial s}\right|;q,r,s,t\right) \eqno(5)$$

$$\boldsymbol{\tau}=\frac{\partial\mathbf{X}/\partial s}{\left|\partial\mathbf{X}/\partial s\right|} \eqno(6)$$

$$\mathbf{f}=\frac{\partial}{\partial s}\left(T\boldsymbol{\tau}\right) \eqno(7)$$
Equations 1-2 are the Navier-Stokes equations of a viscous incompressible fluid. They are written in Eulerian form, in which the independent variables are the position $\mathbf{x}$ and the time $t$. The functions of these variables that appear in Eqs. 1-2 are the fluid velocity $\mathbf{u}(\mathbf{x},t)$, the fluid pressure $p$, and the Eulerian fiber force density $\mathbf{F}(\mathbf{x},t)$.

Equations 5-7 are the fiber equations. They are written in terms of the Lagrangian independent variables q,r,s,t, where fixed values of q,r,s designate a material point, fixed values of q,r designate a fiber, and t is the time. The functions of these independent variables are the fiber position $\mathbf{X}(q,r,s,t)$, the fiber tension T (defined so that $T\boldsymbol{\tau}\,dq\,dr$ is the force transmitted by the bundle of fibers $dq\,dr$), the unit tangent to the fibers $\boldsymbol{\tau}$, and the Lagrangian fiber force density $\mathbf{f}(q,r,s,t)$. Equation 5 is the generalized Hooke's law that gives the fiber stress T in terms of the fiber strain, which is determined by $|\partial\mathbf{X}/\partial s|$. Note that the stress-strain relationship $\sigma(|\partial\mathbf{X}/\partial s|;q,r,s,t)$ allows for explicit position and time dependence. It is the time dependence, in particular, that allows the heart muscle to contract and relax. Equation 6 defines the unit tangent $\boldsymbol{\tau}$ to the fibers. Equation 7 gives the Lagrangian form of the force density $\mathbf{f}$ applied by the fibers to the fluid. For a derivation of the relationship $\mathbf{f}=\partial(T\boldsymbol{\tau})/\partial s$, see [1,4,5].

Equations 3-4 are the interaction equations. Each of them involves the three-dimensional Dirac delta function $\delta(\mathbf{x})$, which acts as the kernel of an integral transformation. These equations convert from Lagrangian to Eulerian variables and vice versa. Equation 3 represents the force of the fibers on the fluid as a continuous sum (integral) of point forces applied at the locations of the fiber points. Equation 4 is the no-slip condition of a viscous fluid. Here it serves as an equation of motion of the fibers by asserting that the fibers move at the local fluid velocity.

In the immersed boundary method, the fluid equations are discretized on a regular cubic lattice in the physical space of the position variable $\mathbf{x}$, and the fiber equations are discretized on a rectangular lattice in the space of the Lagrangian parameters q,r,s. Let the mesh widths of these two lattices be denoted $h$ and $(\Delta q,\Delta r,\Delta s)$, respectively.

On the fluid lattice, the forward, backward, and central divided difference operators (corresponding to derivatives) will be denoted $D^{+}$, $D^{-}$, and $D^{0}$, respectively. A subscript $\alpha\in\{1,2,3\}$ will indicate the Cartesian direction in which the difference is to be applied. The notation $\mathbf{D}^{0}$ will be used for the vector central difference operator $(D_{1}^{0},D_{2}^{0},D_{3}^{0})$, which is analogous to $\nabla$.

On the Lagrangian fiber lattice, the only difference operator we need is a central difference in the fiber direction, which we denote $D_{s}$. This has the definition $(D_{s}\phi)(s)=(\phi(s+\Delta s/2)-\phi(s-\Delta s/2))/\Delta s$, so it connects functions defined on staggered lattices. In particular, the functions $T$ and $\boldsymbol{\tau}$ are associated with the links between fiber points, whereas the functions $\mathbf{X}$ and $\mathbf{f}$ are associated with the fiber points themselves.

Finally, let time proceed in steps of duration $\Delta t$, and let superscripts denote the timestep index. The discretized equations are as follows:

$$\rho\left(\frac{\mathbf{u}^{n+1}-\mathbf{u}^{n}}{\Delta t}+\sum_{\alpha=1}^{3}u_{\alpha}^{n}D_{\alpha}^{\pm}\mathbf{u}^{n}\right)=-\mathbf{D}^{0}p^{n+1}+\mu\sum_{\alpha=1}^{3}D_{\alpha}^{+}D_{\alpha}^{-}\mathbf{u}^{n+1}+\mathbf{F}^{n} \eqno(8)$$

$$\mathbf{D}^{0}\cdot\mathbf{u}^{n+1}=0 \eqno(9)$$

$$\mathbf{F}^{n}(\mathbf{x})=\sum_{q,r,s}\mathbf{f}^{n}(q,r,s)\,\delta_{h}(\mathbf{x}-\mathbf{X}^{n}(q,r,s))\,\Delta q\,\Delta r\,\Delta s \eqno(10)$$

$$\mathbf{X}^{n+1}(q,r,s)=\mathbf{X}^{n}(q,r,s)+\Delta t\sum_{\mathbf{x}}\mathbf{u}^{n+1}(\mathbf{x})\,\delta_{h}(\mathbf{x}-\mathbf{X}^{n}(q,r,s))\,h^{3} \eqno(11)$$

$$T^{n}=\sigma\!\left(\left|D_{s}\mathbf{X}^{n}\right|;q,r,s,t^{n}\right) \eqno(12)$$

$$\boldsymbol{\tau}^{n}=\frac{D_{s}\mathbf{X}^{n}}{\left|D_{s}\mathbf{X}^{n}\right|} \eqno(13)$$

$$\mathbf{f}^{n}=D_{s}\left(T^{n}\boldsymbol{\tau}^{n}\right) \eqno(14)$$
In these equations, the following notations require further explanation. First, in Eq. 8, the notation $D_{\alpha}^{\pm}$ refers to the ``upwind'' difference: the forward difference is used if the convection velocity is negative, and the backward difference if the convection velocity is positive. This is done for numerical stability. Second, in Eqs. 10-11, the notations $\sum_{q,r,s}$ and $\sum_{\mathbf{x}}$ refer to sums over the fiber and fluid lattices, respectively. Finally, in these same equations, the notation $\delta_{h}$ refers to an approximation to the three-dimensional Dirac delta function, depending on the mesh width $h$. For purposes of the present paper, it is only important to note that we construct this approximate delta function in such a way that $\delta_{h}(\mathbf{x}-\mathbf{X})$ is nonzero only for $\mathbf{x}$ in a $4\times4\times4$ array of lattice points surrounding the lattice cell that contains $\mathbf{X}$. The specific construction of $\delta_{h}$ is described in [4,5].

Each timestep of the immersed boundary method proceeds as follows. At the beginning of the timestep, the fiber configuration $\mathbf{X}^{n}$ and the fluid velocity $\mathbf{u}^{n}$ are known. Equations 12-14 are used to evaluate the fiber tension $T^{n}$, the fiber direction $\boldsymbol{\tau}^{n}$, and then the Lagrangian fiber force density $\mathbf{f}^{n}$. Equation 10 is then used to transfer the fiber force density from the fiber lattice to the fluid lattice. The result of this computation is the Eulerian fiber force density $\mathbf{F}^{n}$. Next, Eqs. 8-9 are solved simultaneously for the unknowns $\mathbf{u}^{n+1}$ and $p^{n+1}$. Despite the nonlinearity of the Navier-Stokes equations, the fluid equations that we solve at each timestep are linear, constant-coefficient difference equations. This is because the nonlinearity is expressed in terms of the velocity field $\mathbf{u}^{n}$, which is known from the result of the previous timestep. Once the new velocity field has been found, Eq. 11 is used to interpolate this velocity field to the fiber points and to move them accordingly. Once this has been done, $\mathbf{X}^{n+1}$ and $\mathbf{u}^{n+1}$ are known, and the timestep is complete.

Parallel Vector Implementation: Fiber Equations

In this section we consider the computation of the Lagrangian fiber force density $\mathbf{f}$. Since the different fibers do not interact with each other during this computation, there is a lot of opportunity for parallelization. The overall fiber structure is intrinsically three-dimensional, as indicated by the three Lagrangian parameters q,r,s. Thus, the structure of our code will involve triply nested loops. To obtain good efficiency the innermost loop will be vectorized, and the outermost loop will be parallelized. This strategy proceeds most simply if the parameter s, which varies along a fiber, is incremented during the middle of the three loops, which is not how one would instinctively write the program. The following example involves the computation of $\mathbf{f}$ once $T$ and $\boldsymbol{\tau}$ have already been determined.

      call zero(fL1,fLsize)
      call zero(fL2,fLsize)
      call zero(fL3,fLsize)
c (The statements computing T1,T2,T3, the subtraction at the far end
c  of each link, and the compiler directive are reconstructed from
c  the description in the text; array names are illustrative.)
CMIC$ DO ALL SHARED(nq,ns,nr,T,tau1,tau2,tau3,fL1,fL2,fL3)
CMIC$1 PRIVATE(iq,is,ir,T1,T2,T3)
      do iq=1,nq
       do is=1,ns-1
        do ir=1,nr
         T1=T(is,ir,iq)*tau1(is,ir,iq)
         T2=T(is,ir,iq)*tau2(is,ir,iq)
         T3=T(is,ir,iq)*tau3(is,ir,iq)
         fL1(is  ,ir,iq)=fL1(is  ,ir,iq)+T1
         fL2(is  ,ir,iq)=fL2(is  ,ir,iq)+T2
         fL3(is  ,ir,iq)=fL3(is  ,ir,iq)+T3
         fL1(is+1,ir,iq)=fL1(is+1,ir,iq)-T1
         fL2(is+1,ir,iq)=fL2(is+1,ir,iq)-T2
         fL3(is+1,ir,iq)=fL3(is+1,ir,iq)-T3
        end do
       end do
      end do

In the above code, the force vector (T1,T2,T3) associated with each fiber link is computed and then added in, with the appropriate sign, to the resultant force on the points at the two ends of the link. Note that the above code would not vectorize if the inner loop were to run along a fiber. In fact, however, the inner loop runs across the fibers, so there is no difficulty about vectorization. By not assigning indices to the temporary variables T1,T2,T3, we tell the compiler that these values can be held in vector registers and need not be stored to memory.

Another important aspect of vectorization is to avoid memory bank conflicts. The Cray C-90 computer has 16 memory banks, which we can think of as being numbered 0...15, and the individual memory addresses are assigned to the banks according to the rule that the bank number is the address modulo 16. Thus, successive addresses are in different banks, which means that vectorized loops with stride 1 through address space proceed without conflict. There is a danger in the above code, however. The stride through the inner loop is not 1 but is rather the size of the first dimension of the various arrays, which would ordinarily be ns, the number of points on a fiber. If this happens to be a multiple of 16, the vectorized inner loop will be drastically slowed because it will only be using one of the memory banks. This is easily fixed, however, by padding the relevant dimension, i.e., by allocating more memory than is actually used. This affects only the dimension statements; the code given above is unchanged. From now on, we shall assume without further comment that memory bank conflicts have been avoided in this way.

Parallelization of the outer loop is accomplished by the CMIC$ compiler directives. The directive CMIC$ DO ALL tells the compiler that the next do loop (in our case the loop over iq) is to be parallelized. Each pass through the loop may be done by a different processor, if there are enough processors available for that to happen. If not, when a processor is done with a particular pass through the loop it declares itself available and is given another pass through the loop to do. Note that the programmer does not need to know in advance how many processors will be available (this may depend on other jobs that are running simultaneously), and also that the programmer does not need to make a specific plan as to which passes through the loop will be done by the different processors. Load balancing in this style of parallelization is automatic: If a processor is given a pass through the loop that involves only a little work, then it returns early and is given more work to do. Only when the parallel loop is almost done can the situation arise that a processor returns and then remains idle while waiting for the other processors to finish their tasks because there are no more passes through the loop left to do. This end-effect inefficiency can be kept small by making sure that there are a lot more passes through the loop than processors. In the example given above, load balancing is not really an issue, since each pass through the loop involves the same amount of work. Later, however, we shall encounter examples where this is not the case, so it is important to realize that load balancing will be achieved, to a good approximation, even when different passes through a parallel loop involve different amounts of work.

Note that the CMIC$ directives also tell the compiler which variables are shared and which must be private to the individual processor. Shared variables are those that all of the processors can access simultaneously. If they are to be updated, it is up to the programmer to make sure that the results are not dependent on the order in which the shared variables happen to be accessed by the different processors. For example, in the code above, the shared arrays fL1, fL2, and fL3 are updated, but the different processors work in different parts of these arrays, so there is no danger. Private variables should be thought of as local to the processor. They have no global meaning and cannot be accessed after the parallelized loop is complete.

Parallel Vector Implementation: Fluid Equations

We now consider the fluid solver defined by Eqs. 8-9. The computation of $\mathbf{u}^{n+1}$ and $p^{n+1}$ from $\mathbf{u}^{n}$ and $\mathbf{F}^{n}$ can be broken down into two steps. The first one, which we call the ``upwind'' step, involves evaluation of the vector field

$$\mathbf{w}=\mathbf{u}^{n}+\Delta t\left(\frac{1}{\rho}\mathbf{F}^{n}-\sum_{\alpha=1}^{3}u_{\alpha}^{n}D_{\alpha}^{\pm}\mathbf{u}^{n}\right) \eqno(15)$$
Once this has been done, $\mathbf{u}^{n+1}$ and $p^{n+1}$ are determined by solving the system

$$\rho\,\frac{\mathbf{u}^{n+1}-\mathbf{w}}{\Delta t}=-\mathbf{D}^{0}p^{n+1}+\mu\sum_{\alpha=1}^{3}D_{\alpha}^{+}D_{\alpha}^{-}\mathbf{u}^{n+1} \eqno(16)$$

$$\mathbf{D}^{0}\cdot\mathbf{u}^{n+1}=0 \eqno(17)$$
The main complication concerning the parallel vector implementation of the upwind step is the ``if'' statement that is implicit in the use of upwind differencing. This is handled by the library function CVMGP(A,B,C), which returns A if C is positive and B otherwise. This function, known as ``conditional vector merge (positive)'', can be used inside a vectorized loop without impeding vectorization.

The following code assumes that the values of u are known not only on the domain of interest but also on the neighboring planes given by i,j,k = 0 or ng+1. In the case of periodic boundary conditions, this can be established by first copying the data from the plane i=1 into the plane i=ng+1 and the data from the plane i=ng into the plane i=0, and similarly for j and k. We assume this has already been done.

Note that the following code uses ``program units'', in which the unit of length is $h$, the unit of time is $\Delta t$, and the unit of mass is $\rho h^{3}$. Thus, the constants $h$, $\Delta t$, and $\rho$ do not appear, and we save quite a few unnecessary multiplications and divisions.

      do k=1,ng
       do j=1,ng
        do i=1,ng
         w1(i,j,k) = u1(i,j,k) + FE1(i,j,k)
     1              -u1(i,j,k)*CVMGP(u1(i  ,j  ,k  )-u1(i-1,j  ,k  ),
     2                               u1(i+1,j  ,k  )-u1(i  ,j  ,k  ),
     3                                               u1(i  ,j  ,k  ))
     4              -u2(i,j,k)*CVMGP(u1(i  ,j  ,k  )-u1(i  ,j-1,k  ),
     5                               u1(i  ,j+1,k  )-u1(i  ,j  ,k  ),
     6                                               u2(i  ,j  ,k  ))
     7              -u3(i,j,k)*CVMGP(u1(i  ,j  ,k  )-u1(i  ,j  ,k-1),
     8                               u1(i  ,j  ,k+1)-u1(i  ,j  ,k  ),
     9                                               u3(i  ,j  ,k  ))
         w2(i,j,k) = u2(i,j,k) + FE2(i,j,k)
     1              -u1(i,j,k)*CVMGP(u2(i  ,j  ,k  )-u2(i-1,j  ,k  ),
     2                               u2(i+1,j  ,k  )-u2(i  ,j  ,k  ),
     3                                               u1(i  ,j  ,k  ))
     4              -u2(i,j,k)*CVMGP(u2(i  ,j  ,k  )-u2(i  ,j-1,k  ),
     5                               u2(i  ,j+1,k  )-u2(i  ,j  ,k  ),
     6                                               u2(i  ,j  ,k  ))
     7              -u3(i,j,k)*CVMGP(u2(i  ,j  ,k  )-u2(i  ,j  ,k-1),
     8                               u2(i  ,j  ,k+1)-u2(i  ,j  ,k  ),
     9                                               u3(i  ,j  ,k  ))
         w3(i,j,k) = u3(i,j,k) + FE3(i,j,k)
     1              -u1(i,j,k)*CVMGP(u3(i  ,j  ,k  )-u3(i-1,j  ,k  ),
     2                               u3(i+1,j  ,k  )-u3(i  ,j  ,k  ),
     3                                               u1(i  ,j  ,k  ))
     4              -u2(i,j,k)*CVMGP(u3(i  ,j  ,k  )-u3(i  ,j-1,k  ),
     5                               u3(i  ,j+1,k  )-u3(i  ,j  ,k  ),
     6                                               u2(i  ,j  ,k  ))
     7              -u3(i,j,k)*CVMGP(u3(i  ,j  ,k  )-u3(i  ,j  ,k-1),
     8                               u3(i  ,j  ,k+1)-u3(i  ,j  ,k  ),
     9                                               u3(i  ,j  ,k  ))
        end do
       end do
      end do

Thus, the upwind computation is vectorized over the rows of the computational lattice and it is parallelized over the planes of that lattice. When each processor is given a plane to work on, the task that it has to perform involves ng vectorized computations, each of size ng.

Once the upwind computation has been performed and $\mathbf{w}$ is known, it is time to solve the system of Eqs. 16-17 for $\mathbf{u}^{n+1}$ and $p^{n+1}$. As remarked above, these are linear difference equations with constant coefficients, and they may be conveniently solved with the help of the Fast Fourier Transform (FFT) algorithm. The three-dimensional FFT that we use was kindly provided by Koushik Ghosh [6], then of Cray Research. Like the rest of the code that we have already described, it uses vectorization of inner loops and parallelization of outer loops to obtain high speed.

Parallel Vector Implementation: Interaction Equations

This is the hard part! We now consider the implementation of Eqs. 10-11 in which the Lagrangian (fiber) and Eulerian (fluid) variables interact. It is important to note that this interaction is local: The fibers move at the local fluid velocity (Eq. 11) and exert forces locally on the fluid (Eq. 10). (Because the fluid is incompressible, these forces have instantaneous nonlocal effects that are mediated by the pressure field, but that occurs in the fluid equations rather than in the interaction equations per se.)

The locality of interaction between the fibers and the fluid is expressed in the immersed boundary method by the bounded support of the function $\delta_{h}$. This function has the property that $\delta_{h}(\mathbf{x})=0$ if $|x_{\alpha}|\ge 2h$ for any $\alpha\in\{1,2,3\}$. (For further details concerning $\delta_{h}$, see [4,5].) Thus $\delta_{h}(\mathbf{x}-\mathbf{X})$ is nonzero only for $\mathbf{x}$ in the interior of a cube of edge $4h$ centered on the point $\mathbf{X}$ and aligned with the coordinate axes.

Given the coordinates of a fiber point, it is therefore easy to identify the points of the fluid lattice with which it interacts. This suggests that we implement both Eq. 10 and Eq. 11 by looping over the fiber points in some arbitrary order, and for each fiber point making use of just the $4\times4\times4$ sub-array of the Eulerian fluid mesh that is ``within range'' of that fiber point (the range being established by the support of the function $\delta_{h}$).

The implementation of this strategy on a parallel vector computer faces two obstacles. The first is that the Eulerian (fluid) lattice has to be accessed in an unstructured way: there is no constant stride through memory. Note that even the $4\times4\times4=64$ fluid mesh points associated with a given fiber point are not stored in contiguous memory (recall the manner in which multidimensional arrays are mapped into the one-dimensional address space of a computer). Then there is even less structure when we move from one fiber point to another. The lack of constant stride impedes vectorization.

The second obstacle affects both parallelization and vectorization. Note that Eq. 10 gives the Eulerian force density $\mathbf{F}(\mathbf{x})$ at a given Eulerian lattice point $\mathbf{x}$ as a sum of contributions from a collection of fiber points, and that Eq. 11 gives the displacement of a fiber point as a sum of contributions from a collection of Eulerian mesh velocities. The computation of a sum involves a loop like the following:

      sum=0.
      do i=1,nterms
       sum=sum+term(i)
      end do

As is well known, this loop does not produce correct results when vectorized or parallelized. What happens is this: First the value 0 that is stored in the memory location assigned to sum will be fetched from memory nterms times. Then each of the values of term will be fetched and added to each of these nterms zeros. Finally, the nterms results will be stored in the memory location assigned to sum, and the alleged answer will be whichever one was stored there last, which is wrong no matter which one it is! Thus we cannot vectorize or parallelize the computation of $\mathbf{F}^{n}$ over the fiber points, nor can we vectorize or parallelize the computation of $\mathbf{X}^{n+1}$ over the points of the fluid lattice.

The strategy that we have adopted to overcome these difficulties is based on an organization of the data into columns. Let a cell of the Eulerian (fluid) mesh be a cube of edge $h$ with mesh points at its corners. Consider the set $S(ii,jj)$ of all fiber points within a given column of such cells. The part of the fluid lattice with which the fiber points of $S(ii,jj)$ interact is a sub-array of dimension (4,4,ng). The fiber points in $S(ii,jj)$ lie in the central column of this sub-array. If we consider two different columns constructed in this way, their (4,4,ng) sub-arrays may overlap. But they will not overlap if their locations differ by at least $4h$ in either the $x_{1}$ or the $x_{2}$ coordinate direction. Thus, it is advantageous to partition the columns into 16 groups, defined as follows: Let (izero,jzero) be indices that define a group, where izero and jzero can each take on the values 1,2,3,4. Then, within a group, let the individual columns be identified by the indices (ii,jj), where ii=izero,izero+4,izero+8,... and jj=jzero,jzero+4,jzero+8,.... There is then no overlap between the (4,4,ng) sub-arrays of the fluid lattice associated with the different columns belonging to one particular group; see Figure 1. This makes it possible to parallelize over the different columns within a group, though the processing of one group must be completed before the processing of the next one may begin. On a $128^{3}$ fluid grid, which is our typical case, there are $(128/4)^{2}=1024$ columns in a group, so there is plenty of parallel work to do. (Recall that the Cray C-90 computer has 16 processors.) This is good from the standpoint of load balancing: Different columns contain different numbers of fiber points (some columns may even be empty), so some processors will return from processing a column earlier than others. Those that do so will immediately be given another column to process, so the overall load will remain balanced despite the heterogeneity in the amount of work to be done per column.

Besides making parallelization possible, the strategy that we have just outlined is also advantageous from the standpoint of vectorization. The library function gather can be used to collect Lagrangian position and force density data from the fiber points in a column, and to organize these data in one-dimensional arrays. Similarly, gather can be used to collect Eulerian velocity and force density data from the associated (4,4,ng) sub-array of the fluid lattice, and to store those data in one-dimensional arrays as well. Once the data have been collected and organized in this way, vectorized computations can be performed on the data, and then the library function scatter can be used to return results to their proper locations in memory. Finally, we have to take care that when sums are being computed the vectorization is over the multiple sums that have to be done, rather than over the multiple terms within a sum.

At this point the reader may protest that the strategy outlined above will founder on the task of locating all of the fiber points in a given column. How can this be done efficiently, without searching through all of the fiber points? The answer is to use a linked-list data structure in which an array of pointers firstn(ii,jj) gives the index of the ``first'' (actually, the order is of no importance) fiber point in column (ii,jj), and then the one-dimensional array nextn is used to find the index of the next point, and so on until a 0 is found, indicating that there are no more points in the column. Note that a single array nextn is adequate to hold the information about all of the columns. The linked-list data structure that we have just described is depicted in Figure 1.

But how can the linked-list data structure be efficiently computed? Since the fiber points move, this has to be done at every timestep, and unless we can do it in parallel, there will be a substantial serial bottleneck. Fortunately, we can impose a restriction on how far the fiber points can move in a single timestep. Let $u_{max}$ be an upper bound on any component of the fluid velocity, and make the restriction that $u_{max}\,\Delta t\le h$. This restriction (actually, a slightly stronger one than this) is needed in any case for the stability of the upwind scheme for the fluid dynamics. Then fiber points in any particular column on a given timestep can, at worst, be found in a neighboring column on the next timestep (including diagonal neighbors). This means that the linked lists for all of the columns in a group, as defined above, can be updated in parallel, since the columns in a group are far enough apart that even their neighborhoods are disjoint, where the neighborhood of a column is defined as the union of that column and its immediate neighbors (including diagonal neighbors), see Figure 1.

The code to update the linked-list data structure that defines the partition of fiber points into columns will now be given. (We omit the initialization of this data structure, since that does not need to be done efficiently, as it is only done once.)

c Code to update linked-list data structure 
c that partitions fiber points into columns
c (The executable statements here are a reconstruction guided by the
c  comments; the column-index arithmetic assumes program units, in
c  which the mesh width is 1.)
c Loop over the 16 groups of columns:
      do izero=1,4
       do jzero=1,4
c Now set up a parallel loop over all of the columns in a group:
CMIC$ DO ALL SHARED(ng,izero,jzero,firstn,nextn,x1,x2)
CMIC$1 PRIVATE(ij,ii,jj,n,nprev,nnext,in,jn)
        do ij=0,(ng/4)**2-1
c Consider the column whose coordinates are (ii,jj):
         ii=izero+4*mod(ij,ng/4)
         jj=jzero+4*(ij/(ng/4))
c Initialize a trailing pointer:
         nprev=0
c n is the index of the first Lagrangian point 
c that should be in the column (ii,jj).
c (If n=0, there are no such points.)
         n=firstn(ii,jj)
         do while(n.ne.0)
c (in,jn) is the column that actually contains the Lagrangian point n.
          in=1+int(x1(n))
          jn=1+int(x2(n))
c Test to see whether it is different from (ii,jj):
          if ((in.ne.ii).or.(jn.ne.jj)) then
c Remember the value of the pointer to the next Lagrangian point
c that should be in column (ii,jj):
           nnext=nextn(n)
c Insert the Lagrangian point n into the linked list of column (in,jn):
           nextn(n)=firstn(in,jn)
           firstn(in,jn)=n
c Delete the Lagrangian point n from the linked list of column (ii,jj):
           if (nprev.eq.0) then
            firstn(ii,jj)=nnext
           else
            nextn(nprev)=nnext
           end if
c Locate the next Lagrangian point that should be in the column (ii,jj),
c but leave the trailing pointer unchanged, 
c since a Lagrangian point has been deleted 
c from the linked list of column (ii,jj).
           n=nnext
          else
c Lagrangian point n belongs in column (ii,jj).
c Update the trailing pointer:
           nprev=n
c and then locate the next Lagrangian point 
c that should be in the column (ii,jj):
           n=nextn(n)
          end if
         end do
        end do
c The parallel loop over a group of columns ends here.
c When the whole group is done, move on to the next group of columns.
       end do
      end do

In the foregoing code, all of the processors may be simultaneously updating the shared array nextn, which holds the linked-list data structure. Moreover, the processors skip around that array seemingly with abandon, and the whole procedure seems very dangerous. We know, however, that these parallel processes cannot conflict, since the columns involved are sufficiently well separated, as described above and depicted in Figure 1.

We now consider how the linked-list data structure is used in the computation of the Eulerian force density $\mathbf{F}$ according to Eq. 10. The following code accomplishes this task:

c This code uses the Lagrangian force density fL1,fL2,fL3
c to compute the Eulerian force density FE1,FE2,FE3.
c The Lagrangian point coordinates are stored in the arrarys x1,x2,x3.
c These points have been sorted into columns.
c This columnar organization is stored in the shared linked list nextn,
c which is accessed via the pointers contained in firstn.
c Note that here the Lagrangian coordinates and force density components
c are stored in one-dimensional arrays.
c Initialize the Eulerian force density arrays to zero:      
      call zero(FE1,FEsize)
      call zero(FE2,FEsize)
      call zero(FE3,FEsize)
c Set up a loop over 16 different groups of columns.
c Parallelization will be done WITHIN each of these groups.
      do izero=1,4
       do jzero=1,4
CMIC$2       fL1,fL2,fL3,x1,x2,x3)
CMIC$4        FEc1,FEc2,FEc3,numrem,npoints,npt,indx,xc1,xc2,xc3,
CMIC$5        fLc1,fLc2,fLc3,fi,fj,fk,m,mzero)
c Begin parallel loop over the columns in a group:
        do ijcount=0,(ng/4)**2-1
c Find the coordinates (ii,jj) of the current column.
c Columns within a group are 4 meshwidths apart in each coordinate direction.
c Locate the first Lagrangian point in column (ii,jj):
c Proceed with the column only if there at least one such point:
c Consider the 4x4 column of the Eulerian mesh that is influenced 
c by the Lagrangian points in the column (ii,jj).
c Let the coordinates of points in this column be denoted (i3D,j3D,k3D), 
c where i3D=ii-1...ii+2, j3D=jj-1...jj+2,
c The inline function index3D converts (i3D,j3D,k3D) 
c to a one-dimensional address, which is stored in the array indc3d
c for future gather and scatter operations.  The details of this
c function depend on the dimensions of the Eulerian arrays.
c The order of storage within indc3d is such that the 16 Eulerian points in a 
c horizontal plane (x3=constant) are stored in contiguous memory locations.
          do i=0,3
           do j=0,3
            do k3D=1,ng
            end do
           end do
          end do
c Now used indc3D to direct the gathering of the force data for one column:
          call gather(16*ng,FEc1,FE1,indc3D)
          call gather(16*ng,FEc2,FE2,indc3D)
          call gather(16*ng,FEc3,FE3,indc3D)
c Consider the Lagrangian points of the column 
c in bunches of size at most nptmax.
c (This is done to limit the amount of storage that will be needed.)
          do while (
c Use the linked list to identify the next bunch of Lagrangian points  
c to be considered.  Store their indices in the array indx
c which will be used to gather the data concerning these points.
           do npt=1,npoints
           end do
c Gather the Lagrangian coordinate and force density data:
           call gather(npoints,xc1,x1,indx)
           call gather(npoints,xc2,x2,indx)
           call gather(npoints,xc3,x3,indx)
           call gather(npoints,fLc1,fL1,indx)
           call gather(npoints,fLc2,fL2,indx)
           call gather(npoints,fLc2,fL3,indx)
c For each gathered Lagrangian point, 
c compute the 4x4x4=64 delta function weights 
c and store these in the array delta.  
c Note that deltaf is an inline function, 
c and that vectorization is over the Lagrangian points in the bunch
c that is being considered.  The 64 delta function weights of a given
c Lagrangian point are stored in contiguous memory.
c (Loop body reconstructed for illustration: deltaf is the
c one-dimensional interpolation kernel, and the weights are indexed
c so that the 16 weights of a horizontal plane are contiguous.)
           do j=0,3
            do i=0,3
             do k=0,3
              do npt=1,npoints
               delta(16*k+4*j+i+1,npt) =
     &           deltaf(xc1(npt)-(ii-1+i))
     &          *deltaf(xc2(npt)-(jj-1+j))
     &          *deltaf(xc3(npt)-(int(xc3(npt))-1+k))
              end do
             end do
            end do
           end do
c  Now use the delta function weights and the Lagrangian force density
c  to update the Eulerian force density for the 4x4 column of Eulerian 
c  mesh points surrounding the column (ii,jj) of Lagrangian points.
c  Here the vectorization is over the 64 Eulerian points associated
c  with a given Lagrangian point.  These 64 Eulerian points are
c  stored in locations mzero+1...mzero+64 of the arrays FEc1,FEc2,FEc3.
c (Illustrative body: mzero(npt) is the base address, within the
c gathered column arrays, of the 64 Eulerian points that interact
c with Lagrangian point npt.)
           do npt=1,npoints
            do m=1,64
             FEc1(mzero(npt)+m) = FEc1(mzero(npt)+m)
     &                          + delta(m,npt)*fLc1(npt)
             FEc2(mzero(npt)+m) = FEc2(mzero(npt)+m)
     &                          + delta(m,npt)*fLc2(npt)
             FEc3(mzero(npt)+m) = FEc3(mzero(npt)+m)
     &                          + delta(m,npt)*fLc3(npt)
            end do
           end do
c  Scatter the results for this column back to the 3D Eulerian mesh.
c  Although different processors are doing this at the same time,
c  their operations cannot interfere, since the 4x4 columns considered
c  within this parallel loop do not overlap. 
           call scatter(16*ng,FE1,indc3D,FEc1)
           call scatter(16*ng,FE2,indc3D,FEc2)
           call scatter(16*ng,FE3,indc3D,FEc3)
c  Calculate the number of points that remain to be done in column (ii,jj).
c  If the result is zero, the do while loop will terminate.
           nremain = nremain - npoints
          end do
         end if
        end do
c  The parallel loop over columns in a group ends here, so this
c  is a point at which the processors must synchronize.  The two
c  outer loops are over the 16 different groups of columns.
c  Parallelization is over the (ng/4)**2 columns within a group.
       end do
      end do

The code that implements Eq. 11, which moves each fiber point at the locally interpolated fluid velocity, is similar in structure to the foregoing and will not be given in full here. The main difference appears near the end, after the delta-function weights have been computed and stored in the array delta(m,npt). We then proceed as follows to find the displacements of the fiber points:

c Calculate separately the contribution to the displacement of
c each Lagrangian point from each of the Eulerian points with
c which it interacts:
c (Illustrative body: uEc1, uEc2, uEc3 hold the gathered Eulerian
c velocity for the column, dt is the timestep, and mzero(npt) is the
c base address of the 64 Eulerian points of Lagrangian point npt.)
           do npt=1,npoints
            do m=1,64
             dxc1(m,npt) = dt*delta(m,npt)*uEc1(mzero(npt)+m)
             dxc2(m,npt) = dt*delta(m,npt)*uEc2(mzero(npt)+m)
             dxc3(m,npt) = dt*delta(m,npt)*uEc3(mzero(npt)+m)
            end do
           end do
c Now combine the contributions computed above
c and move the Lagrangian points accordingly:
c (dxc1, dxc2, dxc3 are the per-term displacement contributions
c computed in the loops above.)
           do m=1,64
            do npt=1,npoints
             xc1(npt) = xc1(npt) + dxc1(m,npt)
             xc2(npt) = xc2(npt) + dxc2(m,npt)
             xc3(npt) = xc3(npt) + dxc3(m,npt)
            end do
           end do

In the foregoing code, note the reversal of the order of the loops. The first pair of loops is vectorized over the 64 Eulerian points with which a Lagrangian point interacts, but the second pair of loops is vectorized over the different Lagrangian points. In both cases, it has to be done that way. In the first pair of loops, constant stride is achieved only because the vectorization is over m rather than npt. The second pair of loops illustrates the solution to the problem of vectorizing a sum, as discussed above. We obtain vectorization while avoiding incorrect results by vectorizing over the many sums that have to be computed, rather than over the terms of each individual sum.
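The sum-vectorization pattern of the second pair of loops can be recast in Python/NumPy for illustration (a sketch with array names of our own choosing, not the paper's Fortran variables):

```python
import numpy as np

npoints, nterms = 5, 64                     # a bunch of fiber points, 64 weights each
contrib = np.random.rand(nterms, npoints)   # contrib[m, npt]: one term of one sum

# Vectorizing over the 64 terms of a single sum would create a recurrence
# on the accumulator; vectorizing over the independent sums does not.
# Here the inner (vector) dimension is npt: each pass of the outer loop
# over m adds one more term to all npoints partial sums at once.
total = np.zeros(npoints)
for m in range(nterms):
    total += contrib[m, :]                  # npoints independent updates

assert np.allclose(total, contrib.sum(axis=0))
```

The outer loop over m is sequential, but each of its iterations performs npoints independent updates, which is exactly the shape of work a vector unit handles safely.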

For clarity, we have simplified the discussion of the interaction code by omitting the technicalities associated with the edges of the domain. The code given above will work as written if all Lagrangian point coordinates lie within the half-open interval [1,ng+1), and if the Eulerian mesh has dimensions that include (0:ng+2,0:ng+2,0:ng+2). In practice, we use periodic boundary conditions with period ng in all three coordinate directions. This requires the following modifications and additions to the code as presented above. First, whenever any coordinate of a Lagrangian point is used, that coordinate must be replaced by its periodic image within the interval [1,ng+1). Second, before the Eulerian velocity field can be used, its periodicity must be ensured by copying plane 1 onto plane ng+1, plane 2 onto plane ng+2, and plane ng onto plane 0, in all three coordinate directions. Third, after the Eulerian force density has been computed, it must be ``folded'' into itself by adding the data on plane ng+1 into the data on plane 1, the data on plane ng+2 into the data on plane 2, and the data on plane 0 into the data on plane ng, again in all three coordinate directions. These plane-copying and folding operations must be done sequentially over the three coordinate directions to get the edges and corners right, but there is plenty of opportunity for parallelization and vectorization within each copy or fold. Since each plane has two coordinate directions, we can parallelize over one and vectorize over the other.
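The periodic-image, plane-copying, and folding operations just described can be sketched as follows (a Python/NumPy illustration under our own naming; the production code is Fortran and additionally parallelizes and vectorizes within each copy or fold):

```python
import numpy as np

ng = 8

def periodic_image(x, ng):
    """Map a Lagrangian coordinate to its periodic image in [1, ng+1)."""
    return 1.0 + (x - 1.0) % ng

# Eulerian arrays carry ghost planes 0, ng+1, and ng+2 in each direction
# (indices 0 .. ng+2).  Before interpolating the velocity, copy
# plane 1 -> ng+1, plane 2 -> ng+2, and plane ng -> 0, one coordinate
# direction at a time (sequentially, to get edges and corners right):
u = np.random.rand(ng + 3, ng + 3, ng + 3)
for axis in range(3):
    for src, dst in [(1, ng + 1), (2, ng + 2), (ng, 0)]:
        s = [slice(None)] * 3; s[axis] = src
        d = [slice(None)] * 3; d[axis] = dst
        u[tuple(d)] = u[tuple(s)]

# After spreading forces, fold the ghost planes back into the interior:
# plane ng+1 adds into plane 1, ng+2 into 2, and 0 into ng, again
# sequentially over the three directions:
f = np.random.rand(ng + 3, ng + 3, ng + 3)
total_before = f.sum()
for axis in range(3):
    for src, dst in [(ng + 1, 1), (ng + 2, 2), (0, ng)]:
        s = [slice(None)] * 3; s[axis] = src
        d = [slice(None)] * 3; d[axis] = dst
        f[tuple(d)] += f[tuple(s)]

# No force is lost: everything lands in the interior planes 1..ng.
assert np.isclose(f[1:ng+1, 1:ng+1, 1:ng+1].sum(), total_before)
```

Note that the fold conserves the total force exactly: the sum over the interior cube after folding equals the sum over the whole padded array before it.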

Performance Results

The performance results are summarized in the following table, which was obtained on the Cray C-90 computer at the Pittsburgh Supercomputing Center running in dedicated mode (no competition from other users). The fluid lattice in this computation was an ng x ng x ng periodic cube, and the number of fiber points was roughly 500,000. The first column gives the number of processors employed by the computation. The next four columns give the elapsed time in seconds for one timestep, broken down into the four major components of a timestep. The component denoted FLUID refers to the integration of the Navier-Stokes equations; the component denoted FIBER refers to the computation of the fiber force; the component denoted PUSH is the interaction routine in which the Lagrangian fiber force is converted to an Eulerian force density on the fluid lattice; and the component denoted MOVE is the interaction routine in which the fluid velocity is interpolated and the fiber points are moved. The sixth column, labeled GFLOPS, is the computation rate in units of 10^9 floating point operations per second. It was obtained by dividing the number of floating point operations performed in a timestep by the total elapsed time for a timestep, in seconds. Finally, the last column gives the speedup in comparison with the single-processor case.


 NCPUs  FLUID   FIBER   PUSH   MOVE   GFLOPS  SPEEDUP
        ------elapsed time (seconds)------
  1     2.71   0.107   2.64   2.75   0.258     1.0
  2     1.37   0.0574  1.34   1.40   0.508     2.0
  4     0.703  0.0325  0.677  0.704  1.00      3.9
  8     0.383  0.0198  0.356  0.365  1.89      7.3
 16     0.399  0.0149  0.234  0.199  2.50      9.7

The noteworthy features here are as follows. First, the single-processor computation rate is about 0.258 x 10^9 floating point operations per second, which is about 1/4 of the theoretical maximum of a single Cray C-90 processor. Second, the strictly Lagrangian part of the computation, i.e., FIBER, takes a negligible amount of time, but the other three parts (FLUID, PUSH, and MOVE) all take comparable amounts of time, and this remains true at all processor counts. Thus it appears that the parallelization is equally effective in the different parts of the code. (If one part contained a serial bottleneck, its time would have come to dominate as the number of processors was increased.) Finally, note that the speedup is near-optimal (twice the speed when twice as many processors are employed) in going from 1 to 2, from 2 to 4, and from 4 to 8 processors. In the jump from 8 to 16 processors, however, the speedup is considerably less than a factor of 2, indicating that some inefficiencies have begun to creep in. These are most pronounced in the FLUID part of the code, which is slightly slower with 16 processors than with 8.
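The internal consistency of the table is easy to verify: the speedup column follows from the timing columns, and the reported rates imply a roughly constant operation count per timestep of about 2.1 x 10^9 flops (that count is inferred here from the reported rates and times, not restated from the text):

```python
# (NCPUs, FLUID, FIBER, PUSH, MOVE, reported Gflops, reported speedup)
table = [
    (1,  2.71,  0.107,  2.64,  2.75,  0.258, 1.0),
    (2,  1.37,  0.0574, 1.34,  1.40,  0.508, 2.0),
    (4,  0.703, 0.0325, 0.677, 0.704, 1.00,  3.9),
    (8,  0.383, 0.0198, 0.356, 0.365, 1.89,  7.3),
    (16, 0.399, 0.0149, 0.234, 0.199, 2.50,  9.7),
]

t1 = sum(table[0][1:5])                        # single-processor step time
for ncpu, fluid, fiber, push, move, gflops, speedup in table:
    tstep = fluid + fiber + push + move        # total time per timestep
    assert abs(t1 / tstep - speedup) < 0.06    # speedup = T(1)/T(p)
    # rate * time = flops per timestep, roughly constant across rows:
    assert abs(gflops * tstep - 2.12) < 0.02   # ~2.1e9 flops per step
```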

Based on these results, it seems reasonable to specify 8 processors as the upper limit that should be used on this particular code (though if one is in a real hurry, 16 will do the job significantly faster at the price of some inefficiency). Fortunately, the code occupies only about 50 MWords (400 MBytes) of central memory, which is much less than half of the machine, so the remaining 8 processors can still be busy servicing other users, or even another version of the same code running a different case.

Results Concerning the Heart

The methodology described above has been applied to the computation of blood flow in the beating mammalian heart. Results are shown in Figures 2-4. The parameters used to obtain these results are those of the human heart. (This includes the viscosity, which in our earlier work had to be artificially elevated to achieve numerical stability.)

The construction of the heart model is described in [5]. It includes left and right atria and ventricles; mitral, aortic, tricuspid, and pulmonic valves; ascending aorta and main pulmonary artery; superior and inferior venae cavae; and the four pulmonary veins. These simulated tissues are all composed of elastic fibers, some of which are contractile, to simulate cardiac muscle. Contraction and relaxation of the contractile fibers are simulated by appropriately timed changes in elastic parameters such as stiffness and rest length. The arteries and veins of the model have blind ends but are equipped with sources and sinks that establish appropriate pressure loads for the model heart.

In Figures 2-4, fluid flow is depicted in terms of streaklines that show the trajectories of fluid particles over the recent past, relative to the time of the frame that is being shown. The particles that generate the streaklines are introduced into the computation at t=0 and are moved at the local fluid velocity by the same algorithm that is used to move the fiber points. The fluid markers differ from the fiber points in only one respect: the markers exert no force on the fluid.

Figure 2 shows ventricular diastole (relaxation) and ventricular systole (contraction), each from two different viewpoints. One of these viewpoints is such that the interventricular septum is seen on edge. In this view both the left and right ventricles can be seen. The left ventricle contains red fluid markers to indicate oxygenated blood, and the right ventricle contains blue fluid markers to indicate partially deoxygenated blood. In the second viewpoint, the heart is rotated so that the right ventricle appears in front, obscuring the left ventricle. On the left side of the heart, the diastolic frame shows filling through the mitral valve. A vortex ring that forms in the left ventricle during this process is also visible. On the right side of the heart, the diastolic frame shows two streams from the superior and inferior venae cavae merging to fill the right ventricle. Later in diastole (not shown), these streams form a prominent vortex that swirls around the right ventricular cavity. On both sides of the heart, the systolic frame shows ejection through the appropriate artery, oxygenated blood on the left side of the heart entering the aorta, and deoxygenated blood on the right side entering the main pulmonary artery.

Detail of the left-sided valves is shown in Figures 3-4. In diastole (Figure 3) the inflow (mitral) valve is open and the outflow (aortic) valve is closed. In systole (Figure 4), the state of the valves reverses, so that the inflow (mitral) valve is closed but the outflow (aortic) valve is open. The mitral valve is supported by chordae tendineae that insert into two papillary muscles, each of which is an outgrowth of the left ventricular wall. The aortic valve, by contrast, is self-supporting. The mathematical model that we use to define the fiber architecture of the aortic valve is described in [7].

The most serious problem with the version of the heart model shown here is that it suffers from substantial aortic regurgitation. During systole, while forward flow is proceeding, the fibers near the edge of the aortic leaflets twist around each other to such an extent that the leaflets become effectively smaller. During the subsequent diastole (not shown), these smaller leaflets are too small to meet properly. A hole develops through which an intense jet of blood pours backwards from the high-pressure aorta into the low-pressure (during diastole) left ventricle, thus forming a vortex with the opposite polarity to the one that forms normally during diastole. Our current work, therefore, is directed towards eliminating this malfunction of the aortic valve so that the model heart will function in a healthy manner.

Once the aortic regurgitation has been ``cured'', we plan to evaluate the model by comparison with the results of imaging methods that can now be used for non-invasive measurement of the cardiac flow pattern, such as color-Doppler ultrasound and magnetic resonance imaging. We then plan to use the model heart as a computer test chamber for the design of prosthetic cardiac valves, as a tool for studying heart physiology by perturbing that physiology and observing the effects on cardiac performance, and also as a means of investigating a variety of disease states affecting the mechanical function of the heart and its valves. For examples of such applications of an earlier, two-dimensional model of the left heart only, see [8-12].


The work described in this paper was begun on the ``Cray-0'' (better known as the CDC 6600) and has made use of the Cray-1, Cray-2, Cray X-MP, Cray Y-MP, and most recently the Cray C-90 computer. It therefore seems fitting to conclude with a word about Seymour Cray, the creator of this remarkable line of hardware. The user of a Cray computer cannot escape the feeling that he knows Cray personally, and, moreover, that Cray has deeply understood his particular application and designed the machine accordingly. Cray understood that fields like fluid dynamics require highly repetitive structured computations to proceed at lightning speed. He introduced vectorization to fill this need. He understood that the scope of a physical simulation is entirely determined by the speed and size of the computer on which it is to be performed, and he struggled to build the very best hardware that could possibly be created at any given time. He understood that serious scientists would rather have a machine that required some hard work to get the job done than a machine that was easy to use but couldn't get the job done at all. And he built what was needed to predict the weather, to design aircraft, and to simulate the human heart.


The authors are indebted to the National Science Foundation for support of this work under research grants BIR-9224743 and BIR-9302545. Computation was performed at the Pittsburgh Supercomputing Center (PSC) on the Cray C-90 computer under an allocation of resources MCA93S004P from the MetaCenter Allocation Committee. We are particularly indebted to the PSC for the early access that we were given to the Cray C-90 computer when that machine was new, and to Koushik Ghosh, then of Cray Research, for providing the three-dimensional FFT code that is used in this work. Visualization of results was done at the Scientific Visualization Laboratory, Academic Computing Facility, New York University.


[1] Peskin CS and McQueen DM: A three-dimensional computational method for blood flow in the heart: (I) immersed elastic fibers in a viscous incompressible fluid. J. Comput. Phys. 81:372-405, 1989

[2] McQueen DM and Peskin CS: A three-dimensional computational method for blood flow in the heart: (II) contractile fibers. J. Comput. Phys. 82:289-297, 1989

[3] Peskin CS and McQueen DM: Cardiac fluid dynamics. Crit. Rev. Biomed. Eng. 20:451-459, 1992

[4] Peskin CS and McQueen DM: A general method for the computer simulation of biological systems interacting with fluids. In: Biological Fluid Dynamics (Ellington CP and Pedley TJ, eds.), The Company of Biologists Limited, Cambridge UK, 1995, pp. 265-276.

[5] Peskin CS and McQueen DM: Fluid dynamics of the heart and its valves. In: Case Studies in Mathematical Modeling: Ecology, Physiology, and Cell Biology (Othmer HG, Adler FR, Lewis MA, and Dallon JC, eds.), Prentice-Hall, Englewood Cliffs NJ, 1996, pp. 309-337.

[6] Ghosh K: Unpublished communication.

[7] Peskin CS and McQueen DM: Mechanical equilibrium determines the fractal fiber architecture of the aortic heart valve leaflets. Am. J. Physiol. 266: H319-H328, 1994

[8] McQueen DM, Peskin CS, and Yellin EL: Fluid dynamics of the mitral valve: physiological aspects of a mathematical model. Am. J. Physiol. 242:H1095-H1110, 1982

[9] Peskin CS: The fluid dynamics of heart valves: experimental, theoretical, and computational methods. Ann. Rev. Fluid Mech. 14:235-259, 1982

[10] McQueen DM and Peskin CS: Computer-assisted design of pivoting-disc prosthetic mitral valves. J. Thorac. Cardiovasc. Surg. 86:126-135, 1983

[11] McQueen DM and Peskin CS: Computer-assisted design of butterfly bileaflet valves for the mitral position. Scand. J. Thor. Cardiovasc. Surg. 19:139-148, 1985

[12] McQueen DM and Peskin CS: CURVED BUTTERFLY BILEAFLET PROSTHETIC CARDIAC VALVE. US Patent Number 5026391, June 25, 1991.

Figures and Figure Legends

Figure 1: Top view of the three-dimensional Eulerian mesh of the fluid computation, showing the columnar organization that is used to parallelize the interaction routines. The shaded columns (which appear as squares, since they are seen from the top) all belong to the same group of columns. They are far enough apart (at least 4 mesh widths in either coordinate direction) that their (4,4,ng) surrounding columns of mesh points do not overlap. This is indicated by the dotted lines in the figure. The linked-list data structure that keeps track of which Lagrangian (fiber) points belong to each column is shown schematically at the bottom of the figure. A pointer from each column locates the first Lagrangian point in that column. Then, pointers within the shared linked list indicate the next Lagrangian point in the column, the next one after that, and so on until a 0 is encountered, to indicate that there are no more Lagrangian points in the column. At each timestep, the Lagrangian points in one of the shaded columns will move, at most, into a neighboring column (including possibly a diagonal neighbor). The neighborhoods formed by the shaded columns and their neighbors do not overlap, so the shared linked-list data structure can be updated in parallel by parallelizing over the different columns within a group.

Figure 2: Computer model of the heart in diastole and systole. Flow patterns within the heart are shown at selected times during ventricular diastole (relaxation) in the upper part of the figure and during ventricular systole (contraction) in the lower part of the figure. The flow is depicted in terms of streaklines showing the computed trajectories of fluid particles over the recent past. Both in the diastolic and in the systolic case, the view at the left shows the heart in such a way that the interventricular septum is seen on edge, and both ventricles can be visualized. The left ventricle (which is seen to the right of the septum, since the heart is viewed from the front) contains red markers depicting oxygenated blood, and the right ventricle contains blue markers depicting partially deoxygenated blood. The three vessels or chambers that can be seen connecting to the ventricles in this view are, from left to right in the figure, the main pulmonary artery, the ascending aorta, and the left atrium. The corresponding valves, which can also be seen, are the pulmonic, aortic, and mitral valves. In the diastolic frame, note vortex formation as blood flows through the mitral valve. The right atrium is behind the pulmonary artery in this view and cannot be seen clearly. In the diastolic and systolic frames on the right side of the figure, the heart has been rotated so that the right ventricle and right atrium appear in front of other structures and can be clearly seen. In the diastolic frame (above) two streams of blood from the superior and inferior venae cavae are seen merging in the right atrium and flowing into the right ventricle through the tricuspid valve. In the systolic frame (below), blood is pumped out of the right ventricle into the main pulmonary artery.

Figure 3: Detail of the valves of the left ventricle during diastole. The aortic valve is seen above and to the left in the figure, and the mitral valve is seen below and to the right. The aortic valve has three self-supporting leaflets, and the mitral valve has two major leaflets which are supported by chordae tendineae which insert into papillary muscles that project from the left ventricular wall. Since the left ventricle is relaxed at this time, the aortic valve is closed and the mitral valve is open.

Figure 4: Detail of the valves of the left ventricle during systole. See legend of Figure 3. Since the ventricle is contracting at this time, the mitral valve is closed and the aortic valve is open.
