## Abstract

This paper addresses three questions related to the use of parallel computing in multibody dynamics (MBD) simulation. The “why parallel computing?” question is answered based on the argument that in the upcoming decade parallel computing represents the main source of speed improvement in MBD simulation. The answer to “when is it relevant?” is built around the observation that MBD software users are increasingly interested in multi-physics problems that cross disciplinary boundaries and lead to large sets of equations. The “how?” question is addressed by providing an overview of the state of the art in parallel computing. Emphasis is placed on parallelization approaches and support tools specific to MBD simulation. Three MBD applications are presented where parallel computing has been used to increase problem size and/or reduce time to solution. The paper concludes with a summary of best practices relevant when mapping MBD solutions onto parallel computing hardware.

## Introduction and Purpose

Multibody dynamics (MBD) is an area of computational dynamics concerned with the dynamics of mechanical systems consisting of collections of bodies experiencing large displacements and possibly large deformations. The 3D bodies are subject to external forces and their time evolution can be constrained through kinematic constraints that couple the dynamics of pairs of bodies. The bodies can interact mutually through friction and contact, or alternatively, in fluid-solid interaction (FSI) problems, their dynamics interplays with the dynamics of a fluid through a two way coupling mechanism.

Historically, MBD problems have relatively been small. A full vehicle dynamics analysis or an engine dynamics simulation in a commercial code such as ADAMS [1] typically leads to fewer than 50,000 equations. By finite element analysis (FEA) or computational fluid dynamics (CFD) standards, these are small problems. While nonetheless complex, MBD simulation has mostly led to computational problems that by and large have not justified heavy reliance on parallel computing. In fact, compared to solution providers in the FEA and CFD markets, all major players in the MBD arena: ADAMS, RecurDyn, LMS Virtual.Lab, or Simpack, have, until recently, been slow in adopting parallel computing in the equation formulation and numerical solution stages. There are still numerous studies in the MBD literature regarding how parallelism can reduce time to solution, see, for instance, Refs. [2–5]. Parallel computing has already been applied in our community to address domain specific problems, see, Refs. [6–8]. Finally, software for general purpose parallel MBD is discussed in Refs. [9–11].

Two forces are gradually changing the MBD landscape in relation to the simulation of very large systems. First, users of these simulation tools are requesting support for modeling features that cross the boundary of traditional MBD into multi-physics problems. These problems might include a fluid component, a granular component, a terrain component, etc., to yield models characterized by large sets of equations that stand to benefit from parallel computing. Second, once the privilege of a select group of research institutions, hardware that supports effective parallel computing is now ubiquitous. Due to low costs, less than $4000 for a 60 core Intel Xeon Phi chip or$3000 for a 2600 scalar processor Nvidia Kepler graphics processing unit (GPU) card, there is a manifest democratization of the access to parallel computing hardware.

Given these ongoing changes in the MBD simulation arena, the purpose of this paper is fourfold: (i) briefly present actual MBD problems whose modeling results in large sets of equations hence confirming the aforementioned trend; (ii) summarize the state of the art in parallel computing and comment on how it can be leveraged in MBD simulation; (iii) present several applications in which parallel computing was used to reduce the time to solution and/or increase the size of the problem investigated; and (iv) summarize good practices relevant when mapping MBD solutions onto parallel computing hardware.

The paper is structured around these four goals as follows. Section 2 summarizes how MBD problems in granular dynamics and fluid-solid interaction lead to large and challenging numerical problems that stand to benefit from parallel computing. Section 3 discusses parallel computing issues: Is parallel computing a fad?, What parallel computing avenues should one adopt in the context of MBD modeling and simulation? and finally, What are the tools of the trade (libraries, debuggers, profilers, integrated development environments) that one can use to leverage parallel computing in MBD?. Section 4 concentrates on issues that can influence the choice of (a) modeling formulation; (b) numerical methods; and (c) the parallel computing platform, three factors that combine to provide the foundation of a parallel MBD simulation tool. Section 5 describes three applications that relied on different parallel computing platforms to increase problem size and/or reduce time to solution. Section 6 rounds off the paper with closing remarks and several recommendations.

Note that the discussion herein was predicated on the assumption that the MBD problem size is large to the extent that the overhead associated with using high level parallel programming approaches is negligible. Approaches suitable for real-time simulation, when the interest is in characterizing on a tight and predefined schedule the dynamics of small problems, would revolve around programming techniques drawing on low level parallel libraries: e.g., pthreads. These libraries often work within the context of a dedicated and stripped down version of an operating system running on a multicore CPU chip. This discussion; i.e., parallel computing for real-time simulation, falls outside the scope of this contribution.

## MBD: The Need for Parallel Computing

MBD simulation plays an important role in the overall product life cycle by providing early insights in engineering designs through virtual prototyping. The complexity of the models in MBD simulation has evolved from component to subsystem to system. Recent advances in processor performance and memory speed/size, combined with the widespread availability of parallel computing platforms, provide renewed interest in what might be called many-body dynamics: understanding through direct numerical simulation the emergent behavior of large systems of 3D bodies interacting through friction, contact, and potentially long range electrostatic forces. Similar to Molecular Dynamics, the macroscale dynamics represents the emergent behavior of millions of 3D bodies whose motion is monitored in time. A list of applications that rely on many-body dynamics includes: packing of polyhedra for crystalline pattern formation studies; packing of compounds in pharmaceutical pills; drug delivery for powders and suspensions; ground vehicle mobility on granular terrain; additive manufacturing, in which layer upon layer of granular material is moved, shaped, and melted by a 3D printer; granular impact cratering of relevance in astrophysics, etc.

At its core, the many-body dynamics problem is posed like the typical MBD problem. The equations of motion are formulated herein using an absolute, or Cartesian, representation of the attitude of each rigid body in the system. The state of the system is represented by the generalized positions $q=[r1T,ɛ1T,…,rnbT,ɛnbT]T∈R7nb$ and their time derivatives $q·$, where nb is the number of bodies, $rj∈R3$ is the absolute location of the center of mass of body j and the quaternions (Euler parameters) $ɛj∈R4$ are used to represent body orientation. Instead of using quaternion derivatives, $ɛ·$, it is more effective to work with the angular velocities $ω¯j∈R3$, expressed in a local reference frame attached to body j; in other words, the formulation is derived in terms of the generalized velocities $v=[r·1T,ω¯1T,…,r·nbT,ω¯nbT]T∈R6nb$. Given that $ɛ·j=12GjTω¯j$, with the 3 × 4 matrix $Gj≡G(ɛj)$ defined as in Ref. [12], the relationship between the time derivatives of the generalized positions and the generalized velocities is the linear mapping $q·=L(q)v$.

Bilateral constraints representing kinematic joints (spherical, prismatic, revolute, etc.) are included as algebraic equations restricting the relative position of two rigid bodies. Each constraint $i∈B$, the set of bilateral constraints, is represented by the scalar algebraic equation $Ψi(q,t)=0$ and transmits reaction forces to the connected bodies by means of a multiplier $γ∧i,b$. Assuming smoothness of the constraint manifold, $Ψi(q,t)$ can be differentiated to obtain the Jacobian $∇qΨi=[∂Ψi/∂q]T$. In what follows we will also use the notation $∇Ψi≡LT(q)·∇qΨi$.

Contact and friction forces are included in the system through complementarity constraints. Given a large number of rigid bodies with different shapes, modern collision detection algorithms are able to efficiently find a set of contact points, that is, points where a gap function Φ(q) can be defined for each pair of near-enough geometry features. Where defined, such a gap function must satisfy the nonpenetration condition Φ(q) ≥ 0 for all contact points. More precisely, when a contact i is active (Φi(q) = 0), a normal force and a tangential friction force act on each of the two bodies at the contact point; if the contact is not active (Φi(q) > 0), no contact or friction forces exist. This results in a mathematical description of the model known as a complementarity problem [13]. In what follows, $A(q(t))$ denotes the set of all active contacts for a given configuration q of the system at time t.

The contact model can be augmented with the classical Coulomb model to define the friction forces acting at the interface between two bodies in contact. Consider the contact $i∈A$ between two bodies A and B as shown in Fig. 1. Let ni be the normal at the contact pointing toward the exterior of the body of lower index, which by convention is considered to be body A. Let ui and wi be two unit vectors in the contact plane such that ni, ui, $wi∈R3$ are mutually orthogonal. The frictional contact force is applied on the system by means of multipliers $γ∧i,n≥0,γ∧i,u$, and $γ∧i,w$, leading to the normal and tangential components $Fi,N=γ∧i,nni$ and $Fi,T=γ∧i,uui+γ∧i,wwi$, respectively. The Coulomb model is expressed as the maximum dissipation principle

Fig. 1
Fig. 1
Close modal
$(γ∧i,u,γ∧i,w)=argminγ∧i,u2+γ∧i,w2≤μiγ∧i,nvi,TT(γ∧i,uui+γ∧i,wwi)$
(1)

where μi is the friction coefficient associated with contact i.

The time evolution of the dynamical system is governed by the following differential problem with set-valued functions and complementarity constraints, which is equivalent to a differential variational inequality (DVI) [14]:
$q·=L(q)vMv·=f(t,q,v)+∑i∈Bγ∧i,b∇Ψi+∑i∈A(γ∧i,nDi,n+γ∧i,uDi,u+γ∧i,wDi,w)i∈B:Ψi(q,t)=0i∈A(q(t)):0≤γ∧i,n ⊥ Φi(q)≥0, and(γ∧i,u,γ∧i,w)=argmin(γ∧i,u)2+(γ∧i,w)2≤μiγ∧i,nvT(γ∧i,uDi,u+γ∧i,wDi,w)$
(2)

where $f(t,q,v)$ denotes the external (applied) generalized forces and the notation ⊥ is used to indicate orthogonality, i.e., $γ∧i,nTΦi(q)=0$.

The tangent space generators $Di=[Di,n,Di,u,Di,w]∈R6nb×3$ are defined as
$DiT=[0…-AiT AiTAAs¯˜i,A…AiT -AiTABs¯˜i,B…0]$
(3)

where $Ai=[ni,ui,wi]∈R3×3$ is the orientation matrix associated with contact i, $AA=A(ɛA)$ and $AB=A(ɛB)$ are the rotation matrices of bodies A and B, respectively, and the vectors $s¯i,A$ and $s¯i,B$ represent the contact point positions in body-relative coordinates as illustrated in Fig. 1. The mass matrix M is constant and diagonal due to the choice of body-fixed centroidal reference frames.

Under numerical discretization, see, for instance, Refs. [15–18], the numerical solution calls at each time step for solving a quadratic optimization problem with conic constraints:
$minq(γ)=12γTNγ+rTγsubject toγi,u2+γi,w2≤μiγi,n, ∀i∈A$
(4)
where N and r are a constant symmetric positive semidefinite matrix and a constant vector that depend on system state at the beginning of the time step; $γ=[γ1T,γ2T,…,γncT]T$ and $γi=[γi,n,γi,u,γi,w]T=h[γ∧i,n,γ∧i,u,γ∧i,w]T$ is the triplet expressing the magnitude of the contact impulses for contact i, with h denoting the integration step-size. This constrained optimization problem is solved by an iterative method or interior point method [18]. In either case, the dimension of the problem is large. As an example, the penetration study illustrated in Fig. 2
Fig. 2

Cross-section view of 3D normal contact forces in granular material (modeled using 0.6 × 106 rigid bodies), during impact by spherical object

Fig. 2

Cross-section view of 3D normal contact forces in granular material (modeled using 0.6 × 106 rigid bodies), during impact by spherical object

Close modal
used 0.6 × 106 bodies [18]. There are approximately 2 × 106 contacts at any time step, each introducing a set of three unknowns γi, resulting in an optimization problem of dimension around 6 × 106. Parallel computing is essential in solving this subproblem and the collision detection that determines the set of contacts in the first place. For a typical integration step-size h = 10−4 s, a number of 10,000 solutions of Eq. (4) and 10,000 collision detections are carried out for each second of simulation.

There are several alternative approaches to modeling frictional contact beyond the one presented above. The most widely used relies on a penalty/regularization approach: a small penetration is allowed between two bodies and the interface frictional-contact force is proportional to the amount of penetration and relative velocity of the bodies at the contact point. A recent review of literature on penalty methods is provided in Ref. [19]. While penalty methods do not call for the costly solution of the optimization problem in Eq. (4), they are hampered by much smaller integration time steps (h ≈ 10−6 s). Like the DVI approach described above, penalty methods also have to deal with very large sets of differential equations resulting from the use of seven generalized positions and six generalized velocities for each 3D body in the system (almost 8 × 106 equations for the problem in Fig. 2).

The FSI problem is very briefly discussed herein as another MBD application that leads to large sets of equations and whose solution requires complex numerical methods. The fluid flow can be represented in either an Eulerian or a Lagrangian framework. Provided that the interfacial forces are captured thoroughly, the Lagrangian framework is capable of tracking the domain deformation introduced by the motion of the solid phase at almost no extra cost. Smoothed Particle Hydrodynamics (SPH) [20–22], its modifications [23,24], and variations [25] have been widely used for the simulation of the fluid domain in a Lagrangian framework. The main evolution equations of the fluid flow using SPH are expressed as
$dρadt=ρa∑bmbρb(va-vb)·∇aWab$
(5)
$dvadt=-∑bmb[(pbρa2+paρb2)∇aWab -(μa+μb)rab·∇aWabρ¯ab2(rab2+ɛh¯ab2)vab]$
(6)

which are solved in conjunction with dra/dt = va to update the fluid properties at each SPH marker a. Above, ρ, μ, v, and p are local fluid density, viscosity, velocity, and pressure, respectively, m is the representative fluid mass assigned to the SPH marker, and rab is the distance between two fluid markers a and b. The kernel function $Wab≡W(rab,h)$ is used to smooth out the local fluid properties within a resolution length h. The fluid flow evolution equations are solved explicitly, where pressure is related to density via an appropriate state equation to maintain the compressibility below 1%. Accuracy and stability can be improved through an XSPH modification [23] and Shephard filtering [26].

The fluid-solid coupling should satisfy the no-slip and impenetrability conditions on the surface of the solid obstacles. By attaching boundary condition enforcing (BCE) markers on the surface of the solid objects (see Fig. 3), the local relative velocity between the two phases is enforced to be zero at the marker locations. The position and velocity of the BCE markers are updated according to the motion of the solid phase, resulting in the propagation of the solid motion to the fluid domain. In turn, the interaction forces evaluated at the BCE markers are used to calculate the total force and torque exerted by the fluid on the solid.

Fig. 3
Fig. 3
Close modal

FSI applications lead to large problems owing to the number of SPH markers that need to be considered in the solution. The results reported in Refs. [27,28] drew on models with close to one million degrees of freedom and thus provide ample opportunities for parallel computing. For the specifics of a parallel implementation of the FSI solution, the interested reader is referred to Ref. [11].

## Parallel Computing: The Long Term View, Alternatives, and Solution Development Support

For the purpose of this discussion, parallel computing refers to the process of enlisting multiple hardware assets (cores or processors) to simultaneously execute, in a fashion at least partially controlled by the application developer, either (a) multiple sequences of instructions of a program at the same time; or (b) the same sequence of instructions acting at the same time on multiple instances of data. Following Flynn's taxonomy of parallel computers [29], the former covers multiple instruction multiple data (MIMD) architectures, while the latter refers to single instruction multiple data (SIMD) architectures. Herein, we concentrate on (i) distributed computing using the message passing interface (MPI) [30], which falls into (a) above; (ii) general-purpose GPU computing (GPGPU), which is an example of (b) above; and (iii) symmetric multiprocessing (SMP) with OpenMP [31], which can fall under (a) or (b). Details about MPI, GPGPU, and OpenMP will be provided after discussing the grounds for a migration to parallel computing in CAE in general and MBD in particular. The section will conclude with a brief summary of productivity tools: compilers, debuggers, profilers, development environments, and libraries that support the three main parallel computing platform considered in this paper; i.e., GPGPU, OpenMP, and MPI.

### Is Parallel Computing a Fad?.

There were three factors that precipitated the rise of parallel computing in the mid 2000s. Also called the three walls to sequential computing, these factors marked a shift in the chip industry towards multicore designs.

The memory wall is a manifestation of the fact that, for decades, memory speed increased at a pace significantly slower than that of data processing. For instance, from 1986 to 2000, the CPU speed improved at an annual rate of 55%, while the main memory access speed improved only at 10% annually [32]. It became apparent that making a faster and faster processor will only lead to scenarios where this fast processor would be idle waiting on data to be moved back and forth. Indeed, a memory request that leads to a cache miss and thus triggers a main memory (RAM) transfer typically requires hundreds of cycles. If the cache miss is combined with a translation look-aside buffer (TLB) miss, this cost can be orders of magnitude larger. Comparing this to the one cycle required nowadays to carry out an add-multiply operation, led to the assertion that computation is not about number crunching, which basically takes no time, but rather data transfer.

There are two aspects to memory speed: latency and bandwidth. Latency is essentially the amount of time it takes for a memory transaction to begin; once the transaction has begun, bandwidth is a measure of the rate at which data can be transferred from the provider of data to the consumer of data. Reducing latency is difficult; increasing bandwidth is relatively easy by adding more lanes in the memory bus, increasing the memory clock frequency, or using techniques that allow moving bits over several phases of the signal and/or wavelengths. In other words, there is little difference between moving 4 bytes or 512 bytes since the cost is not dictated by the bandwidth constraint as much as it is by the latency associated with the data movement request. As such, it became apparent that if data must be moved then it is advantageous to move it in big chunks. This favored parallel execution, especially in its SIMD flavor.

The power wall has manifested as a constraint placed on the clock frequency at which a chip could operate. It was postulated that the transistors used in computers”continue to function as voltage-controlled switches—while all key figures of merit such as layout density, operating speed, and energy efficiency improve, provided geometric dimensions, voltages, and doping concentrations are consistently scaled to maintain the same electric field” [33]. It thus makes sense to decrease both the typical feature size and voltage in a chip, since when these changes are combined with appropriate adjustments in doping concentrations the end product is superior: more transistors per unit area, which is essentially what fueled Moore's law for the last four decades. However, there are physical limits that challenge this quest for smaller operation voltage and feature length. Presently, the voltage can only be marginally decreased—aiming for voltages below 0.3 V introduces risk because controlling the transistors becomes very hard in an electrically noisy environment. Decreasing the feature length is challenging too since the insulator that separates the gate from the rest of the MOS transistor becomes narrower as the feature length decreases. This translates into power leaks, which are important since power dissipation on a chip changes as follows: it is increased by the leaks, it increases quadratically with the voltage and linearly with the clock frequency at which the transistors are operated. As seen, lowering the voltage is not an option anymore and as such, as the feature length goes down, the frequency must be reduced or at best can stay constant. Otherwise, the amount of heat dissipated goes up, which can trigger a thermal runaway; i.e., a situation where an increase in temperature raises the resistance of the transistors leading to even more power being dissipated. The device either fails sooner or functions unreliably.

Finally, the instruction level parallelism (ILP) wall reflects the inability of parallel execution at the chip level to yield further substantial efficiency gains in sequential programming. In other words, techniques that have benefited sequential computing in the past—instruction pipelining, superscalar execution, out-of-order execution, register renaming, branch prediction, and speculative execution—have fulfilled to a very large extent their potential. Interestingly, despite the use of the word parallelism, the above ILP techniques were all meant to speed up sequential computing. For instance, pipelining allows the simultaneous execution of several instructions at the same time; however, all these instructions belonged to one sequential stream of instruction, which now was executed faster since at each clock tick one instruction would be finished. The obvious mechanism of obtaining higher speedups by making the pipeline deeper is counter-balanced by the consequent rise in instruction dependency. Register forwarding, register renaming, and several other techniques eliminate some of these conflicts but pipeline stalls are inherent in deep pipelines. Superscalar solutions employ two parallel pipelines that nonetheless work on the same stream of instructions and support speculative execution, out-of-order execution, branch prediction, etc. All these tricks eventually reached a level where the complexity of managing them did not justify the return on investment in terms of power and/or die area costs.

In the early to mid 2000s, it became apparent that the momentum behind the sequential computing in the server and consumer market was waning. Memory access latencies could not be hidden behind one stream of instructions and the gap between data processing and data movement was getting wider; the frequency could not be increased any further without resorting to solutions that were more expensive than the market was accepting; and the ILP had for most part exhausted its set of tricks. Moore's law remained one bright spot in the picture and provided a safe bet for continued performance improvement. Two of the leading chip manufacturers, Intel and IBM, had roadmaps in place that suggested that the feature length was bound to keep decreasing for close to two decades. For instance, Intel released Ivy Bridge in 2012, which is based on 22 nm technology. It will release a chip based on 14 nm technology next year and 10 nm technology in 2016, with 7 nm and 5 nm estimated for 2020 and 2022, respectively. This ongoing miniaturization trend sends a clear message that more and more cores will be available on a chip, making the parallel computing more and more prevalent.

To put things in perspective and understand the pace at which the technology advanced, less than one decade ago, a 1 TB distributed memory IBM Blue Gene/L with 1024 PowerPC cores used to deliver 5.7 TFlops. It came at a price of approximately $1.4 × 106, required a dedicated power source, and needed hardware and software support staff. Today, a powerful desktop with 1 TB of RAM, a motherboard with two Intel Xeon chips for 32 cores (64 virtual cores), and four Kepler GPU cards delivers more than 6 Tflops at a price of$70,000. It does not have steep power requirements, it does not require a system administrator to maintain it, can run Linux or Windows, and offers a comprehensive set of productivity tools. This trend toward more performance for lower costs is expected to continue in the same directions, namely more cores packaged on a chip and an increase in memory bandwidth allowing to hide memory latency with useful execution. In other words, the new parallel computing paradigm is anchored into technology realities.

### Relevant Flavors of Parallel Computing.

While the hardware transition to multi- and many-core chip layouts happened quickly, the software shift to using parallel computing languages and libraries to produce programs that leverage this hardware has been and continues to be arduous.

For fine gain parallelism that suits the SIMD model, GPU computing is the best match in terms of hardware and software support. The most recent GPU card from Nvidia belongs to the Kepler architecture. It has 250–320 GB/s bandwidth2 to feed about 2500 scalar processors, which can deliver up to 1.31 TFlops. This is approximately 25% of an IBM Blue Gene/L of one decade ago. The GPU's weak spot is associated with its memory size. At 6–8 GB is in many cases sufficient but very large flexible body dynamics problems or fluid-solid interaction problems might run into memory space issues. These are usually resolved by shuttling data back and forth between the GPU and the host CPU. Yet this happens over a PCI Express bus that usually sustains 8 GB/s bandwidths and thus becomes a simulation bottleneck. In terms of software support, these GPUs are programmed using either OpenCL, which is an open standard [34], or CUDA [35]. The latter currently has wider adoption and counts on a more mature set of productivity tools.

In our experience, GPU computing turned out to be extremely suitable for simulations of granular material, as further discussed in Sec. 5.3. Another application suitable for GPU computing is large scale flexible body dynamics where one has to handle tens to hundreds of thousands of elements. Such a problem was approached in the framework of absolute nodal coordinate formulation (ANCF) in Ref. [36]. Likewise, any analysis that involves a large set of data that needs to be processed is very likely amenable to GPU computing. For instance, FEA would fall into this category, since each node and/or element in the mesh should be visited in order to evaluate a stiffness matrix and, for implicit integration-based nonlinear FEA, to compute the ingredients used in assembling the numerical discretization Jacobian. Likewise, N-body problems, where one has to loop through all bodies to assess the influence of the neighboring bodies, provide a good setup for effective use of GPU computing.

GPU computing falters when dealing with problems with relatively small data sets or when the tasks performed are different for each member of the data set. Unfortunately, this turns out to be the case of classical MBD. Most systems include less than 500 rigid bodies. Even when one brings into the picture a floating frame of reference formulation for handling flexibility, typically the size of the problem does not justify the use of GPU computing. Providing a size after which GPU makes sense depends on the type of GPU being considered. Today's high end GPUs have 2500 scalar processors or more. To get a good occupancy of the hardware, one would have to use in CUDA close to 30,000 threads to keep these 2500 busy. In other words, one would need to deal with an MBD problem where there are at least 30,000 quantities that need to be computed. Moreover, since the GPU is a SIMD-type device, it is desirable to have these threads to work on very similar problems e.g., compute the stiffness matrix of 30,000 elements, or evaluate the normal force at 30,000 contact points, etc.

The OpenMP standard is supported very well by Intel and AMD chips. OpenMP can be presently used on a power workstation that might have 0.5 TB of RAM and 64 cores. Additional compute power can be accessed if the system has Intel Xeon Phi accelerators, each of which brings to the table more than 250 parallel threads. Unlike GPU computing, OpenMP (a) supports very well coarse grain parallelism and (b) allows one to work in a more convenient fashion with very large data sets. Existing workstations routinely have hundreds of GB of main memory, which is 2 orders of magnitude more memory than one can access on the GPU. The latter can directly access the host main memory, but this comes at a stiff price as the transaction is carried out over the PCI express bus. The GPU wins when it comes to sheer memory bandwidth and flop rate. Number crunching is basically free as long as data are available; the problems crop up when cache misses require costly trips to main memory. GPU can do a better job here since (i) it operates at higher bandwidth (roughly 300 GB/s as opposed to roughly 50 GB/s for multicore CPUs); and (ii) it is better at hiding memory latency with useful work. The latter is accomplished by overcommitting the 2500 scalar processors on the GPU to handle tens of thousands of threads to increases the chance that the scheduler will find a warp of threads that are ready for execution and thus avoid stalls of the scalar processors. This is very similar to the idea of hyper-threading except that while on the Intel architecture one can have two candidates for occupying the hardware, in GPU computing one warp scheduler has, on average, up to 16 choices from which it can select the next warp of threads that will be executed. This is because each stream multiprocessor on a GPU card can be assigned for execution up to 64 warps (groups of 32 threads) and it uses four warp schedulers.

In terms of source code modification, OpenMP provides a very low entry point into parallel computing. For instance, a for-loop that might evaluate the generalized forces acting on each body in the system can be parallelized by simply adding one line of code. Unfortunately, OpenMP does not scale well; i.e., it works on a powerful workstation but cannot be implemented for distributed memory architectures. Specifically, one cannot spread parallelism on two workstations solely using OpenMP.

Going beyond one workstation is typically achieved by using implementations of the Message Passing Interface (MPI) standard [30], such as MPICH2 [37] or Open MPI [38]. MPI will become indispensable when large amounts of memory are needed, such as for large granular dynamics or fluid-solid interaction problems. The MPI nodes are linked together through a fast interconnect for passing data back and forth. Typical bandwidths are of the order of 40 Gb/s; i.e., close to what one would get over a PCI connection. However, at 1 μs [39], the latency is significantly larger, which is likely to impact MBD simulation since dynamics simulation is communication bound. Specifically, at each time step, the nodes need to exchange data to advance the dynamics simulation. If, for instance, the step size in a penalty approach to handling frictional contact is 10−6 s, this means that each second of simulation requires 1,000,000 data exchanges. This places a significant bottleneck on the simulation essentially overshadowing any other aspect of the computation to the point where the execution time will be mostly spent in sending messages back and forth between the collaborating nodes and waiting for such messages to arrive.

### Productivity Tools and Library Support.

Today, a developer can write parallel software without writing any line of code that will execute directly on the GPU. There are several relatively mature libraries that provide a low entry point for GPU computing; the user only needs to make a call into such a library to obtain, for instance, the solution of a sparse linear system using an iterative method that will be executed on the GPU. CUSP [40] is an open source library of generic parallel algorithms for sparse matrix and graph computations. It provides several solvers for linear systems (conjugate-gradient, biconjugate gradient, biconjugate gradient stabilized, generalized minimum residual, multimass conjugate-gradient, and multi-mass biconjugate gradient stabilized), preconditioners (algebraic multigrid based on smoothed aggregation, approximate inverse, diagonal), and several utilities related to manipulating dense and sparse matrices. Basic operations such as prefix scans, gather, scatter, sort, etc. are supported by the open source template library Thrust [41]. Operations that combine several arrays to produce an output array can be implemented by writing only a few lines of C++ code that are subsequently executed in an optimized fashion on the GPU.

As an alternative to using GPGPU libraries, a developer can implement functions (kernels) that are directly executed on the GPU. This is not a low entry point to GPGPU computing since writing efficient kernels requires a good understanding of the GPU's deep memory hierarchy (registers, local memory, shared memory, constant memory, texture memory, global memory, cache levels) and how to set up an optimal kernel execution configuration, namely defining (i) the number of parallel threads in a block and (ii) the number of such blocks of threads launched in a grid. For instance, if a number of 0.5 × 106 force computations are necessary, the user should specify how a number of at least 0.5 × 106 threads will be launched to compute the forces in parallel. One can choose a large number of threads per block and a small number of blocks; alternatively, one can decide on a small number of threads but large number of blocks. These decisions are based on several factors, including the number of registers (per thread) used by the kernel and the amount of required shared memory (per block). Moreover, these execution configurations may have to change from GPU to GPU, based on hardware capabilities. The implementation effort for developing GPU kernels can be significantly reduced by available development tools. Nvidia's Nsight development platform offers an integrated environment for building, debugging, profiling and tracing parallel applications using CUDA C/C++. Available both under Windows and Linux and integrated with Visual Studio and Eclipse, respectively, the Nsight debugger supports all common debugging tasks (setting breakpoints in CUDA source, viewing local variables and inspecting memory, navigate the current call stack, etc.) as well as switching the debugging context to any thread or warp of threads. The latter is critical for effective debugging of GPU parallel software which routinely involves launching tens of thousands of parallel threads. The analysis and profiling tools provided with Nsight help the developer understand overall performance and identify hot spots and bottlenecks. The user can see the combined CPU/GPU activity along a timeline or query performance information at a very fine level of granularity (e.g., level 1 and 2 cache hits, effective memory bandwidths and device utilization, register usage).

OpenMP enjoys a similar level of library and productivity tools support. The philosophy behind OpenMP, shared also by the language cilk and the Intel threading building blocks TBB library is that the developer need only expose parallelism after which the runtime environment, through an internal scheduler, will manage the distribution of the work to parallel execution threads. Exposing parallelism can be done with compile-time directives (OpenMP), simple language extensions (cilk), or library calls (TBB). In terms of library support, the Intel Math Kernel Library (MKL) is a mature solution that leverages multicore chips for fast math operations, including the solution of sparse, dense, and banded linear systems. Preconditioners are available, as are iterative solvers. Finally, debugging and profiling OpenMP in particular and any multi-core code in general is facilitated by a series of productivity tools released by Intel: Intel VTune Amplifier, Intel Inspector, and Intel Advisor. The tools can be used for debugging, code profiling, and code analysis.

Distributed (message passing) computing remains the privilege of a relatively small number of organizations. This is reflected in the relatively low number of options available for MPI distributed computing, of which two solutions stand out: OpenMPI [42] and MPICH2 [43]. The latter implements MPI-3, the most recent version of the standard, and is the origin of several other vendor implementations tuned to specific hardware; Intel, IBM, and Cray each have their own MPI derivative based on MPICH2.

Several open source libraries are available for large scale distributed simulation using MPI. LAMMPS [44,45] provides functionality originally supporting molecular dynamics simulations. Over time, LAMMPS or derivatives have been used in peridynamics [46] or more recently in granular dynamics applications [47]. Unlike LAMMPS, which focuses on many-body dynamics, PETSc [48] is an infrastructure that leverages MPI parallel computing for scientific applications modeled with partial differential equations. It provides extensive support for linear algebra, such as algebraic multigrid, direct solvers, iterative solvers, and preconditioners. It offers explicit and implicit integrators, including an interface to the multistep implicit integrator CVODE from the Sundials suite [49].

## Parallel Computing in MBD

### Equation Formulation Aspects.

Broadly speaking, multibody formulations based on the Newton-Euler equations can be categorized into Cartesian (or descriptor form) or recursive formulations. The Cartesian formulation [12,50–52] gained broader acceptance early on as it was used in commercial packages such as ADAMS and DADS. In such an approach, the equations of motion can be formed in a straightforward manner, as illustrated in Sec. 2, simply by writing the free body differential equations for all bodies in the system and append the algebraic (nonlinear) equations associated with each joint and constraint in the system.

A variety of recursive formulations have been proposed beginning with the early 1980s, first in the robotics community for chain manipulators and open-loop systems [53,54] and later on extended to more general multibody systems, including systems with closed-loop kinematic chains. Recursive formulations are inherently O(n) for the inverse dynamics problem, where n represents the number of degrees of freedom. For the more complicated forward dynamics problem, various formulations have been proposed which, based on how they deal with the factorization of the resulting generalized mass matrix, can be classified as O(n3), O(n2), or O(n) and arise successively as a “natural consequence of a progressive exploitation of the structure of the mass matrix” as noted by Jain in Ref. [55]. Jain also showed that, except for minor differences, all linear complexity recursive algorithms, i.e., those that result in explicit derivations of the inverse of the mass matrix [54,56–61], are essentially the same algorithm. The equations of motion, expressed in terms of a minimal set of generalized states, are formed implicitly through successive outward (base-to-tips) and inward (tips-to-base) sweeps of the underlying multibody tree to recursively propagate kinematics and articulated inertias, respectively.

More recent divide and conquer formulations [3,62,63] achieve O(n) complexity for serial implementation but, due to their underlying binary tree structure, have logarithmic complexity when implemented in parallel.

Our goal here is neither to provide an exhaustive review of previous work in this area nor to advocate one choice over another. Instead, we will briefly comment on how they relate to the use of parallel computing in MBD. The differences in terms of type, structure, and size of the resulting equations in the two formulations can be summarized as follows:

Type. For any nontrivial multibody system, the Cartesian formulation always results in a set of differential algebraic equations (DAEs).

On the other hand, recursive formulations result in systems of ordinary differential equations (ODEs) for open-loop systems and minimal-size DAEs for closed-loop system (using, for example, the cut joint approach [64]).

Structure. The Cartesian formulation does not take into account the structure (topology) of the MBD, at least not at the equation formulation phase, but results in a set of equations with a very well defined and, most importantly, simple structure. Addition/removal of bodies or constraints is a trivial task.

On the other hand, recursive formulations effectively impose certain computational paths (namely outward/inward recursive traversals of the topological tree) which result in much smaller but more complex, especially for O(n) formulations, system of equations. Addition or removal of bodies or constraints is a more complex task as it requires a re-evaluation of the system's topology.

Size. There is a significant disparity in the size of the resulting DAE or ODE systems for the two main classes of formulations considered here. For a system consisting of nb bodies and having n degrees of freedom, a Cartesian formulation results in a DAE with 6nb and 6nbn differential and algebraic equations, respectively. If the system does not contain any kinematic closed loops, a recursive formulation leads to a system of n ODEs. However, in the presence of closed loops, the m constraints introduced in the process of extracting the topological spanning tree lead to larger-size (namely 2n + m) DAE systems. Note that if the multibody system contains many loops, such that m = O(n), in addition to the increase in problem size, one must also re-evaluate the algorithmic complexity of recursive formulations since O(n) algorithms actually perform as O(n + nm2).

From the above observations, it becomes apparent that there is significant potential for fine grain parallelism within the Cartesian framework. The fashion in which the equations of motion are generated and assembled, particularly the fact that constraints induced by a joint only involve states associated with the bodies connected by that joint, exposes a large number of independent/decoupled calculations. Hence, large problems posed in the Cartesian framework fit well the SIMD model supported by GPGPU computing. On the other hand, the problem size and the computational flow imposed by recursive formulations favor task parallelism that mirrors the structure of the spanning tree. OpenMP represents a good parallel computing platform with the caveat that attention needs to be paid to load balancing, which is influenced by the spanning tree associated with the system: vastly different branch sizes hurt load balancing as they will require different levels of traversing effort.

### Numerical Solution Aspects.

The numerical solution of dynamics problems involving frictional contact in a DVI formulation (such as that in Eq. (2)) require new numerical approaches as one moves towards very large-scale many-body problems involving hundreds of thousands to millions of bodies and contacts. Classical approaches based on posing and solving linear complementarity problems (LCP) obtained as a result of polyhedral approximations of the friction cone not only suffer from accuracy issues stemming from the induced anisotropies but also lead to an increase in problem size that quickly becomes prohibitive. These limitations can be avoided by transforming the original nonlinear complementarity problem into a cone complementarity problem (CCP) which can be shown [15,16] to represent the first-order optimality conditions of the constrained optimization problem of Eq. (4). Usual approaches to solving this problem rely on fixed-point iterations (Jacobi or Gauss-Seidel) with projection on a convex set; however, since these methods require a large number of iterations when handling large systems of engineering relevance, the CCP solution, which can account for up to two thirds of the total solution time, leads to prohibitively long simulations. Moreover, some of these methods (e.g., Gauss-Seidel) are not amenable to parallel computing. Alternative solvers, such as gradient projected minimum residual (for frictionless problems), preconditioned spectral projected gradient with fallback, and primal-dual interior point methods, have been shown to have improved convergence properties leading to significant speedups [17,18].

The efficiency of MBD simulations is directly influenced by the size of the integration step-size. Large step sizes are desirable especially in many-body problems with frictional contact, regardless of the modeling approach (DVI or DEM), since each time step requires a costly collision detection stage that can involve millions of contact events. As discussed in Sec. 2, the existing integration schemes require typical step sizes of h = 10−4 for a DVI approach and as low as h = 10−6 for DEM. In our implementation, each step of the DVI uses half-implicit numerical integration that calls for the solution of a large optimization problem. It quickly becomes apparent that the promise of large step-sizes associated with fully implicit methods is very attractive, since in this case the step size is unrestricted by stability considerations. However, fully implicit integration schemes lead to nonlinear systems typically solved using Newton-like methods and hence they require (i) the ability to evaluate Jacobian information and (ii) efficient linear solvers. Note that linear solvers are important not only for implicit integration but also for interior point methods used to solve the CCP associated with the problem in Eq. (4). Even though very sparse, the sheer size of the resulting matrices in problems such as granular dynamics or Lagrangian formulations for fluid-solid interaction (e.g., SPH) make direct methods for linear systems less attractive. The alternative is to use matrix-free iterative techniques, such as the Krylov subspace iterative methods, which do not require explicitly forming the system's matrix but only matrix-vector products [36]. Unfortunately, for fast convergence these iterative methods still require the use of adequate preconditioners. While domain-specific preconditioners for MBD problems are still lacking, general-purpose preconditioners are available for all parallel computing platforms discussed herein: PETSc [48] and hypre [65] libraries for MPI computing, Intel's MKL for OpenMP, and CUSP [40] for GPGPU.

### Software Development Aspects.

The parallel computing of mid 1990s relied extensively on posix threads [66]. A very low level approach to parallel computing, posix threads are currently used only when performance arguments trump the development and maintenance effort associated with this parallel programming approach. Today, alternative solutions provide a parallel programming entry point in which the technical effort is focused more on modeling and numerical solution aspects and less on the implementation constructs required by these programming platforms/models.

While it is clear that when measured by the amount of data that needs to be processed during simulation very large models will always fall back on a distributed computing model supported by the MPI standard, simulations that fit on one workstation can be run either with OpenMP or relying on GPGPU computing. Debating which of the two platforms is superior falls outside the scope of this paper. However, for task parallelism, where several independent and sizable tasks need to be performed, OpenMP has intrinsic support that will give it an edge. For instance, for a typical multibody dynamics problem with 300 bodies, 100 joints, and 200 forces, OpenMP parallelizes the evaluation of the Jacobian associated with implicit integration very naturally: one task will be associated with the computation of the contribution of the 300 bodies, another task will be associated with the evaluation of the contribution of the 100 joints, and yet a third task will handle the contribution of the 200 forces. Three OpenMP tasks can be launched, each one of them with an inner loop that can be parallelized, and native dynamic load balancing will distribute the load and keep all cores busy. Here GPU computing is less attractive since the problem and associated coarse grain parallelism does not fit the SIMD model.

For many-body dynamics though, such as granular dynamics simulation, mobility on discrete terrain, fluid-solid interaction using SPH, etc., where there are tens of thousands of data instances to be processed, GPGPU is very attractive. Collision detection, evaluating the force at each contact point, or advancing the state of the system through explicit integration are computations that map well on GPUs owing to the SIMD nature of the job at hand. It is this type of fine grain parallelism; i.e., very large number of identical operations that need to be performed on a large number of items, that is the forte of GPGPU computing.

There is a gray area with no clear answer to whether OpenMP or GPU computing has the upper hand. In this case we implemented a coding style that allowed us to experiment with either platform by setting an environment variable at compile time. To support this, the code will have to be modified in a very limited and localized fashion. Specifically, both for data structures and function definitions a different syntax is needed depending on whether GPU or CPU computing is used. As an example, consider the code in the following listings. The purpose of this code is to update the velocity of each body given the force acting on it. If GPU computing is used, this should be performed with a kernel with one thread per body. If CPU computing is used, the velocity update would be achieved by looping over all bodies in the system using OpenMP through the inclusion of a simple #pragma omp parallel for.

First, a function is written to perform the desired operation. In this case, the function should increment the velocity of a given body based on the mass, inertia, force, torque, and time step. Note that this function is defined with _ _host_ _ _ _device_ _, which means it is compiled to be executed on both the CPU (host) and GPU (device).

The above function, in which the value of the step size h has already been factored in the inverse of the mass matrix, should be called in different ways if CPU or GPU is used. If the GPU is used, a GPU kernel (shown in the listing below) should call the function. Note the inclusion of _ _global_ _, which signifies a kernel function, and the macro INIT_CHECK_THREAD_BOUNDED, which ensures that the thread IDs executing the kernel are appropriate.

For parallel execution on the CPU, the code should loop over all bodies in the system using an OpenMP for-loop, invoking the single-body update function as in the following listing.

Finally, a compile-time flag, ENABLE_GPU, is used to control which parallel computing model is used. If the flag is defined, the GPU kernel is called; otherwise, the host function is called.

This strategy is used for all code sections in which parallel computing can be leveraged. A single function is written to perform the necessary operations, capable of executing on the CPU and GPU (Listing 1). One wrapper function is written for the GPU, where a kernel calls the function within each thread (Listing 2), while a second wrapper is written for the CPU, where the function is called from within an OpenMP for-loop (Listing 3). Finally, the compile-time flag is used to selectively compile the appropriate wrapper function in the code (Listing 4).

Listing 1

Function to update the velocity of a body

 _ _host_ _ _ _device_ _ voidaddForces(uint& i, real* inv_mass, real3* forces, real3* vel) { vel[i] += forces[i] * inv_mass[i]; }
 _ _host_ _ _ _device_ _ voidaddForces(uint& i, real* inv_mass, real3* forces, real3* vel) { vel[i] += forces[i] * inv_mass[i]; }
Listing 2

GPU kernel to update the velocity of all bodies in the system

Listing 3

Host loop to update the velocity of all bodies in the system

 voidhost_addForces(real* mass, real3* forces, real3* vel) { #pragma omp parallel for for (uint index = 0; index < num_obj; index++) { addForces(index,mass,forces,vel); } }
 voidhost_addForces(real* mass, real3* forces, real3* vel) { #pragma omp parallel for for (uint index = 0; index < num_obj; index++) { addForces(index,mass,forces,vel); } }
Listing 4

Velocity update, selecting between CPU and GPU execution

 # ifdef ENABLE_GPU COPY_TO_CONST_MEM(num_obj); device_addForces <<>>( device_mass_data, device_frc_data, device_vel_data); # else host_addForces( device_mass_data.data ( ), device_frc_data.data ( ), device_vel_data.data ( )); # endif
 # ifdef ENABLE_GPU COPY_TO_CONST_MEM(num_obj); device_addForces <<>>( device_mass_data, device_frc_data, device_vel_data); # else host_addForces( device_mass_data.data ( ), device_frc_data.data ( ), device_vel_data.data ( )); # endif

Note that primitive functions are handled by Thrust directly. For example, sorting a vector can be accomplished by calling thrust::sort(myVec.begin(), myVec.end()); If myVec is a host vector, without any user intervention, Thrust will use OpenMP to sort it. If myVec is a device vector, Thrust will use the GPU.

## Examples: GPGPU, OpenMP, and MPI Parallel Computing in MBD

Three applications are discussed next to showcase the use of the parallel computing techniques discussed above. GPU computing is used in conjunction with the dynamics of a tracked vehicle operating on granular terrain. An OpenMP-enabled granular dynamics simulation is used to identify pattern formation in granular material that is shaken at a critical frequency. Finally, MPI-enabled distributed computing is used to investigate the dynamics of a wheeled vehicle operating on discrete deformable terrain. These models were run in Chrono, an open source parallel simulation framework for computational dynamics [11,67].

### GPU Computing Example: Light Tracked Vehicle Mobility Simulation.

A detailed account of how GPU computing is used to solve multibody dynamics problems can be found in Ref. [68]. The key kernels of the negrut2012GPUgems implementation are listed in the following pseudocode:

 Parallel Kernels for Solution of Dynamics Problem (1) Parallel Collision Detection (2) (Body parallel) Force kernel (3) (Contact parallel) Contact preprocessing kernel (4) Inner Iteration Loop: (1) (Contact parallel) CCP contact kernel (2) (Bilateral-Constraint parallel) CCP constraint kernel (3) (Body parallel) Body velocity update kernel (5) (Body parallel) Time integration kernel
 Parallel Kernels for Solution of Dynamics Problem (1) Parallel Collision Detection (2) (Body parallel) Force kernel (3) (Contact parallel) Contact preprocessing kernel (4) Inner Iteration Loop: (1) (Contact parallel) CCP contact kernel (2) (Bilateral-Constraint parallel) CCP constraint kernel (3) (Body parallel) Body velocity update kernel (5) (Body parallel) Time integration kernel

Using a combination of joints and linear actuators, a tracked vehicle model was simulated navigating over deformable terrain made up of gravel-type granular material. The vehicle is modeled to represent a small, lightweight tracked vehicle much like an autonomous robot.

There are two tracks, each with 61 track shoes (see Fig. 4). Each track shoe is made up of two cylinders and three rectangular plates and has a mass of 0.34 kg. Each shoe is connected to its neighbors using one revolute joint on each side. Within each track there are five rollers, each with a mass of 15 kg, one idler and one sprocket both with a mass of 15 kg. The chassis has a mass of 200 kg and moments of inertia were computed for all parts using a CAD package. The purpose of the rollers is to keep the tracks separated and support the weight of the vehicle as it moves forward. The idler keeps the track tensioned. It is usually modeled with a linear spring/actuator but for the purposes of demonstration it was attached, using a revolute joint, to the vehicle chassis. The sprocket is used to drive the vehicle and is attached, using another revolute joint, to the chassis. Torque is applied to drive the track, with each track driven independently of the other.

Fig. 4
Fig. 4
Close modal

The track for the vehicle was created by first generating a ring of connected track shoes. This ring was dropped onto a sprocket, five rollers, and an idler which was connected to the chassis using a linear spring. The idler was pushed with 2000 N of force until the track was tensioned and the idler had stopped moving. This pretensioned track was then saved to a data file and loaded for the simulation of the complete vehicle.

Figure 5 shows the joint forces for one revolute joint when the tracked vehicle was simulated as it moved over a bed of 84,000 granular particles. The particles were modeled as large pieces of gravel with a radius of 0.075 m and a density of 1900 kg/m3. A torque of 100 Nm was applied to both sets of tracks to move the vehicle. Unlike the case where the vehicle moves on a flat section of ground, the forces experienced by the revolute joints are much noisier. Individual grains slip and stick under the tracks as the vehicle moves causing large vibrations to travel through the shoes. These vibrations would be reduced when modeling a more compliant terrain material that can dissipate energy on contact.

Fig. 5
Fig. 5
Close modal

### OpenMP Example: Pattern Formation in Granular Dynamics.

This simulation was used to study the pattern formation in granular material shaken vertically. These patterns emerge in layers of material that is driven only by floor vibrations and have been studied extensively both in 2D and in 3D [69–72]. This phenomenon is important in a variety of fields, from studying the formation of crystalline structures to processing and handling different types of granular material [73]. Without friction these patterns cannot emerge [74]. The simulation is challenging due to the number of particles involved and the velocity and amplitude of the vibrations of the moving plate.

The simulation was performed on a Intel Xeon E5-2630 with 6 physical cores and 12 virtual cores. The simulation included 60,000 rigid particles, was advanced with a time step of $5×10-4 s$, and was used to investigate the system dynamics for 30 s. An OpenMP parallel version of the accelerated projected gradient descent solver was used with 50 iterations performed at each time step [18]. The parameters for friction, packing ratio, and vibration amplitude and frequency used were as provided in Ref. [72]. A friction value of μ = 0.5 was chosen. The particle size, gravity, and the total number of particles were used to compute the nondimensional frequency f and amplitude Γ of the driving plate. Figure 6 shows a simulation with parameters Γ = 0.4 and f= 0.38. The results obtained are in line with the experimental data, which confirms that this choice of parameters produces a hexagonal pattern in the granular material [72].

Fig. 6
Fig. 6
Close modal

### MPI Example: Vehicle Mobility.

When relying on MPI parallel computing, Chrono divides a simulation domain into a number of subdomains that can each be simulated in parallel on discrete compute nodes. The solver transparently sets up the necessary communication channels between participating compute nodes and carries out the simulation including collision detection in a distributed and parallel fashion [75].

The general algorithmic flow of the domain decomposition approach for DEM simulations is given in the following pseudocode, which specifies the operations performed by a given node simulating a single subdomain. All nodes operate independently, in parallel, subject to communication and synchronization barriers.

 Pseudocode DEM MPI() (1) Identify boundaries of subdomain $Si$ (2) Initialize communication pathways to each subdomain $Sj$ (3) Initialize data structures (4) fort = 0 totend (5) Perform collision detection (6) Compute collision forces (7) Barrier: Intermediate communication of net forces on shared objects (8) Compute net forces (9) Perform integration to update state (10) Barrier: Final synchronization (11) Update set of bodies in subdomain (12) $t←t+h$ (13) endfor
 Pseudocode DEM MPI() (1) Identify boundaries of subdomain $Si$ (2) Initialize communication pathways to each subdomain $Sj$ (3) Initialize data structures (4) fort = 0 totend (5) Perform collision detection (6) Compute collision forces (7) Barrier: Intermediate communication of net forces on shared objects (8) Compute net forces (9) Perform integration to update state (10) Barrier: Final synchronization (11) Update set of bodies in subdomain (12) $t←t+h$ (13) endfor

The example considered represents a replica of the Mars rover Opportunity. Details on the rover model, including inertia and geometry properties, can be found in Ref. [18]. The rover chassis was 100 kg; the mass of each wheel was 3 kg. The wheels were represented as cylinders with radius 10 cm and width 12.5 cm. A set of 30 grousers distributed evenly around the circumference of each wheel extend across the wheel's width. Each grouser has a thickness of 0.5 cm and extends away from the wheel by 0.5 cm. The terrain was composed of 2,016,000 rigid spherical bodies with radius 5 mm and mass 4.5 g. The elements in the terrain were initialized in random positions above the simulation domain and allowed to settle under gravity. The simulation domain was rectangular, with length of 3 m in the driving direction and width of 1.5 m. All six wheels of the rover were driven with constant angular velocity of π rad/s, or 30 rpm. The integration time step was 10−5 s and 6 s of dynamics were simulated, requiring about 298 h of computation time. The DEM stiffness and damping parameters had the values $kn=kt=2×105$, γn = 750, and γt = 0. The simulation domain was partitioned into 64 subdomains by dividing both the length and the width evenly into 8 subdomains in each direction. The resulting subdomains were rectangular, with length 0.375 m and width 0.1875 m. The depth direction was not divided.

A snapshot from the simulation can be seen in Fig. 7. The terrain particles are colored by subdomain and shared bodies are colored white. Further, the wheels of the rover are checkered with the color of the master subdomain. In Fig. 7, the wheels are pink, which indicates that the pink sub-domain has the master copy of the entire rover vehicle. Figure 8 shows the “footprint” of the rover. The terrain particles are colored uniformly, hiding the subdomain division, and the rover and its wheels are not shown. The amount of simulation time is still prohibitively large for this problem, which aims at understanding how different vehicle topologies influence the dynamics of the rover. By orders of magnitude, the simulation bottleneck is associated with the data communication required by an exchange protocol that handled the cross-boundary motion and interaction of bodies.

Fig. 7
Fig. 7
Close modal
Fig. 8
Fig. 8
Close modal

## Closing Remarks

Once the privilege of national research labs or supercomputing centers, parallel computing is now ubiquitous. In fact, it is virtually impossible today to purchase a workstation or laptop that does not have at least four cores that can support parallel computing. These two observations frame the answer to the why in the title of this contribution: sequential computing promises only modest gains in efficiency, insufficient for fueling the evolution of the MBD field in the next decade. The continuous drive in MBD towards multiphysics problems yielding ever larger systems of equations provides the backdrop for the when in the title of the paper. Fluid-structure interaction, granular dynamics, and vehicle mobility on deformable discrete terrain represent example applications that will stand to benefit from the interplay between MBD and parallel computing. Finally, using as a justification several applications in which we have relied on different parallel computing platforms, we discussed aspects related to the modeling, numerical solution, and software implementation aspects of MBD simulation; i.e., we addressed the how aspect alluded to in the title of the contribution. Several comments made at various points in this paper are summarized as follows:

Parallel computing platform. Large models involving an amount of data that exceeds the memory available on a workstation require a distributed computing model facilitated by MPI parallelism. As discussed in Sec. 5.3, MPI incurs high communications cost that presently can limit duration or size of the problem being analyzed. This is particularly the case if the simulation proceeds at small step-sizes and data communication dwarfs any other component of the numerical solution. If the problem is small enough—tens to hundreds of GB on the CPU and up to 12 GB on the GPU—the approach described in Sec. 4.3 allows one to experiment with OpenMP or GPU computing and make a choice on a case-by-case basis. GPGPU computing is attractive for fine level granularity when the computational task fits the SIMD model. Problems characterized by task parallelism are ideally suited for OpenMP or related approaches (cilk, TBB).

Equation formulation. Recursive formulations do not have a level of granularity that fits well the SIMD model but are well suited for task-level parallelization. The finer degree of granularity associated with the Cartesian formulation makes it ideal for GPGPU computing.

Numerical solution. Minimizing time to solution may require adoption of numerical solution techniques different from those used for sequential calculations. This might require embracing strategies that are manifestly suboptimal in a sequential computing framework but which expose a high degree of parallelism and come with low synchronization demands. Moreover, effective parallel computing requires a shift in how we think about writing software. Rather than organizing data structures for the convenience of the programmer by reflecting the domain specific organization of a problem (bodies, joints, forces, etc.), the data structures should be organized to reflect the hardware constraints of the underlying architecture, which most often prefers data to be organized in long contiguous arrays that are suitable for parallel processing and ideal for vectorization. The latter approach though renders the software hard to design, debug, and maintain as it loses the connection to the domain problem that it solves.

We conclude with a few recommendations that coalesced during several years of experience with using parallel computing in MBD simulation.

1. (1)

The most important observation is that number crunching is basically free, while data movement/memory transactions are one to two orders of magnitude more expensive. For memory transactions, the most important aspect is memory access patterns. For GPGPU computing uncoalesced memory accesses reduce bandwidth, while for OpenMP violating spatial and temporal locality in data accesses increase the likelihood of cache misses. The recommendation is at the software design stage to organize model data in structures that are poised to minimize the amount of traffic to main memory or intercore communication (see the “Numerical solution” remark above).

2. (2)

If available, use domain libraries: the solution is already there, bug-free and optimized. Reinventing the wheel may be useful for learning purposes but is unlikely to lead to performance improvements.

3. (3)

Debugging parallel code is challenging. Although involving a steep learning curve, using debuggers is the only way to access program state information that cannot be replicated by rudimentary debugging techniques (such as using print statements). NVIDIA's cuda-gdb for GPU computing, Intel's Parallel Studio Suite for OpenMP computing, and Intel's Cluster Studio for MPI computing represent debugging solutions widely used in the developer community.

4. (4)

Use dedicated profiling tools as the only way to conclusively find the hot spots and bottlenecks in a parallel program and then direct the optimization effort where it matter most. In this context, the recommendation is to always keep in mind Amdahl's law [77] and parallelize the portions of the code that are the actual computational bottlenecks. Finally, recall that premature optimization is the root of all evil [78].

## Acknowledgment

The hardware used to produce the results reported in this paper was procured through the Army Research Office instrumentation Grant No. W911NF-11-1-0327. We thank NVIDIA for sponsoring our research programs in the area of high performance computing. We are grateful to Alex Dirr, Prateek Gupta, and Arindam Sinha for their feedback on a draft of this document.

2

We use GB to denote gigabytes and Gb for gigabits. Eight Gb make one GB.

## References

1.
MSC.Software
,
2014
, “
,” available online at http://www.mscsoftware.com
2.
,
J.
,
Cardenal
,
J.
,
Morer
,
P.
, and
Bayo
,
E.
,
2000
, “
Intelligent Simulation of Multibody Dynamics: Space-State and Descriptor Methods in Sequential and Parallel Computing Environments
,”
Multibody Syst. Dyn.
,
4
(
1
), pp.
55
73
.10.1023/A:1009824327480
3.
Featherstone
,
R.
,
1999
, “
A Divide-and-Conquer Articulated-Body Algorithm for Parallel O(log(n)) Calculation of Rigid Body Dynamics. Part 2: Trees, Loops and Accuracy
,”
Int. J. Robot. Res.
,
18
(
3
), pp.
876
892
.10.1177/02783649922066628
4.
Anderson
,
K. S.
, and
Duan
,
S.
,
2000
, “
Highly Parallelizable Low Order Dynamics Algorithm for Complex Multi-Rigid-Body Systems
,”
AIAA J. Guidance, Control Dyn.
,
23
(
2
), pp.
355
364
.10.2514/2.4531
5.
Mraz
,
L.
, and
Valasek
,
M.
,
2013
, “
Solution of Three Key Problems for Massive Parallelization of Multibody Dynamics
,”
Multibody Syst. Dyn.
,
29
(
1
), pp.
21
39
.10.1007/s11044-012-9311-1
6.
González
,
F.
,
Luaces
,
A.
,
Lugrís
,
U.
, and
González
,
M.
,
2009
, “
Non-Intrusive Parallelization of Multibody System Dynamic Simulations
,”
Comput. Mech.
,
44
(
4
), pp.
493
504
.10.1007/s00466-009-0386-3
7.
Bauchau
,
O. A.
,
2010
, “
Parallel Computation Approaches for Flexible Multibody Dynamics Simulations
,”
J. Frank. Inst.
,
347
(
1
), pp.
53
68
.10.1016/j.jfranklin.2009.10.001
8.
Quaranta
,
G.
,
Masarati
,
P.
, and
Mantegazza
,
P.
,
2002
, “
Multibody Analysis of Controlled Aeroelastic Systems on Parallel Computers
,”
Multibody Syst. Dyn.
,
8
(
1
), pp.
71
102
.10.1023/A:1015894729968
9.
Iglberger
,
K.
, and
Rüde
,
U.
,
2009
, “
Massively Parallel Rigid Body Dynamics Simulations
,”
Comput. Sci.-Res. Develop.
,
23
(
3–4
), pp.
159
167
.10.1007/s00450-009-0066-8
10.
Negrut
,
D.
,
Tasora
,
A.
,
Mazhar
,
H.
,
Heyn
,
T.
, and
Hahn
,
P.
,
2012
, “
Leveraging Parallel Computing in Multibody Dynamics
,”
Multibody Syst. Dyn.
,
27
, pp.
95
117
.10.1007/s11044-011-9262-y
11.
Mazhar
,
H.
,
Heyn
,
T.
,
Pazouki
,
A.
,
Melanz
,
D.
,
Seidl
,
A.
,
Bartholomew
,
A.
,
Tasora
,
A.
, and
Negrut
,
D.
,
2013
, “
Chrono: A Parallel Multi-Physics Library for Rigid-Body, Flexible-Body, and Fluid Dynamics
,”
Mech. Sci.
,
4
(
1
), pp.
49
64
.10.5194/ms-4-49-2013
12.
Haug
,
E. J.
,
1989
,
Computer-Aided Kinematics and Dynamics of Mechanical Systems Volume-I
,
Prentice-Hall
,
Englewood Cliffs, New Jersey
.
13.
Stewart
,
D. E.
, and
Trinkle
,
J. C.
,
1996
, “
An Implicit Time-Stepping Scheme for Rigid-Body Dynamics With Inelastic Collisions and Coulomb Friction
,”
Int. J. Numer. Methods Eng.
,
39
, pp.
2673
2691
.10.1002/(SICI)1097-0207(19960815)39:15<2673::AID-NME972>3.0.CO;2-I
14.
Pang
,
J. S.
, and
Stewart
,
D. E.
,
2008
, “
Differential Variational Inequalities
,”
Math. Programm.
,
113
, pp.
1
80
.10.1007/s10107-006-0052-x
15.
Anitescu
,
M.
, and
Hart
,
G. D.
,
2004
, “
A Constraint-Stabilized Time-Stepping Approach for Rigid Multibody Dynamics With Joints, Contact and Friction
,”
Int. J. Numer. Methods Eng.
,
60
(
14
), pp.
2335
2371
.10.1002/nme.1047
16.
Tasora
,
A.
, and
Anitescu
,
M.
,
2011
, “
A Matrix-Free Cone Complementarity Approach for Solving Large-Scale, Nonsmooth, Rigid Body Dynamics
,”
Comput. Methods Appl. Mech. Eng.
,
200
(
5–8
), pp.
439
453
.10.1016/j.cma.2010.06.030
17.
Heyn
,
T.
,
Anitescu
,
M.
,
Tasora
,
A.
, and
Negrut
,
D.
,
2013
, “
Using Krylov Subspace and Spectral Methods for Solving Complementarity Problems in Many-Body Contact Dynamics Simulation
,”
Int. J. Numer. Methods Eng.
,
95
(
7
), pp.
541
561
.10.1002/nme.4513
18.
Heyn
,
T.
,
2013
, “
On the Modeling, Simulation, and Visualization of Many-Body Dynamics Problems With Friction and Contact
,” Ph.D thesis, Department of Mechanical Engineering, University of Wisconsin–Madison, http://sbel.wisc.edu/documents/TobyHeynThesis_PhDfinal.pdf
19.
Khulief
,
Y. A.
,
2013
, “
Modeling of Impact in Multibody Systems: An Overview
,”
J. Comput. Nonlinear Dyn.
,
8
, p.
021012
.10.1115/1.4006202
20.
Lucy
,
L. B.
,
1977
, “
A Numerical Approach to the Testing of the Fission Hypothesis
,”
Astronom. J.
,
82
, pp.
1013
1024
.10.1086/112164
21.
Gingold
,
R. A.
, and
Monaghan
,
J. J.
,
1977
, “
Smoothed Particle Hydrodynamics-Theory and Application to Non-Spherical Stars
,”
Monthly Notices Roy. Astronom. Soc.
,
181
(
1
), pp.
375
389
.
22.
Monaghan
,
J. J.
,
2005
, “
Smoothed Particle Hydrodynamics
,”
Rep. Prog. Phys.
,
68
(
1
), pp.
1703
1759
.10.1088/0034-4885/68/8/R01
23.
Monaghan
,
J. J.
,
1989
, “
On the Problem of Penetration in Particle Methods
,”
J. Comput. Phys.
,
82
(
1
), pp.
1
15
.10.1016/0021-9991(89)90032-6
24.
Dilts
,
G.
,
1999
, “
Moving-Least-Squares-Particle Hydrodynamics–I. Consistency and Stability
,”
Int. J. Numer. Methods Eng.
,
44
(
8
), pp.
1115
1155
.10.1002/(SICI)1097-0207(19990320)44:8<1115::AID-NME547>3.0.CO;2-L
25.
Koshizuka
,
S.
,
Nobe
,
A.
, and
Oka
,
Y.
,
1998
, “
Numerical Analysis of Breaking Waves Using the Moving Particle Semi-Implicit Method
,”
Int. J. Numer. Methods Fluids
,
26
(
7
), pp.
751
769
.10.1002/(SICI)1097-0363(19980415)26:7<751::AID-FLD671>3.0.CO;2-C
26.
Dalrymple
,
R. A.
, and
Rogers
,
B. D.
,
2006
, “
Numerical Modeling of Water Waves With the SPH Method
,”
Coastal Eng.
,
53
(
2
), pp.
141
147
.10.1016/j.coastaleng.2005.10.004
27.
Pazouki
,
A.
,
Mazhar
,
H.
, and
Negrut
,
D.
,
2012
, “
Parallel Collision Detection of Ellipsoids With Applications in Large Scale Multibody Dynamics
,”
Math. Comput. Simul.
,
82
(
5
), pp.
879
894
.10.1016/j.matcom.2011.11.005
28.
Pazouki
,
A.
, and
Negrut
,
D.
,
2012
, “
Direct Simulation of Lateral Migration of Buoyant Particles in Channel Flow Using GPU Computing
,”
Proceedings of the 32nd Computers and Information in Engineering Conference, CIE32
, August 12-15, Chicago, IL, USA, American Society of Mechanical Engineers.
29.
Flynn
,
M. J.
,
1972
, “
Some Computer Organizations and Their Effectiveness
,”
IEEE Trans. Comput.
,
100
(
9
), pp.
948
960
.10.1109/TC.1972.5009071
30.
Gropp
,
W.
,
Lusk
,
E.
, and
Skjellum
,
A.
,
1999
,
Using MPI: Portable Parallel Programming with the Message-Passing Interface
, 2nd ed.,
MIT Press
Cambridge, MA.
31.
OpenMP
,
2013
, “
Specification Standard 4.0
,” available online at http://openmp.org/wp/
32.
Patterson
,
D. A.
, and
Hennessy
,
J. L.
,
2011
,
Computer Organization and Design: The Hardware/Software Interface
, 4th ed.,
Morgan Kaufmann
, Burlington, MA.
33.
Dennard
,
R.
,
Gaensslen
,
F.
,
Rideout
,
L.
,
Bassous
,
E.
, and
LeBlanc
,
A.
,
1974
, “
Design of Ion-Implanted MOSFET's With Very Small Physical Dimensions
,”
IEEE J. Solid-State Circuits
,
9
(
5
), pp.
256
268
.10.1109/JSSC.1974.1050511
34.
Munshi
,
A.
,
Gaster
,
B.
,
Mattson
,
T.
, and
Ginsburg
,
D.
,
2011
,
OpenCL Programming Guide
,
.
35.
NVIDIA
,
2014
, “
CUDA Programming Guide
,” available online at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
36.
Khude
,
N.
,
Stanciulescu
,
I.
,
Melanz
,
D.
, and
Negrut
,
D.
,
2013
, “
Efficient Parallel Simulation of Large Flexible Body Systems With Multiple Contacts
,”
ASME J. Comput. Nonlinear Dyn.
,
8
(
4
), p. 041003.10.1115/1.4023915
37.
Gropp
,
W.
,
2002
, “
MPICH2: A New Start for MPI Implementations
,”
Recent Advances in Parallel Virtual Machine and Message Passing Interface
,
Springer
,
New York
, p.
7
.
38.
Graham
,
R.
,
Woodall
,
T.
, and
Squyres
,
J.
,
2006
, “
Open MPI: A Flexible High Performance MPI
,”
Parallel Processing and Applied Mathematics
,
Springer
,
New York
, pp.
228
239
.
39.
MVAPICH2
,
2013
.
MVAPICH: MPI over InfiniBand
, 10GigE/iWARP and RoCE. http://mvapich.cse.ohio-state.edu/performance/mvapich2/interNode.shtml.
40.
Bell
,
N.
, and
Garland
,
M.
,
2012
, CUSP: Generic Parallel Algorithms for Sparse Matrix and Graph Computations. Version 0.3.0.
41.
Hoberock
,
J.
, and
Bell
,
N.
,
2010
, Thrust: A Parallel Template Library. Version 1.7.0.
42.
Open MPI
,
2013
, “
A High Performance Message Passing Library
,” http://www.open-mpi.org/
43.
MPICH2
,
2013
, “
High Performance Portable MPI
,” http://http://www.mpich.org/
44.
Plimpton
,
S.
,
1995
, “
Fast Parallel Algorithms for Short-Range Molecular Dynamics
,”
J. Comput. Phys.
,
117
(
1
), pp.
1
19
.10.1006/jcph.1995.1039
45.
LAMMPS
,
2013
, “
A Molecular Dynamics Simulator
,” http://lammps.sandia.gov/
46.
Parks
,
M.
,
Lehoucq
,
R.
,
Plimpton
,
S.
, and
Silling
,
S.
,
2008
, “
Implementing Peridynamics Within a Molecular Dynamics Code
,”
Comput. Phys. Commun.
,
179
(
11
), pp.
777
783
.10.1016/j.cpc.2008.06.011
47.
LIGGGHTS
,
2013
, “
Open Source Discrete Element Method Particle Simulation Code
,” http://cfdem.dcs-computing.com/?q=OpenSourceDEM
48.
Balay
,
S.
,
Brown
,
J.
,
Buschelman
,
K.
,
Eijkhout
, V
.
,
Gropp
,
W.
,
Kaushik
,
D.
,
Knepley
,
M.
,
McInnes
,
L. C.
,
Smith
,
B.
, and
Zhang
,
H.
,
2012
, “
PETSc Users Manual Revision 3.3
,”
http://
www.mcs.anl.gov/petsc/petsc-3.3/docs/manual.pdf
49.
Hindmarsh
,
A.
,
Brown
,
P.
,
Grant
,
K.
,
Lee
,
S.
,
Serban
,
R.
,
Shumaker
,
D.
, and
Woodward
,
C.
,
2005
, “
SUNDIALS: Suite of Nonlinear and Differential/Algebraic Equation Solvers
,”
ACM Trans. Math. Softw. (TOMS)
,
31
(
3
), pp.
363
396
.10.1145/1089014.1089020
50.
Orlandea
,
N.
,
Chace
,
M. A.
, and
Calahan
,
D. A.
,
1977
, “
A Sparsity-Oriented Approach to the Dynamic Analysis and Design of Mechanical Systems—Part I and Part II
,”
Trans. ASME J. Eng. Indus.
, pp.
773
784
.
51.
Shabana
,
A.
,
1989
,
Multibody Systems
,
John Wiley and Sons
,
New York
.
52.
Shabana
,
A. A.
,
2005
,
Dynamics of Multibody Systems
, 3rd ed.,
Cambridge University Press
Cambridge, UK.
53.
Armstrong
,
W. W.
,
1979
, “
Recursive Solution to the Equations of Motion of an N-Link Manipulator
,”
Proceedings of the 5th World Congress on Theory of Machines and Mechanism
, Vol.
2
, pp.
1343
1346
.
54.
Featherstone
,
R.
,
1983
, “
The Calculation of Robot Dynamics Using Articulated-Body Inertias
,”
Int. J. Robot. Res.
,
2
(
1
), pp.
13
30
.10.1177/027836498300200102
55.
Jain
,
A.
,
1991
, “
Unified Formulation of Dynamics for Serial Rigid Multibody Systems
,”
J. Guidance
,
14
(
3
), pp.
531
542
.10.2514/3.20672
56.
Brandl
,
H.
,
Johanni
,
R.
, and
Otter
,
M.
,
1986
, “
A Very Efficient Algorithm for the Simulation of Robots and Similar Multibody Systems Without Inversion of the Mass Matrix
,” In
IFAC/IFIP/IMACS Symposium (lst)
, Vienna, Austria, pp.
95
100
.
57.
Bae
,
D.
, and
Haug
,
E.
,
1987
, “
A Recursive Formulation for Constrained Mechanical System Dynamics: Part I. Open Loop Systems
,”
Mech. Struct. Mach.
,
15
(
3
), pp.
359
382
.10.1080/08905458708905124
58.
Bae
,
D.
, and
Haug
,
E.
,
1987
, “
A Recursive Formulation for Constrained Mechanical System Dynamics: Part II. Closed Loop Systems
,”
Mech. Struct. Mach.
,
15
(
4
), pp.
481
506
.10.1080/08905458708905130
59.
Rosenthal
,
D.
,
1987
, “
Order N Formulation for Equations of Motion of Multibody Systems
,”
SDIO/NASA Workshop on Multibody Simulations
, JPL Pub. D-5190, Jet Propulsion Lab., Pasadena, CA, pp.
1122
1150
.
60.
Rodriguez
,
G.
,
1987
, “
Kalman Filtering, Smoothing and Recursive Robot Arm Forward and Inverse Dynamics
,”
IEEE J. Robot. Autom.
,
3
(
6
), pp.
624
639
.10.1109/JRA.1987.1087147
61.
Rodriguez
,
G.
,
Jain
,
A.
, and
,
K.
,
1992
, “
Spatial Operator Algebra for Multibody System Dynamics
,”
J. Astronaut. Sci.
,
40
(
1
), pp.
27
50
.10.1.1.143.2314
62.
Fijany
,
A.
,
Shraf
,
I.
, and
D'Eleuterio
,
G.
,
1995
, “
Parallel O(log N) Computation of Manipulator Forward Dynamics
,”
IEEE J. Robot. Autom.
,
11
(
3
), pp.
389
400
.10.1109/70.388780
63.
Mukherjee
,
R.
, and
Anderson
,
K.
,
2007
, “
Orthogonal Complement Based Divide-and-Conquer Algorithm for Constrained Multibody Systems
,”
Nonlinear Dyn.
,
48
, pp.
199
215
.10.1007/s11071-006-9083-3
64.
Wittenburg
,
J.
,
1977
,
Dynamics of Systems of Rigid Bodies
,
B. G.
Teubner
, ed.,
Stuttgart
, Germany.
65.
Falgout
,
R.
, and
Yang
,
U. M.
,
2002
, “
Hypre: A Library of High Performance Preconditioners
,”
Preconditioners, Lecture Notes in Computer Science
, pp.
632
641
.
66.
Open Group
,
2004
, “
IEEE Std. 1003.1
,” available online at http://www.unix.org/version3/ieee_std.html
67.
SBEL
,
2013
, “
,” Simulation-Based Engineering Laboratory, University of Wisconsin-Madison. http://sbel.wisc.edu/Software
68.
Negrut
,
D.
,
Tasora
,
A.
,
Anitescu
,
M.
,
Mazhar
,
H.
,
Heyn
,
T.
, and
Pazouki
,
A.
,
2011
, “
Solving Large Multi-Body Dynamics Problems on the GPU
,”
GPU Gems
Vol.
4
, pp.
269
280
.
69.
Melo
,
F.
,
Umbanhowar
,
P.
, and
Swinney
,
H. L.
,
1994
, “
Transition to Parametric Wave Patterns in a Vertically Oscillated Granular Layer
,”
Phys. Rev. Lett.
,
72
, pp.
172
175
.10.1103/PhysRevLett.72.172
70.
Umbanhowar
,
P. B.
,
Melo
,
F.
, and
Swinney
,
H. L.
,
1996
, “
Localized Excitations in a Vertically Vibrated Granular Layer
,”
Nature
,
382
(
6594
), pp.
793
796
.10.1038/382793a0
71.
Luding
,
S.
,
Clément
,
E.
,
Rajchenbach
,
J.
, and
Duran
,
J.
,
1996
, “
Simulations of Pattern Formation in Vibrated Granular Media
,”
EPL (Europhysics Letters)
,
36
(
4
), p.
247
.10.1209/epl/i1996-00217-9
72.
Bizon
,
C.
,
Shattuck
,
M. D.
,
Swift
,
J. B.
,
McCormick
,
W. D.
, and
Swinney
,
H. L.
,
1998
, “
Patterns in 3D Vertically Oscillated Granular Layers: Simulation and Experiment
,”
Phys. Rev. Lett.
,
80
(
1
), pp.
57
60
.10.1103/PhysRevLett.80.57
73.
Jaeger
,
H. M.
,
Nagel
,
S. R.
, and
Behringer
,
R. P.
,
1996
, “
Granular Solids, Liquids, and Gases
,”
Rev. Mod. Phys.
,
68
, pp.
1259
1273
.10.1103/RevModPhys.68.1259
74.
Moon
,
S. J.
,
Swift
,
J. B.
, and
Swinney
,
H. L.
,
2004
, “
Role of Friction in Pattern Formation in Oscillated Granular Layers
,”
Phys. Rev. E
,
69
, p.
031301
.10.1103/PhysRevE.69.031301
75.
Mazhar
,
H.
,
Heyn
,
T.
, and
Negrut
,
D.
,
2011
, “
A Scalable Parallel Method for Large Collision Detection Problems
,”
Multibody Syst. Dyn.
,
26
, pp.
37
55
.10.1007/s11044-011-9246-y
76.
SBEL
,
2013
, “
Movies, Physics-Based Modeling and Simulation
,” http://sbel.wisc.edu/Animations
77.
Amdahl
,
G.
,
1967
, “
Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities
,”
AFIPS Conf. Proc.
,
30
, pp.
483
485
.10.1145/1465482.1465560
78.
Knuth
,
D.
,
1974
, “
Structured Programming With go to Statements
,”
ACM Comput. Surv. (CSUR)
,
6
(
4
), pp.
261
301
.10.1145/356635.356640