Many scientific and engineering problems are solved by running simulations of computationally intensive mathematical models over massive design spaces. As a result, parametric studies of these models are cost prohibitive in terms of computational time. The focus of this work is a particular kind of parameter sweep problem in which the simulation of the model, given real-valued parameters, produces a binary result. The goal of this work is to design, develop, and implement a parallel algorithm for bounding a binary objective when the underlying simulation is a computationally intensive mathematical model. A fully functioning implementation is provided for a two-dimensional example using a client-server architecture. Results show that a straightforward bisection search is approximately 50% faster than a full parametric sweep for most continuous functions. With parallelization and load balancing, the analysis is dramatically faster, exhibiting near-linear speedup up to 16 processors for most functions.

## Introduction

Engineering analysis is often concerned with identifying conditions responsible for catastrophic failure. For example, consider the load required to buckle a column. Using simple material relations and static analysis, an estimate can be obtained for a range of column size and load. However, for complex systems, the prediction of failure can be difficult due to the interaction of many forces and the confounding interaction of many parameters. In these cases, detailed models that govern the fundamental behavior of systems are designed to predict the performance of systems immediately prior to failure. Depending on the application, models can be computationally expensive, and exploration of large design spaces can be cost prohibitive, yet engineers often want to know where in the design space a system will fail.

The calculation of microelectronic device failure due to a single energetic particle strike (1, 2) is an example of an application that requires significant computational time for a single design point. Failure of a device is determined by the particle energy, biasing conditions, and strike location and direction, among other parameters. Simulations for a single design point require hours of computer time. As a result, locating all combinations of bias, particle energy, and direction that result in failure would require years of computer time.

Another example that does not deal directly with failure but is a similar class of engineering problem is ignition prediction. In diesel engines, ignition is dependent on a variety of parameters such as temperature, pressure, crank angle, cylinder charge, and so forth. To predict whether ignition will occur, a detailed chemistry and thermodynamic simulation must be performed, allowing the states to evolve in time as the gas is compressed by the cylinder (3). Failure in this context would actually be ignition, but the results of the simulation indicate whether the fuel mixture burns or not. These models require expensive computations.

Contrast the foregoing problem with discrete optimization problems, where the design variables are discrete. In general, discrete optimization problems do not demand a discrete objective. One notable example is the knapsack problem (4), a simple instance of discrete optimization in which the objective, the number of items in the knapsack, happens to be discrete. Nevertheless, discrete optimization methods in general focus on the fact that the design variables are discrete and are less concerned with the continuity of the objective. In our case, the design variables are continuous and the objective is discrete.

Not only is our objective discrete, but it is also binary. However, our problem is not a pseudo-Boolean optimization (also related to binary optimization or zero-one integer programming) either (5). These techniques are special cases of discrete optimization when the design variable is binary. Again, in general, the objective remains continuous in these classes of problems.

Due to the lack of information from a function evaluation, binary objective problems are amenable only to direct search methods much like those problems without numerical objectives (6). Problems without numerical objectives still usually have a “better” or “worse” designation between any two points even though the objective is not quantified. This qualitative evaluation still gives a search direction. Our binary objective is similar except that we do not have enough information to determine a direction to proceed. We assume that gradients are nonexistent or meaningless except across the interface where the objective changes by a unit step. Nevertheless, direct search methods can be applied to these problems.

The binary objective problem somewhat resembles that of an edge detection problem in image processing (7). However, traditional approaches still rely on higher order derivatives to smooth the image and locate where in an image the intensity changes dramatically. The Canny approach (8) assumes that noise is inherent to the data and that all function evaluations have already been acquired (pixel color). These two assumptions are contrary to our problem and make the standard edge detection algorithms impractical for our approach. Moreover, the type of edge being detected is usually quite different from our design space surface. Nevertheless, the interpolation line segment approach used by some edge detection routines (9) is partially responsible for the interpolation scheme used in our approach to speed up serial analysis. Finally, edge detection problems do not scale to multiple dimensions, making these approaches only peripherally related and of limited importance to the present work.

In general, optimization problems are amenable to parallelization. Byrd et al. (10) suggested that either the function evaluations can be performed concurrently or that a sequential optimization can be used with a parallel forward model. Eldred et al. (11) further developed a taxonomy for levels of parallelization that, when used properly, provides maximum use of available resources. In fact, the DAKOTA project at Sandia National Laboratories was developed to exploit the multilevel parallelism of most optimization problems (12). In the present context, we assume that the parallelization of the forward model is limited and that more resources exist than can be used by the forward model. Therefore, we are interested solely in the parallelization of the optimization algorithm.

The vast majority of the research on parallelization of optimization algorithms is in the area of smooth nonlinear functions (13). In these cases, gradients are usually used to provide a new search direction. If analytic gradients are not available, they can be approximated with finite differences. In terms of parallelization, each gradient (or Hessian) requires additional function evaluations that can be performed concurrently. Therefore, the optimal speedup is equivalent to the number of function evaluations required for each iteration (14). This rule of thumb can be improved slightly with more modern quasi-Newton methods (15,16) and using additional information from otherwise unutilized processors to augment the optimization (14). Nevertheless, the number of processors that can be used is limited by the dimensionality of the design space. Consequently, for the most practical optimization problems, the amount of parallelization for practical gradient-based optimization algorithms is not massive.

Nongradient-based optimizations, however, stand to see greater gains than their gradient-based counterparts. Chazan and Miranker (17) showed that an order advantage exists between sequential search algorithms and parallelized search algorithms for quadratic functions. Sutti (18) later extended the methods by performing asynchronous evaluations, which simply means that the order of the linearly independent search directions $s$ is immaterial. Nevertheless, these methods depend on a smooth convex function, which is not representative of our binary function.

The binary objective problems described here can be treated as trivially parallel in that the results of each design point are independent of the results from other design points. Therefore, high-performance solutions can greatly reduce the total amount of time an engineer might have to wait for the results. The simplest approach would be to discretize the design space and evaluate each point on a different processor. However, this approach does not limit the search in any way and can also become computationally prohibitive for multidimensional searches or when a limited number of processors is available. In general, the brute force approach scales as $N^p$, where $N$ is the number of parameter values to be tested for a single parameter and $p$ is the number of parameters to be tested.

The present work focuses on a parallelized and load balanced bisection of the design space to locate a sudden transition at the discontinuity in the objective. We define the mathematical formulation of the problem and solution approach, explore how the solution can be improved serially, parallelize a basic bisection algorithm, and add load balancing. Performance of the serial solution is measured by the number of function evaluations required to obtain the threshold in the objective. Performance of the parallel algorithms is indicated by the speedup.

## Mathematical Model and Solution Approach

The mathematical model of the problem is described in terms of adjustable parameters that govern the physics, represented by real variables $z_k$ for all $k \in [1, 2, \ldots, n]$. The parameter space $P$ is then the set of all possible design points, and $f(z_1, z_2, \ldots, z_n) \to \{0, 1\}$ is a mapping from $P$ to a binary variable. The problem can be defined as follows: find the sets $V_0 \subseteq P$ such that $f(z_1, z_2, \ldots, z_n) \to 0$ and $V_1 \subseteq P$ such that $f(z_1, z_2, \ldots, z_n) \to 1$.

To make the problem more tractable, it is assumed that the boundary between the sets can be represented as a continuous $(n-1)$-dimensional surface $S$ in $P$. Because of this simplifying assumption, the solution may be represented by this surface and a binary value that indicates whether the value 1 lies above or below the surface.

In terms of computational complexity theory (19), the problem is stated as finding the region of the design space that results in failure $(V_0)$ or success $(V_1)$ to within a particular tolerance, such that the surface $S$ is nowhere further from the closest points in either set $V_0$ or $V_1$ than the prescribed tolerance. If the continuous design space is discretized using the tolerance, then a naive algorithm would be to evaluate the function $f$ for each discretized design point in the domain $P$. This algorithm is $O(n^d)$, where $n$ is the number of design points to test in each of the $d$ directions (assuming the tolerance in each direction is specified to yield the same number of discretized points). Although we expect to be able to do better, we are not trying to establish a lower bound on the complexity of the problem. Instead, we simply wish to improve on the $O(n^d)$ scaling.
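To make the $O(n^d)$ cost concrete, a back-of-the-envelope comparison of evaluation counts is shown below for a two-dimensional case with a bisection per column. The grid sizes are illustrative choices, not values taken from the experiments.

```python
import math

# Illustrative grid sizes for d = 2 (assumed, not from the experiments):
n = 1000                                  # points implied by the y tolerance
cols = 100                                # points implied by the x tolerance

naive = cols * n                          # O(n^d): every grid point simulated
bisect = cols * math.ceil(math.log2(n))   # one O(log n) bisection per column

print(naive, bisect)                      # 100000 1000
```

Even in two dimensions, replacing the sweep in one direction with a bisection cuts the work by roughly two orders of magnitude for these grid sizes.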

To illustrate the solution process and provide metrics for evaluating the efficacy of the algorithm, a two-dimensional design space $(d = 2)$ search is considered. Assume that the boundary $S$ between success and failure can be represented by a continuous curve in the $x$-$y$ domain and that $x \in [x_0, x_1]$ and $y \in [y_0, y_1]$ define the parameter space $P$. The goal is to find the boundary $S$ to within $x_{\mathrm{tol}}$ along the $x$-direction and $y_{\mathrm{tol}}$ along the $y$-direction.
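Under these definitions, the search along a single column reduces to a standard bisection in $y$. The sketch below is illustrative only: `f` is a hypothetical stand-in for the expensive simulation, returning 1 above the line $y = 0.75x + 0.13$ (the test problem used later in this work).

```python
def f(x, y):
    """Hypothetical stand-in for the expensive binary simulation."""
    return 1 if y > 0.75 * x + 0.13 else 0

def bisect_column(f, x, y0, y1, ytol):
    """Return a y-value within ytol of the boundary S along the column at x,
    or None if f does not change value on [y0, y1] (nothing to bracket)."""
    f0, f1 = f(x, y0), f(x, y1)
    if f0 == f1:
        return None                # boundary does not cross this column
    while y1 - y0 > ytol:
        ym = 0.5 * (y0 + y1)
        if f(x, ym) == f0:
            y0 = ym                # transition lies in the upper half
        else:
            y1 = ym                # transition lies in the lower half
    return 0.5 * (y0 + y1)

# Each column costs O(log((y1 - y0)/ytol)) evaluations instead of
# O((y1 - y0)/ytol); at x = 0.5 the true crossing is y = 0.505.
print(bisect_column(f, 0.5, 0.0, 1.0, 1e-3))
```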

In some instances, the bisection could fail to provide an advantage. For example, consider the rightmost portion of the example provided in Fig. 1. For the two rightmost columns, the bisection would not bracket the solution because the solution is not there to bracket. Only after the requisite number of evaluations $(O(\log N))$ would we be able to conclude that the solution $S$ does not extend into that region. If we could leverage existing knowledge about the locus of the solution, perhaps we could reduce or eliminate evaluating these points. This is the focus of the approximation scheme described below.

## Serial Improvements

To reduce the number of function evaluations even further, an approximation scheme can be used that leverages information already gathered about the location of the surface $S$. These approximations serve as a guide for the remaining bisections. The types of approximation used include all of the interpolation types supported by the GNU Scientific Library (21): linear, polynomial, cubic spline, periodic cubic spline, Akima spline, and periodic Akima spline. Each type of interpolation was tested on a boundary represented by a line, a parabola, a ninth-degree polynomial, and a sine. The results, in terms of the number of function evaluations, are presented in Table 1. They suggest that simple linear interpolation is adequate for the vast majority of function types.

| Type | Line | Parabola | Poly-9 | Sine |
|---|---|---|---|---|
| Akima | 9336 | 8153 | 3870 | 20,243 |
| Akima-periodic | 10,434 | 10,278 | 3580 | 20,859 |
| Cspline | 4421 | 3922 | 3574 | 20,859 |
| Cspline-periodic | 21,164 | 5313 | 3578 | 20,985 |
| Linear | 4421 | 4694 | 3574 | 20,859 |
| None | 21,021 | 21,002 | 2724 | 21,021 |
| Polynomial | 341,670 | 320,794 | 3749 | 340,084 |


The functions examined are all smooth, so the results are quite impressive. Cubic splines perform very well, but their performance is worse than the plain bisection method in the case of the ninth-degree polynomial, which oscillates wildly outside of the parameter space.
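As an illustration of how an interpolant can guide the remaining bisections, the sketch below seeds each new column's bracket with a linear prediction from previously located boundary points. The function names and the bracket half-width are illustrative assumptions, not the GSL-based implementation described above.

```python
def f(x, y):
    """Hypothetical stand-in for the simulation: 1 above y = 0.75*x + 0.13."""
    return 1 if y > 0.75 * x + 0.13 else 0

def predict_y(known, x):
    """Linear interpolation/extrapolation from the last two boundary points."""
    (xa, ya), (xb, yb) = known[-2], known[-1]
    return ya + (yb - ya) * (x - xa) / (xb - xa)

def seeded_bracket(f, x, y_guess, y0, y1, width):
    """Use a narrow bracket around the prediction if it captures the
    transition; otherwise fall back to the full [y0, y1] range."""
    lo, hi = max(y0, y_guess - width), min(y1, y_guess + width)
    if f(x, lo) != f(x, hi):
        return lo, hi              # good guess: bisection starts almost done
    return y0, y1                  # guess missed: pay the full O(log N) cost

known = [(0.0, 0.13), (0.01, 0.1375)]   # boundary points from earlier columns
guess = predict_y(known, 0.02)          # exact for this linear boundary: 0.145
print(seeded_bracket(f, 0.02, guess, 0.0, 1.0, 0.01))
```

When the prediction is accurate, each column needs only $O(\log(\text{width}/y_{\mathrm{tol}}))$ evaluations; a bad prediction costs two wasted evaluations plus the ordinary bisection.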

## Parallel Algorithm

This problem can be parallelized easily because each function evaluation is assumed to be independent of all other design points. Therefore, little extra work is required to parallelize the algorithm: One must only assign a single $x$-value to each node and allow the bisection for that $x$-value to take place on that node. The number of function evaluations does not change when the algorithm is parallelized in this way.

A parallel version of the search algorithm was developed using MPI (Message Passing Interface) over Ethernet. No attempt to balance the load was made in this initial trial. Because the number of design points in the $x$-direction was fixed (based on the tolerance), each processor was assigned a fixed number of evaluations in that direction. For each $x$ design point, bisection was performed in the $y$-direction until the required tolerance was reached. The test problem was to find the threshold given by $y = 0.75x + 0.13$ on the domain $x_0 = 0$, $x_1 = 1$, $y_0 = 0$, and $y_1 = 1$. The model returned a 1 if the design point was above the line and a 0 if it was below. The required tolerances were $x_{\mathrm{tol}} = 10^{-2}$ and $y_{\mathrm{tol}} = 10^{-3}$. No interpolation was performed to reduce the number of iterations.
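The static assignment can be sketched as follows. The actual implementation uses MPI; purely as an illustration, this sketch maps columns to workers with Python's `multiprocessing`, using the same test threshold and tolerances as above.

```python
from multiprocessing import Pool

def f(x, y):
    """Test model: 1 above the line y = 0.75*x + 0.13, else 0."""
    return 1 if y > 0.75 * x + 0.13 else 0

def bisect_column(x, y0=0.0, y1=1.0, ytol=1e-3):
    """Bisection in y for one x-column; returns (x, boundary_y or None)."""
    f0 = f(x, y0)
    if f0 == f(x, y1):
        return (x, None)           # no transition in this column
    while y1 - y0 > ytol:
        ym = 0.5 * (y0 + y1)
        if f(x, ym) == f0:
            y0 = ym
        else:
            y1 = ym
    return (x, 0.5 * (y0 + y1))

if __name__ == "__main__":
    xtol = 1e-2
    xs = [i * xtol for i in range(int(round(1 / xtol)) + 1)]  # fixed x grid
    with Pool(4) as pool:                    # static chunks, no load balancing
        boundary = pool.map(bisect_column, xs)
    print(len(boundary))                     # 101 columns resolved
```

Because `Pool.map` hands each worker a fixed chunk of columns up front, this mirrors the unbalanced MPI version: the evaluation count is unchanged, but idle time appears whenever per-evaluation cost varies.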

There are three aspects of task distribution that could prevent the parallelization from achieving linear speedup.

1. The number of tasks may not divide evenly among the available processors. For example, four function evaluations do not divide equally across three processors: while one processor finishes the final function evaluation, the other two processors sit idle.

2. Each processor may be assigned a different number of function evaluations due to predictive algorithms. For example, if a good guess of where the interface lies can be made from previous results, fewer iterations may be required to achieve the specified tolerance. Therefore, processors for which a predictive algorithm is effective may finish their tasks earlier than other processors.

3. Each function evaluation may require a different amount of time. For example, in the case of the ion-strike simulations described previously, the average time to determine whether a device fails for a single design point is 4 h, yet some failure conditions can be detected early in the simulation (on the order of 20 min of simulation time), and some simulations that come close to failure but ultimately survive can require 10–20 h of CPU time. Therefore, depending on the number of processors, it is conceivable that several processors may sit idle while the last few simulations finish.

In the foregoing tests, the number of function evaluations (1200) is much greater than the number of processors (up to 17); therefore, the load in terms of function evaluations per processor (point 1) should be nearly even. Concerning point 2, no predictive algorithm is used, so the number of function evaluations does not change based on previous results. Finally, if each function evaluation requires the same amount of time, then we expect linear speedup, as shown in Fig. 2 for the constant function evaluation time case. These results were obtained on a homogeneous cluster.

In real engineering systems, however, we cannot depend on each function evaluation being equivalent in terms of computational requirements (point 3). Processors that finish early must wait for the stragglers at the end of the analysis, which reduces the overall speedup, as shown in Fig. 2. In the random function evaluation time case, a random amount of sleep time was inserted before the function returned a 1 or a 0 to simulate varying function evaluation times. This loss of linear speedup can be addressed through a load balancing scheme.

The strategy used to accomplish load balancing is a simple client-server architecture. The server maintains a queue of tasks to be completed. Initially, the queue contains, in order, the low and high $y$-values for each $x$-value. The server doles out tasks from the front of the queue; when a task completes, a follow-up task is generated and placed at the end of the queue, so a new task can be assigned in constant time. This form of load balancing addresses all three sources of degradation of linear speedup, although it requires a dedicated server to dole out design points and incurs somewhat more communication overhead. Nevertheless, the results are dramatically improved. Figure 3 shows the maximum wait time of any processor in the group and the average wait time over all processors as a function of the number of processors. Without load balancing, the wait time increases with the number of processors; with load balancing, it is decreased significantly. Therefore, the efficiency of this strategy is nearly perfect, leading to scalable speedup up to 17 processors.
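The server's queue discipline can be sketched in a single process: tasks are popped from the front, each task advances one column by a single bisection step, and follow-up tasks are appended to the back. Names here are illustrative; the real implementation distributes the pop/push over MPI clients, and would also handle columns with no transition.

```python
from collections import deque

def f(x, y):
    """Test model: 1 above the line y = 0.75*x + 0.13, else 0."""
    return 1 if y > 0.75 * x + 0.13 else 0

ytol = 1e-3
# Initial queue: one (x, y_low, y_high, f_at_low) task per column, in order.
queue = deque()
for i in range(101):
    x = i * 0.01
    queue.append((x, 0.0, 1.0, f(x, 0.0)))

boundary = {}
while queue:
    x, y0, y1, f0 = queue.popleft()     # server hands out the front task
    if y1 - y0 <= ytol:
        boundary[x] = 0.5 * (y0 + y1)   # column converged
        continue
    ym = 0.5 * (y0 + y1)                # one bisection step = one "simulation"
    if f(x, ym) == f0:
        queue.append((x, ym, y1, f0))   # follow-up task goes to the back
    else:
        queue.append((x, y0, ym, f0))

print(len(boundary))                    # all 101 columns resolved
```

Because every pop and append on the deque is constant time, assigning a new task never blocks the server, and slow and fast columns interleave naturally instead of serializing at the end of the run.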

## Discussion

The performance of the parallel algorithm can be improved by including the approximation scheme used in the serial algorithm. However, this introduces additional complexity. As processors are added to the simulation, the accuracy of the approximation decreases, as illustrated in the following example. If all execution occurs on a single processor, then when evaluation $k$ is assigned, $(k-1)$ other results may be used in the approximation. On the other hand, if ten clients are used, then when evaluation $k$ is assigned, at most $(k-10)$ other results may be used. Furthermore, the approximation in this case is an interpolator, meaning that it must pass through each point. For these reasons, it is extremely difficult to place bounds on the number of evaluations required for a particular number of processors and type of approximation. As illustrated in the case of the ninth-degree polynomial, the number of function evaluations does not always decrease when using an approximation.

Although the example was cast in a two-dimensional space, the concepts presented here can be extended to multiple dimensions with a multidimensional bisection algorithm (22). In fact, the two-dimensional problem may also be improved with a multidimensional bisection approach. However, the scalability is then no longer a simple function of the problem size, and the parallelization is no longer trivial.

Other improvements could include combining an approximation strategy with a load balancing strategy. This is difficult because the approximation, and thus the scalability, worsens as the number of processors increases. Also, the interpolators used here should be replaced with approximations better suited to partial results, which are inexact at best.

## Conclusions

Binary objectives occur naturally in many areas of science and engineering; therefore, understanding these types of problems is important to the efficient analysis and design of engineering systems. Examining such problems by forming a mesh over the parameter space seems to be the accepted solution, and it is certainly easy to extend to $n$ dimensions. However, complicated physical models often require vast computing resources, and an algorithm that scales better and reduces the computational cost is essential. We have shown that a simple bisection algorithm and a simple parallelization with load balancing can reduce the computational cost by an order of magnitude. These results are limited to a boundary that is continuous in the parameter space, and we did not test the scheme in higher dimensions. Nevertheless, these results lay the framework for continued study of optimization problems with binary objectives.