|
1.
Load
balancing for
parallel computing
1.1. Flexibly assignable tasks
1.2. One-dimensional partitioning
1.3. Partitioning for complex objectives
2.
Interprocesssor
communication
2.1. Communication with memory constraints
2.2. Determining communication patterns
3.
Interconnection
networks
4.
Sparse
Matrix Computations
4.1. Improving memory performance of
sparse matrix vector products
4.2. Finding nonoverlapping dense blocks
4.3. Permuting a sparse matrix to block angular form
4.4. The nice basis problem
5.
Vulnerability
analysis of the power grid
6.
Graph
algorithms
6.1. The inhibiting bisection problem
6.2. Identifying strongly connected
components in parallel
7.
Supernova
factory scheduler
8.
Indexing
massive data
9.
Compressing
sequences with invariant transition frequencies
1. Load
balancing
Distributing
loads equally among processors is a fundamental problem for parallel
computing. While load
balancing has been the subject of numerous efforts, it continues to draw interest
due to the need for higher concurrencies and the need for better
integration with the underlying algorithms.
1.1. Exploiting flexibly-assignable work
In
many parallel applications, distribution of data implies distribution of
work among processors, but there are exceptions where some tasks can be
assigned to one of several processors without altering the total volume of
communication. Examples include force computation between two particles in
different processors, or node computations in finite-element simulations
when element partitions are used.
We studied how to exploit this flexibility in assignment of tasks to
improve load balance. We first
modeled the problem as a flow problem and used combinatorial techniques to
solve it. Our parametric
search algorithms use maximum flow algorithms for probing a candidate
optimal solution value. We also defined augmenting paths and cuts for this
problem, and showed that any algorithm based on augmenting paths could be
used to solve the task assignment problem optimally. Then we showed that a continuous
version of this problem reduces to the well-studied linearly constrained
least squares problem, which is easily amenable to parallelization. Our
experiments on a molecular dynamics and overlapped Schwartz domain decomposition
data sets verified the effectiveness of our methods.
Relevant
Publications:
A. Pinar and B. Hendrickson, Improving
Load Balance with Flexibly Assignable Tasks, IEEE Transactions on Parallel and Distributed Systems, Vol: 16, No: 10, pages:
956--965, 2005. (pdf)
A. Pinar and B. Hendrickson, Exploiting
Flexibly Assignable Work to Improve Load Balance Proc. ACM 14th Symp.
Parallel Algorithms and Architectures SPAA~2002, pages: 155-163, (postscript).
1.2. One-dimensional partitioning
Although
asymptotically efficient algorithms have been designed for one-dimensional
partitioning, practitioners still use heuristics hoping to achieve good
partitions, with the misconception that exact algorithms are hard to
implement and not run-time efficient.
We proposed run-time efficient algorithms along with implementation
tips. Experiments showed that, on average, our algorithms are 100 times
faster than a matrix-vector product for 64-way partitions.
We
have also enhanced our techniques for heterogeneous systems, where
processors have different processing powers. We showed that when the order
of processors is fixed, the problem could be solved optimally, by enhancing
our techniques for the homogenous case. The problem comes NP-complete, when
we allow permutation of processors.
Relevant
Publications:
A. Pinar, E. K. Tabak, and C.
Aykanat, One-dimensional partitioning for heterogeneous systems:
theory and practice, submitted to Journal
of Parallel and Distributed Computing.(pdf)
A. Pinar and C. Aykanat, Fast
Optimal Load Balancing Algorithms for 1D Partitioning, to appear in Journal of Parallel and Distributed Computing, (pdf).
A. Pinar and C. Aykanat, Sparse
Matrix Decomposition with Optimal Load Balancing, Proceedings of International
Conference on High Performance Computing, pages: 224-229, Dec., 1997.(postscript)
1.3.
Graph partitioning for complex objectives
Unstructured
computations can often be described as undirected graphs in which vertices
correspond to computations and edges reflect data dependencies. A graph
partitioner can be employed to partition the vertices of the graph into
disjoint sets of equal weight (for load balance) while keeping small the
number of edges crossing between sets (low communication volume). For some
computations however, the load is a complicated function of the partition,
and the work per subdomain cannot be assessed until the partition is
computed. Examples of this phenomenon include balancing computation plus
communication; domain decomposition techniques where incomplete or complete
factorization is used at each subdomain; and overlapped subdomain
partitions where boundary vertices are duplicated in neighboring domains. We proposed a framework for addressing
such complex partitioning problems and applied these ideas to achieve equal
sized overlapped subdomains and balanced computation plus communication
load.
Relevant
Publications:
A. Pinar and B. Hendrickson, Partitioning
for Complex Objectives, In Proc. Irregular 01 (postscript)
2. Interprocessor
communication
Many
applications involve periodic redistribution of workloads and associated data,
which requires special communication routines. The first problem below is
about determining the communication pattern after a calculation is first
defined, and the second problem studies how to migrate data subject to
memory constraints.
2.1. Determining communication
patterns in parallel
After subdomains are assigned, processors can
determine what data they need to receive, but will not know which
processors have them. A simple solution is to have the processors exchange
messages in a ring fashion, but this will be slow and is not scalable. An
efficient solution enables a more convenient interface between different
software components and reduces the burden on a repartitioning tool to
update the communication information. We proposed using a distributed directory ---a rendezvous algorithm--- where each
processor is responsible for maintaining directory information for a subset
of the components.
Relevant
Publications:
A. Pinar and B. Hendrickson, Communication
Support for Adaptive Computation, in Proc. SIAM
Parallel Processing 01 , (postscript).
2.2. Interprocessor communication with limited
memory
While
redistributing data among processors, each processor allocates space for
its incoming data, posts asynchronous receives, and then sends its outgoing
data. Memory for outgoing data
can be deallocated only after the send is complete. This requires each
processor to have space simultaneously for both its outgoing and incoming data, and fails otherwise.
To overcome this problem, we can send data in phases. After each phase, processors can
free the memory of the data they sent, to make it available for the next
phase. We proved that
scheduling messages for the minimum number of phases is NP-Complete, and
showed how the problem can be phrased as an instance of multi-commodity
flow.
We
also studied a continuous approximation, which yields solutions with at
most two more phases than that of a discrete solution. Finally, we devised a
practical 1.5-approximation algorithm. Our empirical studies verified the
validity of our model and effectiveness of our methods.
Relevant
Publications:
A. Pinar and B. Hendrickson, Interprocessor
Communication with Limited Memory, IEEE
Transactions on Parallel and Distributed Systems, Vol. 15, pages: 606--616,
2004. (pdf).
A. Pinar and B. Hendrickson, Interprocessor
Communication with Memory Constraints, in Proceedings of SPAA 2000 . (postscript)
3. Interconnection network design
As we enter the era of peta-scale computing,
system architects must plan for machines composed of tens or even hundreds
of thousands of processors.
Although fully connected networks such as fat-tree configurations
currently dominate HPC interconnect designs, such approaches are inadequate
for such ultra-scale concurriencies due to the superlinear growth of
component costs. Traditional
low-degree interconnect topologies, such as 3D tori, have reemerged as a
competitive solution due to the linear scaling of system components
relative to the node count; however, such networks are poorly suited for
the requirements of many scientific applications at extreme
concurrencies. To address
these limitations, we propose HFAST, a hybrid switch architecture that uses
circuit switches to dynamically reconfigure lower-degree interconnects to
suit the topological requirements of a given scientific application.
Relevant
Publications:
S. Kamil, L. Oliker, A. Pinar,
and J. Shalf, “Communication Requirements and Interconnect Optimization
for High-End Scientific Applications," submitted to IEEE Transaction on Parallel and
Distributed Computing. (pdf)
S. Kamil, A. Pinar, D. Gunter,
M. Lijewski, L. Oliker, and J. Shalf, Reconfigurable hybrid
interconnection for static and dynamic scientific applications, In Proc. 2007 ACM International Conference on
Computing Frontiers. (pdf)
4. Sparse Matrix Computations
Sparse
matrix computations lie at the heart of many scientific computing
applications, and permutations of sparse matrices have drawn much interest
as they can significantly
improve performance or enable
application of special algorithms.
4.1.
Improving performance of matrix-vector multiplication
Sparse
matrix-vector multiplication often suffers from poor memory performance due
to memory indirections used to exploit sparsity. We proposed alternative
data structures along with reordering algorithms to improve the
effectiveness of these data structures to reduce the number of memory
indirections. We packed nonzeros in consecutive positions into a block to
do only one memory indirection per block.
We also studied the problem of permuting nonzeros
of a matrix into contiguous positions and showed it can be stated as the
traveling salesperson problem. Our experiments produced improvements of up
to 33\% and improvements of
21\% on average.
Relevant
Publications:
A. Pinar and M. T. Heath, Improving Performance of
Sparse Matrix-Vector Multiplication, Proceedings of SC’99
,
Portland, November 1999. (postscript)
4.2. Finding dense blocks of a sparse
matrix
Due
to the low compute-to-memory ratio and irregular memory access patterns,
the performance of sparse matrix kernels is often far from the peak
performance on modern processors. Alternative matrix representations have
been proposed, where the matrix $A$ is split into $A_d$ and $A_s$, so that
$A_d$ contains all dense blocks of a specified form, and $A_s$ contains the
remainder. This facilitates using dense matrix kernels on the entries of
$A_d$. We study the problem of
finding a maximum number of nonoverlapping rectangular dense blocks in a
sparse matrix. We showed that
the maximum nonoverlapping dense blocks problem is NP-complete by reduction
from the maximum independent set problem on cubic planar graphs. We also proposed a
$2/3$-approximation algorithm for $2\times 2$ blocks that runs in linear
time in the number of nonzeros in the matrix. We discussed alternatives to
rectangular blocks such as diagonal blocks and cross blocks and presented
complexity analysis and approximation algorithms for these cases.
Relevant
Publications:
A. Pinar and V. Vassilevska, Finding
Nonoverlapping Dense Blocks of a Sparse Matrix, Electronic
Transactions on Numerical Analysis, Special issue on combinatorial scientific
computing, Vol: 21, pages: 107--124, 2005. (pdf)
4.3. Permuting
a sparse matrix to block-angular form
Despite numerous algorithms designed for solving
block-angular linear programs, the problem of permuting a matrix to
block-angular form is relatively uninvestigated. We proposed hypergraph
models to represent rectangular sparse matrices, which reduce the problem
to that of hypergraph partitioning. We also devised graph models and
associated techniques to adapt graph partitioning techniques for this
problem.
Relevant
Publications:
C. Aykanat, A. Pinar, and U.
Catalyurek, Permuting Sparse Rectangular Matrices into Block-Diagonal
Form, SIAM Journal on Scientific Computing, Vol. 25, No. 6, pages:
1860--1879, (pdf).
A. Pinar and C. Aykanat, An
Effective Model to Decompose Linear Programs for Parallel Solution, Lecture Notes in Computer
Science Vol: 1184, pages: 592-601. (postscript)
A. Pinar, U. V. Catalyurek, C.
Aykanat, M. C. Pinar, Decomposing Linear Programs for Parallel
Solution,
Lecture Notes in Computer Science, Vol: 1041, pages: 473-482, 1996. (postscript)
4.4.
Nice basis problem
We proposed a new combinatorial approach towards
constructing a sparse, implicit basis for the null space of a sparse,
under-determined matrix. Our approach was to compute a column space basis
of that has a sparse inverse that could be used to represent a null space
basis in implicit form. We investigated three algorithms for computing
column space bases: two greedy algorithms based on graph matchings, and a
third that employs a divide and conquer strategy implemented with
hypergraph partitioning followed by a matching. Our results showed that for
many matrices from linear programming, structural analysis, and circuit
simulation, it was possible to compute column space bases having sparse
inverses, contrary to conventional wisdom. The hypergraph partitioning
method yields sparser basis inverses and computationally more efficient,
relative to the greedy approaches. We also discussed the complexity of
selecting a column space basis in block diagonal form with a given block
size.
Relevant
Publications:
A. Pinar, E. Chow, and A.
Pothen, Combinatorial Techniques for Constructing Sparse Null-space
Bases,” Electronic Transactions on Numerical Analysis, special volume on saddle point
problems: numerical solution and applications, Vol. 22, pages: 122--145,
2006. (pdf)
5. Vulnerability analysis of the
electric power grid
Robust operation of a power system requires the
system to stay intact after a small number of broken lines, and thus calls
for algorithms to detect small groups of lines that can cause a blackout if
they fail collectively. The flow of power can be modeled with a system of
nonlinear equations, and a blackout corresponds to the infeasibility of
these equations. Thus the vulnerability detection problem for the power
grid can be studied as a mixed integer nonlinear optimization problem,
which is a hard problem to solve by itself. We have reduced this problem to a pure combinatorial
problem, known as the network inhibition problem, by analyzing the structure
of an optimal solution to the mixed integer nonlinear optimization problem.
Our key observation was the relation between the feasibility surface of the
power flow equations and spectral graph theory. Instead of solving the NP-complete network inhibition
problem, we have introduced and studied the inhibiting bisection problem,
where we look for a bisection of the graph with a minimum number of edges
between the two parts, and a maximum production/consumption mismatch within each part.
Relevant
Publications:
A. Pinar, J. Meza, V. Donde, and
B. Lesieutre, “Optimization Strategies for the Vulnerability Analysis of
the Power Grid,
submitted to SIAM Journal on Optimization (pdf)
A. Pinar, A. Reichert, and B.
Lesieutre, Computing Criticality of Lines in a Power System, in Proc. 2007 IEEE International Symposium on Circuits and
Systems,
New Orleans, LA, May 2007. (pdf)
V. Donde, V. Lopez, B.
Lesieutre, A. Pinar, C. Yang, and J. Meza, Identification of severe
multiple contingencies in electric power networks, to appear in IEEE Transactions on Power Systems. (pdf)
B. Lesieutre, A. Pinar, and S.
Roy, Power System Extreme Event Detection: The Vulnerability Frontier, Proc. Hawaii International
Conference on System Sciences, 2008. (pdf)
A. Pinar, Y. Fogel, and B.
Lesieutre, The Inhibiting Bisection Problem, submitted to ACM 19th
Symposium Parallel Algorithms and Architectures (SPAA) 2007 (pdf)
B. Lesieutre, S. Roy, V. Donde,
and A. Pinar, Power system extreme event analysis using graph
partitioning,
Proc. of the North American Power Symposium, Carbondale, IL, October 2006. (pdf)
V. Donde, V. Lopez, B.
Lesieutre, A. Pinar, C. Yang, and J. Meza, Identification of severe
multiple contingencies in electric power networks, Proc. the North American Power
Symposium}, Ames, IA, October 2005.(pdf)
6. Graph Algorithms
6.1. Inhibiting bisection problem
Relevant Publications:
A. Pinar, Y. Fogel, and B.
Lesieutre, The Inhibiting Bisection Problem, submitted to ACM 19th
Symposium Parallel Algorithms and Architectures (SPAA) 2007 (pdf)
6.2. Identifying strongly connected
components in parallel
The
standard serial algorithm for finding strongly connected components is
based on depth-first search, which is difficult to parallelize. We designed
a divide-and-conquer algorithm using reachability information, which is
amenable to parallelization. The expected run time for our algorithm is
$O(|E|\lg |V|)$. Our work was motivated by the discrete ordinates method
for modeling radiation transport. In this method, the object to be studied
is modeled by finite elements. Each element is a vertex in our graph and an
edge connects elements that share a face. The radiation equations are
approximated by an angular discretization, and edges in the graph are
directed to align with the angle.
The computations for an element can be performed only after those of
all its predecessors. The numerical calculations must be modified if the
graph has a cycle. Our algorithm detects these cycles to enable the
computation to be modified.
Relevant
Publications:
L. K. Fleischer, B. Hendrickson
and A. Pinar, On Identifying Strongly Connected Components in
Parallel,
to Proc. of Irregular’2000 , Lecture Notes in Computer
Science, Vol. 1586, pages 505--511, (postscript).
7. Supernova factory scheduler
Observing
Type Ia supernovae can help us answer some fundamental questions about the
nature of the universe. Type Ia supernovae are extraordinarily bright,
remarkably uniform objects, which make excellent "candles" for
measuring the expansion rate of the universe, and they remain bright enough
only for a few weeks. For a complete experiment, each supernova should be
observed several times at regular intervals during its lifetime, and the
value of an experiment degrades significantly when the distance between
subsequent observations is too long.
While we would like to collect as much data as possible, we have to
do this with limited resources, i.e., a single telescope. This calls for
carefully selecting a subset of the supernovae that can be observed
regularly, and constructing a schedule for most effective utilization of
the telescope, which is a combinatorial problem in nature, and nonlinearity
provides accuracy and efficiency in formulation.
8.
Compressing bitmap indices
Bitmap indexing, which is commonly used for scientific databases, rely on
variants of runlength encoding for a compact index structure. Runlength encoding exploits uniform
segments of data for compaction, and thus their performance is directly affected by
availability of such segments, which depend on the organization of the
data. We have proposed the
tuple reordering problem to
reorganize database tuples for best run-length encoding performance. The problem can be reduced to
the TSP problem, but the sizes
of the databases make using standard TSP methods prohibitively
expensive. We proposed the Gray-code ordering method,
which is a linear-time, and in-place algorithm, making it easily applicable
to databases, Our experiments on benchmark data sets showed the compression ratio could be improved by a factor of 2
to 10.
Relevant
Publications:
G. Canahuate, H.
Ferhatosmanoglu, and A. Pinar, Improving bitmap index compression by
data reorganization, submitted to IEEE
Transactions on Knowledge and Data Engineering. (pdf)
A. Pinar, T. Tao, and H. Ferhatosmanoglu, Compressing
Bitmap Indices by Data Reorganization, Proc. 21st
International Conference on Data Engineering (ICDE05), pages: 310--321. (pdf)
9. Compacting
sequences to preserve transition frequencies
Simulation-based power estimation is commonly
used in VLSI design for its high accuracy despite excessive computation
times. Techniques have been
proposed to speed it up by compacting an input sequence while preserving
its power-consumption characteristics. We proposed a novel method to
compact a sequence that preserves transition frequencies. We proved the problem is
NP-complete, and proposed a graph model to reduce it to that of finding a
heaviest-weighted trail, and a heuristic utilizing this model. Experiments
showed that power dissipation can be estimated with an error of only 2.3\%,
while simulation times were reduced by 10.
Relevant
Publications:
A. Pinar and C.L. Liu, Power
Invariant Vector Sequence Compaction for Combinational Circuits, ACM
Transactions on Design Automation of Electronic Systems, Volume: 8, No: 2, pages 214 --
221, (postscript).
A. Pinar and C.L. Liu, Power Invariant Vector
Sequence Compaction, Proceedings of 1998 IEEE/ACM International Conference on Computer
Aided Design, pages: 473-476, Nov, 1998. (postscript)
|