Ali Pinar’s   Projects


 Home

Current Projects

Previous Projects

Publications

Presentations

CV

Outreach

 

1.     Load balancing for parallel computing

1.1.  Flexibly assignable tasks

1.2.  One-dimensional partitioning

1.3.  Partitioning for complex objectives

2.     Interprocesssor communication

2.1.  Communication with memory constraints

2.2.  Determining communication patterns

3.     Interconnection networks

4.     Sparse Matrix Computations

4.1.  Improving memory performance of sparse matrix vector products

4.2.  Finding nonoverlapping dense blocks

4.3.  Permuting a sparse matrix to block angular form

4.4.  The nice basis problem

5.     Vulnerability analysis of the power grid

6.     Graph algorithms

6.1.  The inhibiting bisection problem

6.2.  Identifying strongly connected components in parallel

7.     Supernova factory scheduler

8.     Indexing massive data

9.     Compressing sequences with invariant transition frequencies

1. Load balancing

Distributing loads equally among processors is a fundamental problem for parallel computing.   While load balancing has been the subject of numerous efforts, it continues to draw interest due to the need for higher concurrencies and the need for better integration with the underlying algorithms.

1.1. Exploiting flexibly-assignable work

In many parallel applications, distribution of data implies distribution of work among processors, but there are exceptions where some tasks can be assigned to one of several processors without altering the total volume of communication. Examples include force computation between two particles in different processors, or node computations in finite-element simulations when element partitions are used.  We studied how to exploit this flexibility in assignment of tasks to improve load balance.  We first modeled the problem as a flow problem and used combinatorial techniques to solve it.  Our parametric search algorithms use maximum flow algorithms for probing a candidate optimal solution value. We also defined augmenting paths and cuts for this problem, and showed that any algorithm based on augmenting paths could be used to solve the task assignment problem optimally.  Then we showed that a continuous version of this problem reduces to the well-studied linearly constrained least squares problem, which is easily amenable to parallelization. Our experiments on a molecular dynamics and overlapped Schwartz domain decomposition data sets verified the effectiveness of our methods.

Relevant Publications:

A. Pinar and B. Hendrickson, Improving Load Balance with Flexibly Assignable Tasks, IEEE Transactions on Parallel and Distributed Systems, Vol: 16, No: 10, pages: 956--965, 2005. (pdf)

A. Pinar and B. Hendrickson, Exploiting Flexibly Assignable Work to Improve Load Balance Proc. ACM 14th Symp. Parallel Algorithms and Architectures SPAA~2002, pages: 155-163, (postscript).

1.2. One-dimensional partitioning

Although asymptotically efficient algorithms have been designed for one-dimensional partitioning, practitioners still use heuristics hoping to achieve good partitions, with the misconception that exact algorithms are hard to implement and not run-time efficient.  We proposed run-time efficient algorithms along with implementation tips. Experiments showed that, on average, our algorithms are 100 times faster than a matrix-vector product for 64-way partitions.

We have also enhanced our techniques for heterogeneous systems, where processors have different processing powers. We showed that when the order of processors is fixed, the problem could be solved optimally, by enhancing our techniques for the homogenous case. The problem comes NP-complete, when we allow permutation of processors. 

Relevant Publications:

A. Pinar, E. K. Tabak, and C. Aykanat, One-dimensional partitioning for heterogeneous systems: theory and practice, submitted to Journal of Parallel and Distributed Computing.(pdf)

A. Pinar and C. Aykanat, Fast Optimal Load Balancing Algorithms for 1D Partitioning, to appear in Journal of Parallel and Distributed Computing, (pdf).

A. Pinar and C. Aykanat, Sparse Matrix Decomposition with Optimal Load Balancing, Proceedings of International Conference on High Performance Computing, pages: 224-229, Dec., 1997.(postscript)

1.3. Graph partitioning for complex objectives

Unstructured computations can often be described as undirected graphs in which vertices correspond to computations and edges reflect data dependencies. A graph partitioner can be employed to partition the vertices of the graph into disjoint sets of equal weight (for load balance) while keeping small the number of edges crossing between sets (low communication volume). For some computations however, the load is a complicated function of the partition, and the work per subdomain cannot be assessed until the partition is computed. Examples of this phenomenon include balancing computation plus communication; domain decomposition techniques where incomplete or complete factorization is used at each subdomain; and overlapped subdomain partitions where boundary vertices are duplicated in neighboring domains.  We proposed a framework for addressing such complex partitioning problems and applied these ideas to achieve equal sized overlapped subdomains and balanced computation plus communication load.

Relevant Publications:

A. Pinar and B. Hendrickson, Partitioning for Complex Objectives, In Proc. Irregular 01  (postscript)

 2. Interprocessor communication

Many applications involve periodic redistribution of workloads and associated data, which requires special communication routines. The first problem below is about determining the communication pattern after a calculation is first defined, and the second problem studies how to migrate data subject to memory constraints.

2.1. Determining communication patterns in parallel

After subdomains are assigned, processors can determine what data they need to receive, but will not know which processors have them. A simple solution is to have the processors exchange messages in a ring fashion, but this will be slow and is not scalable. An efficient solution enables a more convenient interface between different software components and reduces the burden on a repartitioning tool to update the communication information. We proposed using a distributed directory ---a rendezvous algorithm--- where each processor is responsible for maintaining directory information for a subset of the components.

Relevant Publications:

A. Pinar and B. Hendrickson, Communication Support for Adaptive Computation, in Proc. SIAM Parallel Processing 01 , (postscript).

2.2. Interprocessor communication with limited memory

While redistributing data among processors, each processor allocates space for its incoming data, posts asynchronous receives, and then sends its outgoing data.  Memory for outgoing data can be deallocated only after the send is complete. This requires each processor to have space simultaneously for both its outgoing and  incoming data, and fails otherwise. To overcome this problem, we can send data in phases.  After each phase, processors can free the memory of the data they sent, to make it available for the next phase.  We proved that scheduling messages for the minimum number of phases is NP-Complete, and showed how the problem can be phrased as an instance of multi-commodity flow.

We also studied a continuous approximation, which yields solutions with at most two more phases than that of a discrete solution. Finally, we devised a practical 1.5-approximation algorithm. Our empirical studies verified the validity of our model and effectiveness of our methods.

Relevant Publications:

A. Pinar and B. Hendrickson, Interprocessor Communication with Limited Memory, IEEE Transactions on Parallel and Distributed Systems, Vol. 15, pages: 606--616, 2004. (pdf).

A. Pinar and B. Hendrickson, Interprocessor Communication with Memory Constraints, in Proceedings of SPAA 2000 . (postscript)

 3. Interconnection network design

As we enter the era of peta-scale computing, system architects must plan for machines composed of tens or even hundreds of thousands of processors.  Although fully connected networks such as fat-tree configurations currently dominate HPC interconnect designs, such approaches are inadequate for such ultra-scale concurriencies due to the superlinear growth of component costs.  Traditional low-degree interconnect topologies, such as 3D tori, have reemerged as a competitive solution due to the linear scaling of system components relative to the node count; however, such networks are poorly suited for the requirements of many scientific applications at extreme concurrencies.  To address these limitations, we propose HFAST, a hybrid switch architecture that uses circuit switches to dynamically reconfigure lower-degree interconnects to suit the topological requirements of a given scientific application.

Relevant Publications:

S. Kamil, L. Oliker, A. Pinar, and J. Shalf, “Communication Requirements and Interconnect Optimization for High-End Scientific Applications," submitted to IEEE Transaction on Parallel and Distributed Computing. (pdf)

S. Kamil, A. Pinar, D. Gunter, M. Lijewski, L. Oliker, and J. Shalf, Reconfigurable hybrid interconnection for static and dynamic scientific applications, In Proc. 2007 ACM International Conference on Computing Frontiers. (pdf)

4. Sparse Matrix Computations

Sparse matrix computations lie at the heart of many scientific computing applications, and permutations of sparse matrices have drawn much interest as they can  significantly improve  performance or enable application of special algorithms.  

4.1. Improving performance of matrix-vector multiplication

Sparse matrix-vector multiplication often suffers from poor memory performance due to memory indirections used to exploit sparsity. We proposed alternative data structures along with reordering algorithms to improve the effectiveness of these data structures to reduce the number of memory indirections. We packed nonzeros in consecutive positions into a block to do only one memory indirection per block.

We also studied the problem of permuting nonzeros of a matrix into contiguous positions and showed it can be stated as the traveling salesperson problem. Our experiments produced improvements of up to 33\% and improvements of  21\% on average.

Relevant Publications:

A. Pinar and M. T. Heath, Improving Performance of Sparse Matrix-Vector Multiplication, Proceedings of SC’99 , Portland, November 1999. (postscript)

4.2. Finding dense blocks of a sparse matrix

Due to the low compute-to-memory ratio and irregular memory access patterns, the performance of sparse matrix kernels is often far from the peak performance on modern processors. Alternative matrix representations have been proposed, where the matrix $A$ is split into $A_d$ and $A_s$, so that $A_d$ contains all dense blocks of a specified form, and $A_s$ contains the remainder. This facilitates using dense matrix kernels on the entries of $A_d$.  We study the problem of finding a maximum number of nonoverlapping rectangular dense blocks in a sparse matrix.  We showed that the maximum nonoverlapping dense blocks problem is NP-complete by reduction from the maximum independent set problem on cubic planar graphs.  We also proposed a $2/3$-approximation algorithm for $2\times 2$ blocks that runs in linear time in the number of nonzeros in the matrix. We discussed alternatives to rectangular blocks such as diagonal blocks and cross blocks and presented complexity analysis and approximation algorithms for these cases.

Relevant Publications:

A. Pinar and V. Vassilevska, Finding Nonoverlapping Dense Blocks of a Sparse Matrix, Electronic Transactions on Numerical Analysis, Special issue on combinatorial scientific computing, Vol: 21, pages: 107--124, 2005. (pdf)

4.3. Permuting a sparse matrix to block-angular form

Despite numerous algorithms designed for solving block-angular linear programs, the problem of permuting a matrix to block-angular form is relatively uninvestigated. We proposed hypergraph models to represent rectangular sparse matrices, which reduce the problem to that of hypergraph partitioning. We also devised graph models and associated techniques to adapt graph partitioning techniques for this problem.

Relevant Publications:

C. Aykanat, A. Pinar, and U. Catalyurek, Permuting Sparse Rectangular Matrices into Block-Diagonal Form, SIAM Journal on Scientific Computing, Vol. 25, No. 6, pages: 1860--1879, (pdf).

A. Pinar and C. Aykanat, An Effective Model to Decompose Linear Programs for Parallel Solution, Lecture Notes in Computer Science Vol: 1184, pages: 592-601. (postscript)

A. Pinar, U. V. Catalyurek, C. Aykanat, M. C. Pinar, Decomposing Linear Programs for Parallel Solution, Lecture Notes in Computer Science, Vol: 1041, pages: 473-482, 1996. (postscript)

4.4. Nice basis problem

We proposed a new combinatorial approach towards constructing a sparse, implicit basis for the null space of a sparse, under-determined matrix. Our approach was to compute a column space basis of that has a sparse inverse that could be used to represent a null space basis in implicit form. We investigated three algorithms for computing column space bases: two greedy algorithms based on graph matchings, and a third that employs a divide and conquer strategy implemented with hypergraph partitioning followed by a matching. Our results showed that for many matrices from linear programming, structural analysis, and circuit simulation, it was possible to compute column space bases having sparse inverses, contrary to conventional wisdom. The hypergraph partitioning method yields sparser basis inverses and computationally more efficient, relative to the greedy approaches. We also discussed the complexity of selecting a column space basis in block diagonal form with a given block size.

Relevant Publications:

A. Pinar, E. Chow, and A. Pothen, Combinatorial Techniques for Constructing Sparse Null-space Bases,” Electronic Transactions on Numerical Analysis, special volume on saddle point problems: numerical solution and applications, Vol. 22, pages: 122--145, 2006. (pdf)

5. Vulnerability analysis of the electric power grid

Robust operation of a power system requires the system to stay intact after a small number of broken lines, and thus calls for algorithms to detect small groups of lines that can cause a blackout if they fail collectively. The flow of power can be modeled with a system of nonlinear equations, and a blackout corresponds to the infeasibility of these equations. Thus the vulnerability detection problem for the power grid can be studied as a mixed integer nonlinear optimization problem, which is a hard problem to solve by itself.  We have reduced this problem to a pure combinatorial problem, known as the network inhibition problem, by analyzing the structure of an optimal solution to the mixed integer nonlinear optimization problem. Our key observation was the relation between the feasibility surface of the power flow equations and spectral graph theory.  Instead of solving the NP-complete network inhibition problem, we have introduced and studied the inhibiting bisection problem, where we look for a bisection of the graph with a minimum number of edges between the two parts, and a maximum   production/consumption mismatch within each part.

Relevant Publications:

A. Pinar, J. Meza, V. Donde, and B. Lesieutre, “Optimization Strategies for the Vulnerability Analysis of the Power Grid, submitted to SIAM Journal on Optimization (pdf) 

A. Pinar, A. Reichert, and B. Lesieutre, Computing Criticality of Lines in a Power System, in Proc. 2007 IEEE International Symposium on Circuits and Systems, New Orleans, LA, May 2007. (pdf)

V. Donde, V. Lopez, B. Lesieutre, A. Pinar, C. Yang, and J. Meza, Identification of severe multiple contingencies in electric power networks, to appear in IEEE Transactions on Power Systems. (pdf)

B. Lesieutre, A. Pinar, and S. Roy, Power System Extreme Event Detection: The Vulnerability Frontier, Proc. Hawaii International Conference on System Sciences, 2008. (pdf)

A. Pinar, Y. Fogel, and B. Lesieutre, The Inhibiting Bisection Problem, submitted to ACM 19th Symposium Parallel Algorithms and Architectures (SPAA) 2007 (pdf)

B. Lesieutre, S. Roy, V. Donde, and A. Pinar, Power system extreme event analysis using graph partitioning, Proc. of the North American Power Symposium, Carbondale, IL, October 2006. (pdf)

V. Donde, V. Lopez, B. Lesieutre, A. Pinar, C. Yang, and J. Meza, Identification of severe multiple contingencies in electric power networks, Proc. the North American Power Symposium}, Ames, IA, October 2005.(pdf)

6. Graph Algorithms

6.1. Inhibiting bisection problem

Relevant Publications:

A. Pinar, Y. Fogel, and B. Lesieutre, The Inhibiting Bisection Problem, submitted to ACM 19th Symposium Parallel Algorithms and Architectures (SPAA) 2007 (pdf)

6.2. Identifying strongly connected components in parallel

The standard serial algorithm for finding strongly connected components is based on depth-first search, which is difficult to parallelize. We designed a divide-and-conquer algorithm using reachability information, which is amenable to parallelization. The expected run time for our algorithm is $O(|E|\lg |V|)$. Our work was motivated by the discrete ordinates method for modeling radiation transport. In this method, the object to be studied is modeled by finite elements. Each element is a vertex in our graph and an edge connects elements that share a face. The radiation equations are approximated by an angular discretization, and edges in the graph are directed to align with the angle.  The computations for an element can be performed only after those of all its predecessors. The numerical calculations must be modified if the graph has a cycle. Our algorithm detects these cycles to enable the computation to be modified.

Relevant Publications:

L. K. Fleischer, B. Hendrickson and A. Pinar, On Identifying Strongly Connected Components in Parallel, to Proc. of Irregular’2000 , Lecture Notes in Computer Science, Vol. 1586, pages 505--511, (postscript).

7. Supernova factory scheduler

Observing Type Ia supernovae can help us answer some fundamental questions about the nature of the universe. Type Ia supernovae are extraordinarily bright, remarkably uniform objects, which make excellent "candles" for measuring the expansion rate of the universe, and they remain bright enough only for a few weeks. For a complete experiment, each supernova should be observed several times at regular intervals during its lifetime, and the value of an experiment degrades significantly when the distance between subsequent observations is too long.  While we would like to collect as much data as possible, we have to do this with limited resources, i.e., a single telescope. This calls for carefully selecting a subset of the supernovae that can be observed regularly, and constructing a schedule for most effective utilization of the telescope, which is a combinatorial problem in nature, and nonlinearity provides accuracy and efficiency in formulation.

8. Compressing bitmap indices

Bitmap indexing, which is commonly used for  scientific databases, rely on variants of runlength encoding for a compact index structure.  Runlength encoding exploits uniform segments of data for compaction, and thus their performance  is directly affected by availability of such segments, which depend on the organization of the data.  We have proposed the tuple  reordering problem to reorganize database tuples for best run-length encoding performance.  The problem can be reduced to the  TSP problem, but the sizes of the databases make using standard TSP methods prohibitively expensive.  We proposed  the Gray-code ordering method, which is a linear-time, and in-place algorithm, making it easily applicable to databases, Our experiments on benchmark data sets showed the  compression ratio  could be improved by a factor of 2 to 10.

Relevant Publications:

G. Canahuate, H. Ferhatosmanoglu, and A. Pinar, Improving bitmap index compression by data reorganization, submitted to IEEE Transactions on Knowledge and Data Engineering. (pdf)

A. Pinar, T. Tao, and H. Ferhatosmanoglu, Compressing Bitmap Indices by Data Reorganization, Proc. 21st International Conference on Data Engineering (ICDE05), pages: 310--321. (pdf)

9. Compacting sequences to preserve transition frequencies

Simulation-based power estimation is commonly used in VLSI design for its high accuracy despite excessive computation times.  Techniques have been proposed to speed it up by compacting an input sequence while preserving its power-consumption characteristics. We proposed a novel method to compact a sequence that preserves transition frequencies.  We proved the problem is NP-complete, and proposed a graph model to reduce it to that of finding a heaviest-weighted trail, and a heuristic utilizing this model. Experiments showed that power dissipation can be estimated with an error of only 2.3\%, while simulation times were reduced by 10.

Relevant Publications:

A. Pinar and C.L. Liu, Power Invariant Vector Sequence Compaction for Combinational Circuits, ACM Transactions on Design Automation of Electronic Systems, Volume: 8, No: 2, pages 214 -- 221, (postscript). 

 A. Pinar and C.L. Liu, Power Invariant Vector Sequence Compaction, Proceedings of 1998 IEEE/ACM International Conference on Computer Aided Design, pages: 473-476, Nov, 1998. (postscript)