# Program

Fri, Oct 23 | Talks in 310 Banatao Auditorium, Sutardja Dai Hall | |

1:00-1:30 | Grey Ballard, Sandia | [slides] |

1:30-2:00 | Austin Benson, Stanford | |

2:00-2:30 | Erin Carson, NYU | |

2:30-3:00 | break | |

3:00-3:30 | Greg Henry, Intel | |

3:30-4:00 | Jeff Bilmes, UW | |

4:00-4:20 | David Garmire, U Hawaii | |

4:20-4:40 | Lana Garmire, U Hawaii | |

4:40-5:10 | break | |

5:10-5:40 | Tzu-Yi Chen, Pomona | [slides] |

5:40-6:10 | Ken Stanley, Du Bois Project of Oberlin | [slides] |

6:30-8:30 | Reception (1015 Evans Hall) | |

Sat, Oct 24 | Talks in 306 Soda Hall | |

9:00-9:30 | Jiawang Nie, UCSD | |

9:30-10:00 | Edgar Solominik, ETH Zurich | [slides] |

10:00-10:30 | Michael Overton, NYU | |

10:30-11:00 | break | |

11:00-11:30 | Ren-Cang Li, UT Arlington | [slides] |

11:30-12:00 | Cho-Jui Hsieh, UC Davis | |

12:00-2:00 | lunch | |

2:00-2:30 | Laura Grigori, INRIA | |

2:30-3:00 | E. Jason Riedy, GA Tech | [slides] |

3:00-3:30 | Richard Vuduc, GA Tech | |

3:30-4:00 | break | |

4:00-4:30 | Alan Edelman, MIT | |

4:30-5:00 | Joel Tropp, Caltech | [slides] |

6:00-9:00 |
Banquet (Faculty Club) Horst Simon, LBNL |
[slides] |

## Speakers

- Grey Ballard, Sandia
- Austin Benson, Stanford
- Jeff Bilmes, UW
- Erin Carson, NYU
- Tzu-Yi Chen, Pomona
- Alan Edelman, MIT
- David Garmire, U Hawaii
- Lana Garmire, U Hawaii
- Laura Grigori, INRIA
- Greg Henry, Intel
- Cho-Jui Hsieh, UC Davis
- Ren-Cang Li, UT Arlington
- Jiawang Nie, UCSD
- Michael Overton, NYU
- E. Jason Riedy, GA Tech
- Horst Simon, LBNL
- Edgar Solominik, ETH Zurich
- Ken Stanley, Du Bois Project of Oberlin
- Joel Tropp, Caltech
- Richard Vuduc, GA Tech

## Abstracts

### Parallel Tensor Compression for Large-Scale Scientific Data

#### Grey Ballard, Sandia

#### [slides]

As high-performance computing trends towards the exascale, scientific data produced by high fidelity simulations in fields such as combustion science are becoming massive, with even modest simulations producing terabytes of data. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8TB of data. By viewing the data as a dense five-way tensor, we can compute a Tucker tensor approximation to find inherent low-dimensional multilinear structure, achieving large compression ratios with negligible loss in accuracy. I will present a distributed-memory parallel algorithm and implementation for approximating general dense tensors using a Tucker decomposition. The kernel computations of the algorithm correspond to linear algebra operations on various views of the data. I’ll show results for compression ratios and approximation errors on combustion simulation data, and I’ll discuss run time results, including the key performance-tuning parameters within the algorithm.

### Network clustering with higher-order structures

#### Austin Benson, Stanford

Networks are a common tool in the physical, biological, and social sciences to describe entities (nodes) and the relationships (links) between them. Two of the fundamental analyses of networks are clustering (partitioning, community detection) and the frequency of network motifs, or patterns of links between nodes. However, these analyses are disjoint as community structure is typically defined by link relationships and ignore motifs. Here, we develop spectral algorithms to find network community structure defined by motifs. We show several examples of motif-based communities in networks from a variety of scientific domains.

### A PHiPAC Retrospective

#### Jeff Bilmes, UW

PHiPAC was an early attempt, in the context of matrix multiplication, to improve software performance by searching in a large design space of possible implementations to find the best one. In this talk, we will discuss the origins of PHiPAC, including some of the decision decisions, steps, and missteps that, for better or for worse, were made during the early stages of the project.

Joint work with Krste Asanovic, Chee-Whye Chin, and Jim Demmel

### Communication-Avoiding Krylov Methods in Theory and Practice

#### Erin Carson, NYU

Solvers for sparse linear algebra problems, ubiquitous throughout scientific codes, often limit application performance due to a low computation/communication ratio. In this talk, we discuss the development of communication-avoiding Krylov subspace methods, which can achieve performance improvements in a wide variety of scientific applications. We discuss complex tradeoffs that arise due to both machine parameters and the numerical properties of the problem, and show how an improved understanding of convergence rate and attainable accuracy in finite precision can be used to develop methods that both avoid communication and meet application-specific numerical constraints. We conclude with recommendations on when we expect these communication-avoiding variants to be beneficial in practice.

### On communities: algorithms, experiments, and experiences

#### Tzu-Yi Chen, Pomona

#### [slides]

### Julia, High Performance Computing, and Linear Algebra

#### Alan Edelman, MIT

Jim Demmel inspires us all by showing that one can live a life working simultaneously on high performance computing and serious mathematics. Jim never seemed to care about the differences between computer science, computational science, and mathematics. One of the key lessons that I learned from Jim is that if you are going to work on software, it really should be widely used, and if it is widely used, it should be of high quality.

The Julia computing language takes what we believe to be some of the best practices in computer science and computational science to bring together a fresh approach to technical computing. Julia solves various versions of the two language problem in that there is no need to translate Julia into another language for speed or deployment, and since Julia is written primarily in Julia, it is more than only open source, it is openly transparent. This makes Julia popular for scientists and regulators.

Julia’s use is doubling every nine months (“Moore’s law” style growth) by any number of statistics, including the use of juliabox.org (try it right now!) in classrooms around the world. In this talk, we will describe Julia, consider open problems in high performance computing, and describe the benefits Julia brings to numerical linear algebra.

This is joint work with the MIT Julia Group, some now at Julia Computing Inc., and Julia developers worldwide.

### Application of Linear Algebra to Making

#### David Garmire, U Hawaii

Making has seen an explosion of interest in the public. The growth since 2010 has been 20-30% per year likely reaching an O($10 billion) by 2017. This talk investigates where matrix linear algebra fits into Making and how it could accelerate the growth of the industry. In turn, it also investigates how Making may help accelerate our understanding of matrices at an early age.

### Integration of Big Data in Cancer Research

#### Lana Garmire, U Hawaii

With the rapid applications of high throughput technologies such as next generation sequencing as well as the huge amount of publicly available high throughput data, the translational cancer informatics field is granted with bountiful opportunities and yet faced with big challenges at the same time. It is particularly important to address the issues of detecting multiple types of biomarkers arising from different high-throughput platforms. In this talk, I will elaborate my research projects that aim to integrate the transcriptomic (mRNA, microRNA and lincRNA) profiles for cancer diagnosis and prognosis prediction and explore where matrix linear algebra is used in these pipelines.

### Communication avoiding algorithms: from early ideas to recent progress

#### Laura Grigori, INRIA

One of the major challenges we face in high performance computing is the increased communication cost with respect to computation cost. In this talk we will review work that we have started in 2008, on linear algebra algorithms that minimize communication, in which the communication problem is addressed directly at the numerical algorithmic design. These communication avoiding algorithms supersede previous approaches to the communication problem based either on scheduling or tuning solutions.

We will review some of the early communication avoiding algorithms for dense linear algebra as LU or QR factorizations and discuss some recent progress as computing low rank approximations for sparse matrices.

### Software Design for Small Matrices

#### Greg Henry, Intel

Small matrix computations run fastest when they have hand-tuned algorithms just for an application’s specific sizes. While “border” computations never matter for larger problems, for smaller problems they are critical but also far too numerous to include in a library every possible small case. Since there are too many small cases to cover exhaustively, many libraries compromise performance to create a smaller and more manageable foot-print. Sometimes the resulting “compromised” performance is between 0.1 and 0.5 times what’s possible. In this work, we present a high performance implementation of small matrix BLAS for the latest Intel architectures. We show a new way of looking at math libraries and software design to dynamically achieve top performance without this unmanageable foot-print problem, and show how other statically-built solutions (such as exist in Intel® MKL, ATLAS, BLIS, OpenBLAS) are missing performance opportunities.

### Parallel Asynchronous Stochastic Coordinate Descent with Auxiliary Variables

#### Cho-Jui Hsieh, UC Davis

Stochastic Coordinate Descent (SCD) is now widely used in many machine learning applications. Interestingly, in the most popular SCD algorithms and implementations, a set of auxiliary variables has to be maintained to facilitate efficient computation of each update. In this paper, we study parallel asynchronous SCD for a general family of functions where coordinate descent can be accelerated by maintaining a set of auxiliary variables. Each thread repeatedly selects a random variable and performs efficient updates on both the target and auxiliary variables that are stored in shared memory. Our result provides a theoretical guarantee for many existing parallel coordinate descent solvers. Experimental results show that our methods are much faster than previous parallel coordinate descent solvers.

### The Big Bang Theory in High Accuracy Computations

#### Ren-Cang Li, UT Arlington

#### [slides]

Ideally, each computed number commits an error of no more than half unit in its last place (ulp). But in reality, that’s not going to happen. The next best thing is for the error to within a few ulps. That did not seem likely, either, until the big bang:

J. Demmel and W. Kahan. Accurate singular values of bidiagonal
matrices. *SIAM J. Sci. Statist. Comput.*, 11:873–912, 1990.

This talk will take a deep look at the big bang and its aftermath.

### Real eigenvalues of nonsymmetric tensors

#### Jiawang Nie, UCSD

We discuss the computation of real Z-eigenvalues and H-eigenvalues of nonsymmetric tensors. A general nonsymmetric tensor has finitely many Z-eigenvalues, while there may be infinitely many ones for special tensors. In the contrast, every nonsymmetric tensor has finitely many H-eigenvalues. We propose Lasserre type semidefinite relaxation methods for computing such eigenvalues. For every nonsymmetric tensor that has finitely many real Z-eigenvalues, we can compute all of them; each of them can be computed by solving a finite sequence of semidefinite relaxations. For every nonsymmetric tensor, we can compute all its real H-eigenvalues; each of them can be computed by solving a finite sequence of semidefinite relaxations. This is a joint work with Xinzhen Zhang.

### Investigation of Crouzeix's Conjecture via Nonsmooth Optimization

#### Michael Overton, NYU

M. Crouzeix’s 2004 conjecture concerns the relationship between $|p|_{W(A)}$, the norm of a polynomial $p$ on $W(A)$, the field of values of a matrix $A$, and $|p(A)|_2$, the operator norm of the matrix $p(A)$. We use nonsmooth optimization to investigate the conjecture numerically, using the BFGS (Broyden-Fletcher-Goldfarb-Shanno) method to search for local minimizers of the ``Crouzeix ratio” $|p|_{W(A)}/|p(A)|_2$ and Chebfun to compute the boundary of the field of values. The conjecture states that the globally minimal value of the Crouzeix ratio is 1/2. We present numerical results that lead to some theorems and further conjectures about globally and locally minimal values of the Crouzeix ratio when varying only $A$ (of given order, with $p$ fixed) or varying only $p$ (of given degree, with $A$ fixed), as well as locally minimal values of the ratio when minimizing over $p$ and $A$. All the computations strongly support the truth of Crouzeix’s conjecture.

This is joint work with Anne Greenbaum and Adrian Lewis.

### Graph Analysis Beyond Linear Algebra

#### E. Jason Riedy, GA Tech

#### [slides]

High-performance graph analysis is unlocking knowledge in computer security, bioinformatics, social networks, and many other data integration areas. Graphs provide a convenient abstraction for many data problems beyond linear algebra. Some problems map directly to linear algebra. Others, like community detection, look eerily similar to sparse linear algebra techniques. And then there are algorithms that strongly resist attempts at making them look like linear algebra. This talk will cover recent results with an emphasis on streaming graph problems where the graph changes and results need updated with minimal latency. We’ll also touch on issues of sensitivity and reliability where graph analysis needs to learn from numerical analysis and linear algebra.

### Six Books

#### Horst Simon, LBNL

#### [slides]

### Communication Lower Bounds for Numerical Tensor Algebra

#### Edgar Solominik, ETH Zurich

#### [slides]

To make way through a dark passage, one first feels for the walls. To find efficient parallel algorithms, one first derives communication lower bounds. The cost tradeoffs and optimality gaps revealed by lower bounds paint the contours of the algorithmic search space. I will present new lower bounds on communication cost for symmetric tensor contractions and on synchronization cost for dense-matrix factorizations and stencil computations. I will also discuss how both of these two lower bound analyses lead to the discovery of improved algorithms.

### Making math fun for math-phobic children

#### Ken Stanley, Du Bois Project of Oberlin

#### [slides]

The United States lags many developing countries in educational performance, particularly in our education of children of underrepresented minorities and children from families of lower socio-economic status. Local funding for schools is part of the problem as students from low income families disproportionately attend poorly funded schools, however, local funding does not explain why these children perform poorly in all schools. In Oberlin, Ohio children from all backgrounds attend the same K-12 schools, yet the achievement gap persists. Motivation, especially intrinsic motivation, drives educational achievement. The Du Bois Project helps underrepresented minorities and children from low socioeconomic families achieve and maintain excellence in the Oberlin Public Schools by helping them learn to enjoy math. We will explain how we use running, music, simple psychology and motivated Oberlin College students to make math fun.

### Finding structure with randomness

#### Joel Tropp, Caltech

#### [slides]

Computer scientists have long known that randomness can be used to improve the performance of algorithms. A familiar application is the process of dimension reduction, in which a random map transports data from a high-dimensional space to a lower-dimensional space while approximately preserving some geometric properties. By operating with the compact representation of the data, it is possible to produce approximate solutions to certain large problems very efficiently.

It has been observed that dimension reduction has powerful applications in numerical linear algebra and numerical analysis. This talk will discuss randomized techniques for constructing standard matrix factorizations, such as the truncated singular value decomposition. In practice, the algorithms are so effective that they compete with—or even outperform—classical algorithms. These methods are already making a significant impact in large-scale scientific computing and learning systems.

Joint work with P.-G. Martinsson and N. Halko.

### Algorithms in the Time of Energy and Power †

#### Richard Vuduc, GA Tech

This talk considers the following question: Do we need *new*
principles of algorithm design, if we wish to minimize energy
(Joules) or reduce power (Watts), instead of–or in addition
to–time and space? Bring your votes for a straw poll that I will
conduct before I offer you my answer. I will also reflect on how I
arrived at the question (and answer) “under the influence” of Jim
Demmel and a few of his other rabble-rousing associates.

† With profuse apology to Gabriel Garcia Marquez.