-----------------------------------------
Release Notes for Trilinos Package Tpetra
-----------------------------------------

Trilinos 11.4:
--------------

* Performance improvements to fillComplete (CrsGraph and CrsMatrix)

* Performance improvements to Map's global-to-local index conversions

* MPI performance optimizations

Methods that perform communication between (MPI) processes do less
communication than before.  This should improve performance,
especially for large process counts, of the following operations:

  - Creating a Map
  - Creating an Import or Export communication plan
  - Executing an Import or Export (e.g., in a distributed sparse
    matrix-vector multiply, or in global finite element assembly)
  - Calling fillComplete() on a CrsGraph or CrsMatrix

* Restrict a Map's communicator to processes with nonzero elements,
  and apply the result to a distributed object

Map now has two new methods.  The first, removeEmptyProcesses(),
returns a new Map with a new communicator, which contains only those
processes which have a nonzero number of entries in the original Map.
The second method, replaceCommWithSubset(), returns a new Map whose
communicator is an arbitrary subset of processes of the original Map's
communicator.  Distributed objects (subclasses of DistObject) also
have a new removeEmptyProcessesInPlace() method, for applying in place
the new Map created by calling removeEmptyProcesses() on the original
Map over which the object was distributed.

These methods are especially useful for algebraic multigrid.  At
coarser levels of the multigrid hierarchy, it is helpful for
performance to "rebalance" the matrices at those levels, so that a
subset of processes share the elements.  This leaves the remaining
processes without any elements.  Excluding them from the communicator
reduces the cost of all-reduces and other communication operations
necessary for creating the coarser levels of the hierarchy.

* CrsMatrix: Native SOR and Gauss-Seidel kernels

These kernels improve the performance of Ifpack2 and MueLu.
Gauss-Seidel is a special case of SOR (Successive Over-Relaxation).
See the documentation of Ifpack2::Relaxation for details on the
algorithm, which is actually a "hybrid" of Jacobi between MPI
processes, and SOR (or Gauss-Seidel) within an MPI process.  The
kernels also include the "symmetric" variant (forward and backward
sweeps) of SOR and Gauss-Seidel.
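
As a rough illustration of the local part of this algorithm (not
Tpetra's actual kernel), the following self-contained C++ sketch
performs SOR sweeps on a small matrix in compressed sparse row format.
The CsrMatrix struct and the functions sorSweep() and
symmetricSorSweep() are hypothetical names chosen for this example.
The sketch assumes a single process, so the Jacobi coupling between
MPI processes does not appear.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy compressed sparse row (CSR) matrix.  Hypothetical type for this
// sketch; it is not a Tpetra or Kokkos class.
struct CsrMatrix {
  std::vector<std::size_t> rowPtr;  // row i occupies [rowPtr[i], rowPtr[i+1])
  std::vector<std::size_t> colInd;  // column index of each stored entry
  std::vector<double> val;          // value of each stored entry
};

// One SOR sweep for A x = b, updating x in place.  Rows are visited in
// increasing order if forward is true, else in decreasing order.
// omega == 1.0 reduces the update to Gauss-Seidel.
void sorSweep(const CsrMatrix& A, const std::vector<double>& b,
              std::vector<double>& x, double omega, bool forward) {
  assert(!A.rowPtr.empty());
  const std::size_t numRows = A.rowPtr.size() - 1;
  assert(b.size() == numRows && x.size() == numRows);
  for (std::size_t n = 0; n < numRows; ++n) {
    const std::size_t i = forward ? n : numRows - 1 - n;
    double diag = 0.0;
    double rest = b[i];  // b_i minus the off-diagonal terms of row i
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      const std::size_t j = A.colInd[k];
      if (j == i) {
        diag += A.val[k];
      } else {
        rest -= A.val[k] * x[j];  // x[j] is already updated if visited earlier
      }
    }
    assert(diag != 0.0);  // SOR needs a nonzero diagonal entry in every row
    x[i] = (1.0 - omega) * x[i] + omega * rest / diag;
  }
}

// "Symmetric" variant: a forward sweep followed by a backward sweep.
void symmetricSorSweep(const CsrMatrix& A, const std::vector<double>& b,
                       std::vector<double>& x, double omega) {
  sorSweep(A, b, x, omega, true);
  sorSweep(A, b, x, omega, false);
}
```

Calling sorSweep() with omega = 1.0 gives a plain Gauss-Seidel sweep.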

* CrsMatrix: Precompute and reuse offsets of diagonal entries

The (existing) one-argument version of CrsMatrix's getLocalDiagCopy()
method requires the following operations per row:

  1. Convert current local row index to global, using the row Map
  2. Convert global index to local column index, using the column Map
  3. Search the row for that local column index
    
Precomputing the offsets of diagonal entries and reusing them skips
all these steps.  CrsMatrix has a new method getLocalDiagOffsets() to
precompute the offsets, and a two-argument version of
getLocalDiagCopy() that uses the precomputed offsets.  The precomputed
offsets are not meant to be used in any way other than to be given to
the two-argument version of getLocalDiagCopy().  They must be
recomputed whenever the structure of the sparse matrix changes (by
calling insertGlobalValues() or insertLocalValues()) or is optimized
(e.g., by calling fillComplete() for the first time).
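
The idea can be sketched on a toy compressed sparse row matrix.  This
is only an illustration under stated assumptions, not Tpetra's
implementation: the CsrMatrix struct, kNoDiagonal, computeDiagOffsets()
and copyDiagonal() are hypothetical names, and the sketch assumes a
single process where row and column indices coincide, so that only
the per-row search (step 3 above) remains to be skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy compressed sparse row (CSR) matrix.  Hypothetical type for this
// sketch; it is not a Tpetra or Kokkos class.
struct CsrMatrix {
  std::vector<std::size_t> rowPtr;  // row i occupies [rowPtr[i], rowPtr[i+1])
  std::vector<std::size_t> colInd;  // column index of each stored entry
  std::vector<double> val;          // value of each stored entry
};

// Marks a row that stores no diagonal entry.
const std::size_t kNoDiagonal = static_cast<std::size_t>(-1);

// Search each row once and record where its diagonal entry sits,
// as an offset from the start of the row.
std::vector<std::size_t> computeDiagOffsets(const CsrMatrix& A) {
  assert(!A.rowPtr.empty());
  const std::size_t numRows = A.rowPtr.size() - 1;
  std::vector<std::size_t> offsets(numRows, kNoDiagonal);
  for (std::size_t i = 0; i < numRows; ++i) {
    for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
      if (A.colInd[k] == i) {
        offsets[i] = k - A.rowPtr[i];
        break;
      }
    }
  }
  return offsets;
}

// Copy the diagonal using precomputed offsets: no search per row.
// Rows without a stored diagonal entry yield 0.0.
std::vector<double> copyDiagonal(const CsrMatrix& A,
                                 const std::vector<std::size_t>& offsets) {
  assert(offsets.size() + 1 == A.rowPtr.size());
  std::vector<double> diag(offsets.size(), 0.0);
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    if (offsets[i] != kNoDiagonal) {
      diag[i] = A.val[A.rowPtr[i] + offsets[i]];
    }
  }
  return diag;
}
```

In this sketch the offsets stay valid as long as only the values
change.  Any change to rowPtr or colInd requires calling
computeDiagOffsets() again.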

* CrsGraph,CrsMatrix: Added "No Nonlocal Changes" parameter to
  fillComplete()

The fillComplete() method accepts an optional ParameterList which
controls the behavior of fillComplete(), as opposed to behavior of the
object in general.  "No Nonlocal Changes" is a bool parameter which is
false by default.  Its value must be the same on all processes in the
graph or matrix's communicator.  If the parameter is true, the caller
asserts that no entries were inserted in nonowned rows.  This lets
fillComplete() skip the global communication that checks whether any
processes inserted any entries in nonowned rows.

* Default Kokkos/Tpetra Node type is now Kokkos::SerialNode

NOTE: This change breaks backwards compatibility.

Users expect that Tpetra by default uses "MPI only" for parallelism,
rather than "MPI plus threads."  These users were therefore
experiencing unexpected performance issues when the default Kokkos
Node type was threaded, as was the case if Trilinos' support for any
of the threading libraries (Pthreads, TBB, OpenMP) was enabled.  Trilinos
detects and enables support for Pthreads automatically on many
platforms.  Therefore, after some discussion among Kokkos and Tpetra
developers, we decided to change the default Kokkos Node type (and
therefore, the default Node used by Tpetra objects) to
Kokkos::SerialNode.  To override this default, pass the following
option to CMake when configuring Trilinos:

-D KokkosClassic_DefaultNode:STRING="<node-type>" 

where <node-type> is any of the official Kokkos Node types, such as the
following:
- Kokkos::SerialNode (current default) 
- Kokkos::TBBNode
- Kokkos::TPINode
- Kokkos::OpenMPNode


Trilinos 11.0: 
--------------
* Significant performance improvements to local sparse matrix-vector multiply on CPU nodes. 
* Removed all deprecated methods.


Trilinos 10.12:
---------------

* Major (backwards-compatible, internal) refactor of the interaction between Tpetra::CrsGraph/CrsMatrix
  and their LocalSparseOps template parameter.
* Removed generic kernels for GPU nodes; GPU sparse kernel support now provided by CUSPARSE library; requires CUDA 4.1
* Additional methods in the Reduction/Transformation Interface (RTI); examples in tpetra/examples/MultiPrec
* Fixed major bugs in Tpetra Import/Export
* Minor bug fixes and documenting tests
* Numerous improvements to documentation
* Better MatrixMarket support in tpetra/util
* Added the ability to construct a Tpetra::Vector/MultiVector using user data (host-based nodes only)
* Deprecated: fillComplete(OptimizeStorageOption) on Tpetra::CrsGraph and Tpetra::CrsMatrix, in favor of a ParameterList.


Trilinos 10.7:
--------------

* Added (experimental) Reduction/Transformation Interface (RTI) to tpetra/rti; examples in tpetra/examples/RTInterface

Trilinos 10.6.4:
----------------

* Fixed some bugs in the build system
* Updates to support CUDA 4.0 and built-in Thrust

Trilinos 10.6.1:
----------------

* Added new HybridPlatform examples, under tpetra/examples/HybridPlatform. Anasazi and Belos examples are currently not built, though they are functional.
* Added new MultiVector GEMM tests, to evaluate potential interference of TPI/TBB threads and a threaded BLAS, to tpetra/test/MultiVector.
* Added Tpetra timers to Anasazi and Belos adaptors.
* Added test/documentation build of Tpetra::CrsMatrix against KokkosExamples::DummySparseKernelClass
* Fixed some bugs, added some bug verification tests, disabled by default.

Trilinos 10.6:
--------------

Significant internal changes in Tpetra for this release, mostly centered around
the CrsMatrix class. Lots of new features centering around multi-core/GPUs did
not make it in this release; look for more development in 10.6.1.

* Lots of additional documentation, testing and examples in Tpetra.
* Imported select Teuchos memory management classes/methods into the Tpetra namespace.
* Updates to the Anasazi/Tpetra adaptors for efficiency, node-awareness and debugging.
* Minor bug fixes, warnings addressed.

Changes breaking backwards compatibility:
* Tpetra CRS objects (i.e., CrsGraph and CrsMatrix) are required to be "fill-active" in order to be modified.
  Furthermore, they are required to be "fill-complete" in order to call multiply/solve.
  The transition between these states is mediated by the methods fillComplete() and resumeFill().
  This will only affect users that modify a matrix after calling fillComplete().
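
The two states can be pictured with a toy class.  This is a minimal
sketch, not Tpetra code: ToyCrsObject, insertValue(), and sumOfValues()
are hypothetical, with sumOfValues() standing in for multiply/solve.
The sketch assumes that a newly created object starts out fill-active
and that a disallowed call throws an exception.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Toy stand-in for a CRS object with the two fill states.
class ToyCrsObject {
public:
  bool isFillActive() const { return fillActive_; }
  bool isFillComplete() const { return !fillActive_; }

  // Modification is only allowed in the fill-active state.
  void insertValue(double v) {
    if (!fillActive_) {
      throw std::runtime_error("insertValue: call resumeFill() first");
    }
    values_.push_back(v);
  }

  void fillComplete() { fillActive_ = false; }  // fill-active -> fill-complete
  void resumeFill() { fillActive_ = true; }     // fill-complete -> fill-active

  // Stand-in for multiply/solve: only allowed in the fill-complete state.
  double sumOfValues() const {
    if (fillActive_) {
      throw std::runtime_error("sumOfValues: call fillComplete() first");
    }
    double sum = 0.0;
    for (double v : values_) sum += v;
    return sum;
  }

private:
  bool fillActive_ = true;  // assumption: new objects start fill-active
  std::vector<double> values_;
};

// Modify, use, then modify again after fillComplete().
void exampleWorkflow() {
  ToyCrsObject A;
  A.insertValue(1.0);
  A.fillComplete();
  assert(A.isFillComplete());
  assert(A.sumOfValues() == 1.0);
  A.resumeFill();  // required before modifying again
  assert(A.isFillActive());
  A.insertValue(2.0);
  A.fillComplete();
  assert(A.sumOfValues() == 3.0);
}
```

Code that only fills an object once and then calls fillComplete()
never needs resumeFill().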

Newly deprecated functionality:
* CrsGraph/CrsMatrix persisting views of graph and matrix data are now
  deprecated. New, non-persisting versions of these are provided.


Trilinos 10.4:
--------------

The Trilinos release 10.4 came at an unfortunate time, as we were in the middle
of a medium refactor in Kokkos/Tpetra in order to better support GPU and
multicore nodes. Therefore, there has been some potential regression in performance
for GPU nodes; and some known issues regarding multi-core CPU performance (especially 
on NUMA platforms) have not been addressed. The rest of this refactor is likely to happen 
in the development branch, and will not be released until 10.6 (estimated for September 2010). 

Users that require access to this code should contact a Trilinos developer regarding access to the 
development branch repository. 

(*) Improvements to doxygen documentation.
- added ifdefs to support profiling/tracing of host-to-device memory transfers 
  These are enabled via CMake options:
  -D KokkosClassic_ENABLE_CUDA_NODE_MEMORY_PROFILING:BOOL=ON
  -D KokkosClassic_ENABLE_CUDA_NODE_MEMORY_TRACE:BOOL=ON

(*) VBR capability (experimental)
- added variable-block row matrix (VbrMatrix) and underlying support classes (BlockMap, BlockMultiVector)
- added power method example of VBR classes

(*) CrsMatrix:
- now implements DistObject, allowing import/export capability with CrsMatrix objects (experimental)
- combined LocalMatVec and LocalMatSolve objects into a single template parameter. (non BC)
  this required changes to CrsMatrixMultiplyOp and CrsMatrixSolveOp operators as well. (non BC)
- access default for this type via Kokkos::DefaultKernels
- removed cached views of object data. this should have no effect on CPU-based nodes, but will result in slower performance
  for GPU-based nodes. this regression is a result of the release happening mid-refactor. it will not be addressed in the 
  10.4.x sequence.
- bug fixes regarding complex cases involving user-specified column maps and graphs.

(*) DistObject interface:
- added createViews(), releaseViews() methods to allow host-based objects to temporarily cache views of host data during import/export procedure

(*) Map: 
- added new non-member constructors: createContigMap(), createWeightedContigMap(), createUniformContigMap()
- fixed some bugs regarding use of unsigned Ordinal types
- fixed MPI-stalling bug in getRemoteIndexList()

(*) MultiVector:
- added view methods offsetView() and offsetViewNonConst() to create a MultiVector view of a subset of rows
- added non-member constructor Tpetra::createMultiVector(map,numVecs)

(*) Vector:
- added non-member constructor Tpetra::createVector(map)

(*) Tpetra I/O:
- added Galeri-type methods for generating pedagogical matrices (currently, only 3D Laplacian)

(*) External adaptors (experimental)
- Efficiency improvements for Belos/Tpetra adaptors
- Brought Anasazi/Tpetra adaptors back online
