-----------------------------------------
Release Notes for Trilinos Package Kokkos
-----------------------------------------

Current (11.3)
--------------
* Non-backwards compatible change: Default Kokkos/Tpetra Node type is now Kokkos::SerialNode
  Users appear to expect Tpetra to be MPI-only by default. Until now, the
  default node was threaded whenever any of the threading libraries
  (pthreads, TBB, OpenMP) was enabled, and these users saw unexpected
  performance as a result. After some discussion among Kokkos/Tpetra
  developers, the default Kokkos node (and therefore the default node used
  by Tpetra objects) was changed to Kokkos::SerialNode. This can be
  overridden at configure time by passing the following option to CMake
  when configuring Trilinos:
    -D KokkosClassic_DefaultNode:STRING="node_type"
  where node_type is one of the official Kokkos nodes:
    Kokkos::SerialNode    (current default)
    Kokkos::TBBNode
    Kokkos::TPINode  
    Kokkos::OpenMPNode


Trilinos 11.0
------------------------------
* Non-backwards compatible change: row pointers for CRS objects are no longer size_t; instead, they are the same Ordinal type as the column indices
* Non-backwards compatible change: construction of Kokkos local graph objects requires specifying number of columns
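
As a rough illustration of the two changes above, the following standalone
C++ sketch stores CRS row pointers in the same Ordinal type as the column
indices and takes the number of columns at construction. SketchCrsGraph and
all of its members are hypothetical names chosen for this example; they are
not the Kokkos classes or their signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a local CRS graph; NOT the Kokkos class.
template <class Ordinal>
struct SketchCrsGraph {
  Ordinal numRows;
  Ordinal numCols;               // supplied at construction
  std::vector<Ordinal> rowPtrs;  // Ordinal rather than size_t; numRows + 1 entries
  std::vector<Ordinal> colInds;  // column indices, same Ordinal type

  SketchCrsGraph(Ordinal rows, Ordinal cols)
    : numRows(rows), numCols(cols),
      rowPtrs(static_cast<std::size_t>(rows) + 1, 0) {}

  // Append the column indices of one row; rows must be appended in order.
  void appendRow(Ordinal row, const std::vector<Ordinal>& cols) {
    assert(row >= 0 && row < numRows);
    for (Ordinal c : cols) {
      assert(c >= 0 && c < numCols);  // the column count bounds every index
      colInds.push_back(c);
    }
    rowPtrs[row + 1] = static_cast<Ordinal>(colInds.size());
  }

  // Number of stored entries in the given row.
  Ordinal numEntriesInRow(Ordinal row) const {
    return rowPtrs[row + 1] - rowPtrs[row];
  }
};
```

One consequence worth noting in a sketch like this: the Ordinal type must be
wide enough to hold the total number of stored entries, since the last row
pointer equals that count.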


Trilinos 10.12
------------------------------

* Major (backwards-compatible, internal) refactor of local sparse operators, esp. DefaultSparseOps
* Using generic sparse kernels only on host node; CUDA nodes use CUSPARSE. Adaptors for Cusp are work-in-progress.
* Added first-touch-allocation support for sparse matrices; should improve performance on NUMA nodes.
* Sparse matrix support is still largely a work in progress; more is expected for Trilinos 11.0
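
The first-touch idea mentioned above can be sketched generically as follows.
This is not the Kokkos implementation: firstTouchAlloc and sumEntries are
hypothetical helpers, and the sketch assumes an operating system that places
a memory page near the thread that first writes to it, plus an OpenMP-enabled
compiler (without OpenMP the pragmas are ignored and the code runs serially).

```cpp
#include <cassert>
#include <memory>

// Hypothetical helper, not part of Kokkos: allocate n doubles and have the
// threads that will later compute on the data write it first.
std::unique_ptr<double[]> firstTouchAlloc(long n) {
  assert(n >= 0);
  // new double[n] does not initialize the elements, so this thread does
  // not write to the new pages here.
  std::unique_ptr<double[]> data(new double[n]);
  double* raw = data.get();
  // A static schedule gives each thread a fixed contiguous range of indices.
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < n; ++i) raw[i] = 0.0;  // first touch
  return data;
}

// A later kernel that uses the same static schedule, so each thread mostly
// reads the range it touched during allocation.
double sumEntries(const double* x, long n) {
  double s = 0.0;
  #pragma omp parallel for schedule(static) reduction(+:s)
  for (long i = 0; i < n; ++i) s += x[i];
  return s;
}
```

The benefit, if any, depends on the allocation and the compute loops sharing
the same thread-to-index mapping, which is why both loops above use the same
static schedule.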


Trilinos 10.10
------------------------------

* Added OpenMP node (written by Radu Popescu)

Trilinos 10.8:
--------------

* New multidimensional array sub-package in kokkos/array subdirectory.
  The MDArrayView<Type,Device>, MultiVectorView<Type,Device>,
  and ValueView<Type,Device> template classes manage allocated
  memory on the 'Device' and perform compile-time selection of a
  device-optimal mapping of array's multi-index space to data members.
  Includes unit tests, performance tests, and example mini-applications.



Trilinos 10.6.4:
----------------

* Fixed some bugs in the build system
* Updates to support CUDA 4.0 and built-in Thrust

Trilinos 10.6.1:
----------------

* Fixed some ANSI/pedantic build warnings in the node tests.
* Added sync() method to all nodes, modified node timings test to use it, documented it in Kokkos Node API module. Added cudaSyncThreads() to native CUDA/Thrust timings. Broke node timings into multiple executables, to keep TBB and TPI from stepping on each other.
* Added native threading examples to Kokkos Node API tests for timing comparisons, kokkos/NodeAPI/test.
* Added ifdefed TPI_Block() and TPI_Unblock() around GEMM calls in Kokkos::DefaultArithmetic<TPINode>
* Added missing clearStatistics() method to CUDANodeMemoryModel

Trilinos 10.6:
--------------

Significant internal/external changes to both the Kokkos Node API and the
Kokkos Linear Algebra library. Most of these changes are centered around 
CrsGraph and CrsMatrix and their kernels. Some exciting developments regarding 
sparse mat-vec on multi-core/GPUs did not make it in this release; look for more
development in 10.6.1.

* Lots of additional documentation, testing and examples in Kokkos.
* Improved debugging in Kokkos Node API
* Added isHostNode static bool to all Kokkos nodes (false for ThrustGPUNode)
* Imported select Teuchos memory management classes/methods into the Kokkos namespace.
* Minor bug fixes, warnings addressed.

Changes breaking backwards compatibility:
* Kokkos CRS classes (i.e., CrsGraph and CrsMatrix) are now templated on the 
  sparse kernel operator, allowing specialization of the class data 
  according to the implementation of the kernel.
* Kokkos CRS classes now contain host-allocated memory, instead of node-allocated
  memory. This means that use of these buffers by the node will, in general, require
  a copy. 
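
To illustrate what templating a CRS class on the sparse kernel operator makes
possible, here is a minimal standalone sketch in which a partial
specialization changes the class data for one kernel type. All names
(HostKernelOps, DeviceKernelOps, SketchCrsMatrix, needsDeviceCopy) are
hypothetical and do not correspond to the actual Kokkos classes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical kernel tags; NOT the Kokkos sparse ops classes.
struct HostKernelOps {};
struct DeviceKernelOps {};

// Primary template: host-allocated CRS arrays used directly by the kernel.
template <class Scalar, class Ordinal, class SparseOps>
struct SketchCrsMatrix {
  std::vector<Ordinal> rowPtrs, colInds;
  std::vector<Scalar> values;
  static bool needsDeviceCopy() { return false; }
  Ordinal numRows() const {
    assert(!rowPtrs.empty());
    return static_cast<Ordinal>(rowPtrs.size() - 1);
  }
};

// Partial specialization: the same host-allocated arrays, plus extra state
// that only this kernel needs (here, a flag tracking a device-side copy).
template <class Scalar, class Ordinal>
struct SketchCrsMatrix<Scalar, Ordinal, DeviceKernelOps> {
  std::vector<Ordinal> rowPtrs, colInds;
  std::vector<Scalar> values;
  bool deviceCopyIsCurrent = false;  // host data must be copied before use
  static bool needsDeviceCopy() { return true; }
};
```

In this sketch the arrays live in host memory in both cases, and the
device-oriented specialization records whether a copy to the node is still
pending, mirroring the copy requirement described above.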


Trilinos 10.4:
--------------

(*) Change to sparse ops
- Combined DefaultSparseSolve and DefaultSparseMultiply into DefaultSparseOps
- Added new traits class, DefaultKernels, to specify default sparse and default block sparse objects
- Added new classes supporting Tpetra::VbrMatrix

(*) MultiVector ops
- Added the "inline" specifier to some methods that were missing it. This could (should?) improve performance for TPINode.

(*) CUDANodeMemoryModel
- support for tracing/profiling data movement between host and GPU

