16 special function units for single-precision floating-point
transcendental functions
4 warp schedulers.
An SM has:
A unified data cache and shared memory with a total size of 128
KB
Shared memory is partitioned out of the unified data cache, and can
be configured to various sizes. The remaining data cache serves as an L1
cache and is also used by the texture unit that implements the various
addressing and data filtering modes mentioned in Texture and Surface
Memory.
Global Memory:
Global memory accesses are always cached in L2.
Data that is read-only for the entire lifetime of the kernel can
also be cached in the unified L1/texture cache.
Data that is not read-only for the entire lifetime of the kernel
cannot be cached in the unified L1/texture cache.
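As an illustration of the read-only case, the following is a minimal sketch rather than a prescribed pattern. It assumes the `__ldg()` intrinsic and the `const ... __restrict__` qualifiers as the way to request a read-only load; neither is named in the text above. The kernel name `scale` and the buffer names are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: `in` is never written while the kernel runs, so it is
// read-only for the kernel's entire lifetime.
__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __ldg() requests the load through the read-only data cache path.
        out[i] = factor * __ldg(&in[i]);
    }
}

int main()
{
    const int n = 1 << 20;
    float* in = nullptr;
    float* out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(in, out, 2.0f, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("out[0] = %f\n", out[0]);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The output buffer `out` is written by the kernel, so it is accessed with ordinary stores and is not a candidate for this path.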
Shared Memory:
The amount of the unified data cache reserved for shared memory
is configurable on a per-kernel basis. Out of the 128 KB unified data
cache, the shared memory capacity can be set to 0, 8, 16, 32, 64, or
100 KB.
A single thread block can address up to 99 KB of shared
memory. Kernels that rely on shared memory allocations over 48 KB per
block are architecture-specific and must use dynamic shared memory
rather than statically sized shared memory arrays. These kernels require
an explicit opt-in through cudaFuncSetAttribute().
Note that the maximum amount of shared memory per thread block is
smaller than the maximum shared memory partition available per SM. The 1
KB of shared memory not made available to a thread block is reserved for
system use.
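The shared memory configuration and opt-in described above can be sketched as follows. This is an illustrative sketch, not a definitive implementation: the kernel name `fillShared`, the helper `check`, and the 64 KB request are assumptions chosen for the example. The attributes `cudaFuncAttributePreferredSharedMemoryCarveout` and `cudaFuncAttributeMaxDynamicSharedMemorySize` are assumed to be the relevant arguments to cudaFuncSetAttribute(), since the text above does not name them. The carveout is a preference only, and the driver may choose a different configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: fills a dynamically sized shared memory buffer and
// writes its last element back to global memory.
__global__ void fillShared(float* out, int n)
{
    extern __shared__ float buffer[];  // size is supplied at launch time
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buffer[i] = static_cast<float>(i);
    __syncthreads();
    if (threadIdx.x == 0)
        out[0] = buffer[n - 1];
}

// Hypothetical helper: reports a failed runtime call.
static bool check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::printf("%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int bytes = 64 * 1024;  // illustrative request above the 48 KB default

    // Query the largest per-block allocation the device allows after opt-in.
    int maxOptin = 0;
    if (!check(cudaDeviceGetAttribute(&maxOptin,
                                      cudaDevAttrMaxSharedMemoryPerBlockOptin, 0),
               "cudaDeviceGetAttribute"))
        return 1;
    if (maxOptin < bytes) {
        std::printf("Device allows at most %d bytes per block\n", maxOptin);
        return 0;
    }

    // Preference only: ask for the largest shared memory partition.
    if (!check(cudaFuncSetAttribute(fillShared,
                                    cudaFuncAttributePreferredSharedMemoryCarveout,
                                    cudaSharedmemCarveoutMaxShared),
               "carveout preference"))
        return 1;

    // Explicit opt-in required for more than 48 KB of dynamic shared memory.
    if (!check(cudaFuncSetAttribute(fillShared,
                                    cudaFuncAttributeMaxDynamicSharedMemorySize,
                                    bytes),
               "dynamic shared memory opt-in"))
        return 1;

    float* out = nullptr;
    if (!check(cudaMalloc(&out, sizeof(float)), "cudaMalloc")) return 1;

    const int n = static_cast<int>(bytes / sizeof(float));
    fillShared<<<1, 256, bytes>>>(out, n);  // third argument: dynamic bytes
    if (!check(cudaDeviceSynchronize(), "kernel")) return 1;

    float result = 0.0f;
    if (!check(cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost),
               "cudaMemcpy"))
        return 1;
    std::printf("last element = %f\n", result);

    cudaFree(out);
    return 0;
}
```

Querying the opt-in limit first keeps the sketch portable: on a device that cannot provide the requested size, it reports the limit and exits instead of failing at launch.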