A custom-built machine for computational experiments

Gagarine Yaikhom

10 May 2024, Friday

Abstract

The following are the specifications of my custom-built machine. The plan is to do development and testing on it, and perhaps use the cloud for scaling.

Summary

Processor: Intel Core i7-14700F
- SSE4.1, SSE4.2, AVX2
- Total Cores 20
  - Performance cores: 8
  - Efficient cores: 12
- Total Threads: 28
- Max Turbo Frequency: 5.4 GHz
- Performance-core Max Turbo Frequency: 5.3 GHz
- Efficient-core Max Turbo Frequency: 4.2 GHz
- Performance-core Base Frequency: 2.1 GHz
- Efficient-core Base Frequency: 1.5 GHz
- Cache: 33 MB Intel Smart Cache
- Total L2 Cache: 28 MB
Memory: Corsair DDR5 Vengeance Black
- Non-ECC Unbuffered
- CAS latency: CL40
- XMP: 3.0
- SPD Latency: 40-40-40-77
- SPD Speed: 4800MHz
- Memory speed: 5600MT/s
- Total RAM: 96GB (2 x 48GB)
Motherboard: MSI PRO Z790-A MAX WIFI, Intel Z790
Graphics card: EVGA GeForce RTX 3080 Ti XC3 ULTRA GAMING
- GPU Name: GA102
- GPU Variant: GA102-225-A1
- Architecture: Ampere
- CUDA Capability: 8.6
- CUDA cores: 10,240 (128 CUDA cores per SM)
- SM count: 80
- RT cores: 80
- Tensor cores: 320
- NVIDIA driver: 555.42.02
- CUDA Version: 12.5
- L1 Cache: 128 KB (per SM)
- L2 Cache: 6 MB
- Boost Clock: 1.67GHz
- Base Clock: 1.37GHz
- Memory Specs: 12 GB GDDR6X
- Memory Interface Width: 384-bit
- Tensor Cores: 3rd Generation
- PCI Express Gen 4: Yes
- Installed on PCIe x16 slot.
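
Some of the CPU and memory figures above can be cross-checked with simple arithmetic. The sketch below makes two assumptions that the list does not state: only the performance cores run two threads each, and the two DIMMs operate in dual-channel mode with a 64-bit data path per channel. The variable names are illustrative.

```python
# Cross-check derived CPU and memory figures from the summary list.
# Assumptions (not stated in the list):
#   - performance cores run 2 threads each, efficient cores run 1;
#   - the two DIMMs run in dual-channel mode, 64 bits (8 bytes) per channel.

P_CORES, E_CORES = 8, 12
DIMMS, DIMM_GB = 2, 48
MEMORY_MT_PER_S = 5600      # memory speed in mega-transfers per second
BYTES_PER_TRANSFER = 8      # one 64-bit channel moves 8 bytes per transfer
CHANNELS = 2                # assumed dual-channel configuration

total_cores = P_CORES + E_CORES
total_threads = 2 * P_CORES + E_CORES
total_ram_gb = DIMMS * DIMM_GB

# Theoretical peak DRAM bandwidth in GB/s (1 GB = 10^9 bytes).
peak_bandwidth_gb_s = MEMORY_MT_PER_S * BYTES_PER_TRANSFER * CHANNELS / 1000

print(f"cores={total_cores} threads={total_threads} ram={total_ram_gb} GB")
print(f"theoretical peak DRAM bandwidth: {peak_bandwidth_gb_s:.1f} GB/s")
```

The core, thread and RAM totals agree with the list. The bandwidth figure is an upper bound under the stated assumptions, not a measurement.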

CPU cache structure

$ sudo dmidecode -t cache

L1 Cache
- Operational Mode: Write Back
- Installed Size: 768 kB
- Maximum Size: 768 kB
- Error Correction Type: Parity
- Associativity: 8-way Set-associative
L2 Cache
- Operational Mode: Write Back
- Installed Size: 12 MB
- Maximum Size: 12 MB
- Error Correction Type: Single-bit ECC
- Associativity: 16-way Set-associative
L3 Cache
- Configuration: Enabled, Not Socketed, Level 3
- Operational Mode: Write Back
- Installed Size: 33 MB
- Maximum Size: 33 MB
- Error Correction Type: Multi-bit ECC
- Associativity: Other
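
dmidecode reports what the firmware's SMBIOS tables declare. As a complementary view, the Linux kernel exposes the caches seen by each logical CPU through sysfs. The following is a minimal sketch, assuming a Linux system with the usual /sys/devices/system/cpu/cpuN/cache/indexM layout. The function name read_cpu_caches and its parameters are illustrative and not part of any existing tool.

```python
from pathlib import Path

# Attribute files that the kernel typically provides for each cache index.
FIELDS = ("level", "type", "size", "ways_of_associativity", "shared_cpu_list")


def read_cpu_caches(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return one dict per cache visible to the given logical CPU.

    sysfs_root is a parameter so the function can be pointed at a
    different directory tree (for example, a copy taken from another host).
    """
    caches = []
    cache_dir = Path(sysfs_root) / f"cpu{cpu}" / "cache"
    for index_dir in sorted(cache_dir.glob("index*")):
        entry = {}
        for field in FIELDS:
            path = index_dir / field
            # A field may be missing on some kernels; record None in that case.
            entry[field] = path.read_text().strip() if path.exists() else None
        caches.append(entry)
    return caches


if __name__ == "__main__":
    for cache in read_cpu_caches(cpu=0):
        print(cache)
```

On a hybrid processor such as this one, it may be worth querying one logical CPU from a performance core and one from an efficient core, since their caches are expected to differ.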

GPU compute capability

Summarised from the compute capability 8.6 details in the CUDA C++ Programming Guide, as they apply to the installed card.

A Streaming Multiprocessor (SM) consists of:

- 128 fp32 cores for single-precision arithmetic
- 2 fp64 cores for double-precision arithmetic
- 64 int32 cores for integer arithmetic
- 4 mixed-precision Third-Generation Tensor Cores supporting half-precision (fp16), __nv_bfloat16, tf32, sub-byte and double-precision (fp64) matrix arithmetic
- 16 special function units for single-precision floating-point transcendental functions
- 4 warp schedulers
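
The per-SM counts can be scaled to the whole card using the SM count (80) and boost clock (1.67 GHz) from the summary. The sketch below also estimates a theoretical peak fp32 rate, assuming every fp32 core completes one fused multiply-add (two floating-point operations) per clock at the boost frequency. That assumption is mine, and real kernels rarely sustain it.

```python
# Scale per-SM unit counts (compute capability 8.6) to the whole GPU.
SM_COUNT = 80
BOOST_CLOCK_GHZ = 1.67

PER_SM = {
    "fp32 cores": 128,
    "fp64 cores": 2,
    "int32 cores": 64,
    "tensor cores": 4,
    "special function units": 16,
    "warp schedulers": 4,
}

totals = {unit: count * SM_COUNT for unit, count in PER_SM.items()}

# Assumption: one fused multiply-add (2 FLOPs) per fp32 core per clock,
# with every core busy at the boost frequency.
peak_fp32_tflops = totals["fp32 cores"] * 2 * BOOST_CLOCK_GHZ / 1000

for unit, total in totals.items():
    print(f"{unit}: {total}")
print(f"theoretical peak fp32: {peak_fp32_tflops:.1f} TFLOPS")
```

The fp32 and tensor core totals agree with the 10,240 CUDA cores and 320 Tensor cores listed in the summary.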

An SM has:

- A unified data cache and shared memory with a total size of 128 KB.
- Shared memory is partitioned out of the unified data cache, and can be configured to various sizes. The remaining data cache serves as an L1 cache and is also used by the texture unit, which implements the addressing and data filtering modes described in the Texture and Surface Memory section of the CUDA C++ Programming Guide.

Global Memory:

- Global memory accesses are always cached in L2.
- Data that is read-only for the entire lifetime of the kernel can also be cached in the unified L1/texture cache.
- Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache.

Shared Memory:

- The amount of the unified data cache reserved for shared memory is configurable on a per-kernel basis. The unified data cache has a size of 128 KB, and the shared memory capacity can be set to 0, 8, 16, 32, 64 or 100 KB.
- Compute capability 8.6 allows a single thread block to address up to 99 KB of shared memory. Kernels that rely on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in using cudaFuncSetAttribute().
- The maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.
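
The sizing rules above can be captured in a small calculator. This is only a sketch of the arithmetic, not of how the CUDA driver actually chooses a configuration. It assumes that the reserved 1 KB is taken out of the selected shared memory capacity whenever a block uses shared memory, and that the smallest sufficient capacity is selected. The function name plan_shared_memory is hypothetical.

```python
# Shared memory rules for compute capability 8.6, as listed above.
UNIFIED_CACHE_KB = 128
CAPACITY_OPTIONS_KB = (0, 8, 16, 32, 64, 100)
MAX_PER_BLOCK_KB = 99
OPT_IN_THRESHOLD_KB = 48
RESERVED_KB = 1  # reserved for system use


def plan_shared_memory(per_block_kb):
    """Describe how a per-block shared memory request fits on one SM."""
    if not 0 <= per_block_kb <= MAX_PER_BLOCK_KB:
        raise ValueError(f"a block can address 0 to {MAX_PER_BLOCK_KB} KB")
    # Assumption: the reserved 1 KB comes out of the selected capacity
    # whenever the block uses any shared memory at all.
    needed_kb = per_block_kb + RESERVED_KB if per_block_kb > 0 else 0
    # Assumption: the smallest sufficient capacity is the one selected.
    capacity_kb = min(c for c in CAPACITY_OPTIONS_KB if c >= needed_kb)
    return {
        "capacity_kb": capacity_kb,
        # Whatever is not shared memory remains as L1/texture cache.
        "l1_texture_kb": UNIFIED_CACHE_KB - capacity_kb,
        # Above 48 KB the kernel must use dynamic shared memory and opt in
        # through cudaFuncSetAttribute().
        "needs_dynamic_opt_in": per_block_kb > OPT_IN_THRESHOLD_KB,
    }


if __name__ == "__main__":
    for request_kb in (0, 16, 48, 99):
        print(request_kb, plan_shared_memory(request_kb))
```

Under these assumptions, a larger shared memory request leaves less of the 128 KB unified cache for L1 and texture data, which is the trade-off to weigh when sizing per-block buffers.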
