03 June 2024, Monday
List of articles.
GPT-2 parallelisation: layer normalisation with SIMD (part 3)
SIMD parallelisation of layer normalisation.
GPT-2 parallelisation: embedding with CUDA (part 2)
Parallelisation strategies for token embedding and positional encoding using CUDA.
Statistical analysis of performance improvements due to optimisation techniques
How do you measure and compare application performance, and how can you be sure that a performance improvement attributed to a particular optimisation technique is not a fluke? How do you compare and choose between optimisation strategies?
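One standard way to answer the "fluke" question, assuming repeated timing runs of two variants A and B can be treated as independent samples (whether the article uses exactly this test is an assumption here), is a two-sample comparison of means such as Welch's t statistic:

    t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2 / n_A + s_B^2 / n_B}},

where \bar{x}, s^2 and n are the sample mean, sample variance and number of runs for each variant; a large |t| suggests the observed difference is unlikely to be measurement noise.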
GPT-2 parallelisation: embedding with SIMD (part 1)
First part of a series of experiments with GPT-2 internals and SIMD parallelisation strategies. We address how data enters a Transformer: tokenisation, token embedding and positional encoding.
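As a flavour of what the SIMD side of that looks like, here is a minimal sketch (not the article's actual code) that fuses token embedding and positional encoding for one token using AVX intrinsics; the names out, wte, wpe and the channel count C are illustrative assumptions, and C is assumed to be a multiple of 8.

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum the token embedding row and the positional encoding row
       into out; each row has C floats. Compile with -mavx. */
    void embed_token(float *out, const float *wte, const float *wpe,
                     int token, int pos, int C)
    {
        const float *t = wte + (size_t) token * C;  /* token embedding row */
        const float *p = wpe + (size_t) pos * C;    /* positional encoding row */
        for (int i = 0; i < C; i += 8) {
            __m256 a = _mm256_loadu_ps(t + i);      /* 8 floats of token embedding */
            __m256 b = _mm256_loadu_ps(p + i);      /* 8 floats of positional encoding */
            _mm256_storeu_ps(out + i, _mm256_add_ps(a, b));
        }
    }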
Comparing throughput of memory mapped file to buffered read
How does the throughput, measured in gigabytes per second (GB/s), of memory mapped files compare to that of buffered reads for a sequential access pattern?
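A minimal sketch of the two access paths being compared, assuming POSIX; error handling is omitted, and the byte checksum exists only to stop the compiler eliding the reads (the function names are illustrative, not the article's).

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Buffered read: copy the file through a user-space buffer. */
    unsigned long sum_read(const char *path)
    {
        unsigned char buf[1 << 16];
        unsigned long sum = 0;
        int fd = open(path, O_RDONLY);
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; i++)
                sum += buf[i];
        close(fd);
        return sum;
    }

    /* Memory mapped: let the kernel page the file in on demand. */
    unsigned long sum_mmap(const char *path)
    {
        unsigned long sum = 0;
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];
        munmap(p, st.st_size);
        close(fd);
        return sum;
    }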
Localised approximation of hyperbolic tangents
How are hyperbolic tangents implemented? What are the approximants underlying those implementations? How are the formulas derived? How could we parallelise the implementation using SIMD intrinsics?
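As one example of such an approximant (not necessarily the one the article derives), here is a low-order truncation of the Lambert continued fraction for tanh. It is only accurate near zero, so this sketch simply clamps far from it; a real implementation would pair it with proper range reduction.

    #include <math.h>
    #include <stdio.h>

    /* Truncated continued fraction: tanh(x) ~ x(15 + x^2) / (15 + 6x^2).
       Accurate near zero; saturate to +/-1 elsewhere. */
    static double tanh_approx(double x)
    {
        if (x > 3.0) return 1.0;
        if (x < -3.0) return -1.0;
        double x2 = x * x;
        return x * (15.0 + x2) / (15.0 + 6.0 * x2);
    }

    int main(void)
    {
        for (double x = -2.0; x <= 2.0; x += 0.5)
            printf("%5.2f  approx=% .6f  libm=% .6f\n", x, tanh_approx(x), tanh(x));
        return 0;
    }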
Understanding convolution through echoes and blurring
What is convolution? How is it different from correlation? Why do we reverse one of the sequences in linear convolution? Why do we flip both axes in two-dimensional convolution? How can we understand convolution intuitively?
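As a concrete anchor for the reversal question, here is a sketch of direct (full) linear convolution; the n - k index is exactly where one of the sequences gets reversed (function and variable names are illustrative).

    #include <stdio.h>

    /* Full linear convolution: y[n] = sum_k x[k] * h[n - k],
       with length(y) = nx + nh - 1. The n - k index is the "flip". */
    void convolve(const double *x, int nx, const double *h, int nh, double *y)
    {
        for (int n = 0; n < nx + nh - 1; n++) {
            y[n] = 0.0;
            for (int k = 0; k < nx; k++) {
                int j = n - k;              /* index into the reversed sequence */
                if (j >= 0 && j < nh)
                    y[n] += x[k] * h[j];
            }
        }
    }

    int main(void)
    {
        double x[] = {1, 2, 3};
        double h[] = {1, 1, 1};   /* a crude blur/echo kernel */
        double y[5];
        convolve(x, 3, h, 3, y);
        for (int i = 0; i < 5; i++)
            printf("%g ", y[i]);  /* prints: 1 3 6 5 3 */
        printf("\n");
        return 0;
    }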
Finding direction of steepest ascent
During backpropagation, how do we decide how to update the weights in the neural network?
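The short answer, which the article presumably develops in full: the gradient of the loss with respect to a weight points in the direction of steepest ascent of the loss, so gradient descent steps against it,

    w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial L}{\partial w_{ij}},
    \qquad \eta > 0 \text{ (the learning rate).}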
Deriving the backpropagation algorithm
What is backpropagation? How does it work? How do we use it for supervised learning? How do we vectorise it for implementation using matrix multiplication?
What is the impact of loop ordering in matrix multiplication? How do we improve cache utilisation? How do we implement matrix multiplication using SIMD intrinsics?
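A minimal sketch of the loop-ordering point (illustrative code, not the article's): both functions compute c = a * b for row-major n-by-n matrices, but the i-k-j order walks rows of b and c sequentially, which is far kinder to the cache than the textbook i-j-k order whose inner loop strides down a column of b.

    /* Textbook order: the inner loop strides down a column of b,
       touching a new cache line on almost every iteration. */
    void matmul_ijk(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }

    /* Reordered: the inner loop walks rows of b and c sequentially,
       so consecutive iterations reuse the same cache lines. */
    void matmul_ikj(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i*n + j] = 0.0f;
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                float aik = a[i*n + k];
                for (int j = 0; j < n; j++)
                    c[i*n + j] += aik * b[k*n + j];
            }
    }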
Why are p-values distributed uniformly under the null hypothesis?
The False Discovery Rate (FDR) approach to p-value adjustment in multiple significance testing relies on p-values being distributed uniformly under the null hypothesis. Why is this the case?
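A sketch of the standard argument, assuming a continuous test statistic T with CDF F under H_0 and p = 1 - F(T):

    \Pr(p \le \alpha \mid H_0)
      = \Pr\bigl(1 - F(T) \le \alpha\bigr)
      = \Pr\bigl(F(T) \ge 1 - \alpha\bigr)
      = \alpha, \qquad 0 \le \alpha \le 1,

because F(T) is itself Uniform(0, 1) when F is continuous (the probability integral transform).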
Why divide by (n-1) when calculating sample variance?
Why are the formulas for calculating population variance and sample variance different?
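For reference, the two estimators in question, for a population of size N with mean \mu and a sample of size n with mean \bar{x}:

    \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
    \qquad\text{vs.}\qquad
    s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.

Dividing by n would systematically underestimate \sigma^2, because the deviations are measured from \bar{x}, which is fitted to the very same data; the (n - 1) denominator (Bessel's correction) makes s^2 unbiased.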
Understanding the method of least squares
What is modelling? How do models predict values? How do we formulate our prediction objective? How do we find the model parameters? How do we minimise prediction error?
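For the linear-model case (an assumption on my part about the article's focus), the objective and its closed-form solution can be written as

    \hat{\beta} = \arg\min_{\beta}\,\lVert y - X\beta \rVert^2
    \quad\Longrightarrow\quad
    X^{\top}X\,\hat{\beta} = X^{\top}y,

where X is the design matrix, y the vector of observations and \beta the model parameters; the second expression gives the normal equations.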
For systems I developed in the past, please see https://github.com/gyaikhom.
The following are a few programs implementing interesting algorithms. Where available, I have provided literate programs that explain the implementation in detail. Before using these, please read the disclaimer on the About page.
Finding strongly connected components in a directed graph
Understanding Receiver Operating Characteristic (ROC) Curves