03 June 2024, Monday
List of articles.
GPT-2 parallelisation: layer normalisation with SIMD (part 3)
SIMD parallelisation of layer normalisation.
GPT-2 parallelisation: embedding with CUDA (part 2)
Parallelisation strategies for token embedding and positional encoding using CUDA.
Statistical analysis of performance improvements due to optimisation techniques
How do you measure and compare application performance, and how can you be sure that a performance improvement attributed to a particular optimisation technique is not a fluke? How do you compare and choose between optimisation strategies?
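One standard way to answer the "fluke" question, assuming repeated timing runs of two variants A and B can be treated as independent samples (whether the article uses exactly this test is an assumption here), is a two-sample comparison of means such as Welch's t statistic:

    t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2 / n_A + s_B^2 / n_B}},

where \bar{x}, s^2 and n are the sample mean, sample variance and number of runs for each variant; a large |t| suggests the observed difference is unlikely to be measurement noise.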
GPT-2 parallelisation: embedding with SIMD (part 1)
First part of a series of experiments with GPT-2 internals and SIMD parallelisation strategies. We address how data enters a Transformer: tokenisation, token embedding and positional encoding.
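As a flavour of what the SIMD side of that looks like, here is a minimal sketch (not the article's actual code) that fuses token embedding and positional encoding for one token using AVX intrinsics; the names out, wte, wpe and the channel count C are illustrative assumptions, and C is assumed to be a multiple of 8.

    #include <immintrin.h>
    #include <stddef.h>

    /* Sum the token embedding row and the positional encoding row
       into out; each row has C floats. Compile with -mavx. */
    void embed_token(float *out, const float *wte, const float *wpe,
                     int token, int pos, int C)
    {
        const float *t = wte + (size_t) token * C;  /* token embedding row */
        const float *p = wpe + (size_t) pos * C;    /* positional encoding row */
        for (int i = 0; i < C; i += 8) {
            __m256 a = _mm256_loadu_ps(t + i);      /* 8 floats of token embedding */
            __m256 b = _mm256_loadu_ps(p + i);      /* 8 floats of positional encoding */
            _mm256_storeu_ps(out + i, _mm256_add_ps(a, b));
        }
    }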
Comparing throughput of memory mapped file to buffered read
How does the throughput, measured in gigabytes per second (GB/s), of memory mapped files compare to that of buffered reads for a sequential access pattern?
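A minimal sketch of the two access paths being compared, assuming POSIX; error handling is omitted, and the byte checksum exists only to stop the compiler eliding the reads (the function names are illustrative, not the article's).

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Buffered read: copy the file through a user-space buffer. */
    unsigned long sum_read(const char *path)
    {
        unsigned char buf[1 << 16];
        unsigned long sum = 0;
        int fd = open(path, O_RDONLY);
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; i++)
                sum += buf[i];
        close(fd);
        return sum;
    }

    /* Memory mapped: let the kernel page the file in on demand. */
    unsigned long sum_mmap(const char *path)
    {
        unsigned long sum = 0;
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];
        munmap(p, st.st_size);
        close(fd);
        return sum;
    }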
Localised approximation of hyperbolic tangents
How are hyperbolic tangents implemented? What are the approximants underlying those implementations? How are the formulas derived? How could we parallelise the implementation using SIMD intrinsics?
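As one example of such an approximant (not necessarily the one the article derives), here is a low-order truncation of the Lambert continued fraction for tanh. It is only accurate near zero, so this sketch simply clamps far from it; a real implementation would pair it with proper range reduction.

    #include <math.h>
    #include <stdio.h>

    /* Truncated continued fraction: tanh(x) ~ x(15 + x^2) / (15 + 6x^2).
       Accurate near zero; saturate to +/-1 elsewhere. */
    static double tanh_approx(double x)
    {
        if (x > 3.0) return 1.0;
        if (x < -3.0) return -1.0;
        double x2 = x * x;
        return x * (15.0 + x2) / (15.0 + 6.0 * x2);
    }

    int main(void)
    {
        for (double x = -2.0; x <= 2.0; x += 0.5)
            printf("%5.2f  approx=% .6f  libm=% .6f\n", x, tanh_approx(x), tanh(x));
        return 0;
    }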
Understanding convolution through echoes and blurring
What is convolution? How is it different from correlation? Why do we reverse one of the sequences in linear convolution? Why do we flip both axes in two-dimensional convolution? How can we understand convolution intuitively?
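As a concrete anchor for the reversal question, here is a sketch of direct (full) linear convolution; the n - k index is exactly where one of the sequences gets reversed (function and variable names are illustrative).

    #include <stdio.h>

    /* Full linear convolution: y[n] = sum_k x[k] * h[n - k],
       with length(y) = nx + nh - 1. The n - k index is the "flip". */
    void convolve(const double *x, int nx, const double *h, int nh, double *y)
    {
        for (int n = 0; n < nx + nh - 1; n++) {
            y[n] = 0.0;
            for (int k = 0; k < nx; k++) {
                int j = n - k;              /* index into the reversed sequence */
                if (j >= 0 && j < nh)
                    y[n] += x[k] * h[j];
            }
        }
    }

    int main(void)
    {
        double x[] = {1, 2, 3};
        double h[] = {1, 1, 1};   /* a crude blur/echo kernel */
        double y[5];
        convolve(x, 3, h, 3, y);
        for (int i = 0; i < 5; i++)
            printf("%g ", y[i]);  /* prints: 1 3 6 5 3 */
        printf("\n");
        return 0;
    }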
Finding direction of steepest ascent
During backpropagation, how do we decide how to update the weights in the neural network?
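The short answer, which the article presumably develops in full: the gradient of the loss with respect to a weight points in the direction of steepest ascent of the loss, so gradient descent steps against it,

    w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial L}{\partial w_{ij}},
    \qquad \eta > 0 \text{ (the learning rate).}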
Deriving the backpropagation algorithm
What is backpropagation? How does it work? How do we use it for supervised learning? How do we vectorise it for implementation using matrix multiplication?
What is the impact of loop ordering in matrix multiplication? How do we improve cache utilisation? How do we implement matrix multiplication using SIMD intrinsics?
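A minimal sketch of the loop-ordering point (illustrative code, not the article's): both functions compute c = a * b for row-major n-by-n matrices, but the i-k-j order walks rows of b and c sequentially, which is far kinder to the cache than the textbook i-j-k order whose inner loop strides down a column of b.

    /* Textbook order: the inner loop strides down a column of b,
       touching a new cache line on almost every iteration. */
    void matmul_ijk(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }

    /* Reordered: the inner loop walks rows of b and c sequentially,
       so consecutive iterations reuse the same cache lines. */
    void matmul_ikj(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                c[i*n + j] = 0.0f;
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                float aik = a[i*n + k];
                for (int j = 0; j < n; j++)
                    c[i*n + j] += aik * b[k*n + j];
            }
    }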
Why are p-values distributed uniformly under the null hypothesis?
The False Discovery Rate (FDR) approach to p-value adjustment in multiple significance testing relies on p-values being distributed uniformly under the null hypothesis. Why is this the case?
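A sketch of the standard argument, assuming a continuous test statistic T with CDF F under H_0 and p = 1 - F(T):

    \Pr(p \le \alpha \mid H_0)
      = \Pr\bigl(1 - F(T) \le \alpha\bigr)
      = \Pr\bigl(F(T) \ge 1 - \alpha\bigr)
      = \alpha, \qquad 0 \le \alpha \le 1,

because F(T) is itself Uniform(0, 1) when F is continuous (the probability integral transform).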
Why divide by (n-1) when calculating sample variance?
Why are the formulas for calculating population variance and sample variance different?
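For reference, the two estimators in question, for a population of size N with mean \mu and a sample of size n with mean \bar{x}:

    \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2
    \qquad\text{vs.}\qquad
    s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.

Dividing by n would systematically underestimate \sigma^2, because the deviations are measured from \bar{x}, which is fitted to the very same data; the (n - 1) denominator (Bessel's correction) makes s^2 unbiased.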
Understanding the method of least squares
What is modelling? How do models predict values? How do we formulate our prediction objective? How do we find the model parameters? How do we minimise prediction error?
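For the linear-model case (an assumption on my part about the article's focus), the objective and its closed-form solution can be written as

    \hat{\beta} = \arg\min_{\beta}\,\lVert y - X\beta \rVert^2
    \quad\Longrightarrow\quad
    X^{\top}X\,\hat{\beta} = X^{\top}y,

where X is the design matrix, y the vector of observations and \beta the model parameters; the second expression gives the normal equations.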
For systems I developed in the past, please see https://github.com/gyaikhom.
The following are a few programs implementing interesting algorithms. Where available, I have provided literate programs that explain the implementation in detail. Before using these, please read the disclaimer on the About page.
Finding strongly connected components in a directed graph
Understanding Receiver Operating Characteristic (ROC) Curves