LLM2 70B on an NVIDIA 4090 II

Running a single question against LLM2 was tedious: it took 20 minutes to get an answer. So I looked for improvements.

The developers of the code I used


showed their de-optimizations in



Instead of loading the complete model and running inference on it in one go, they use a sequential algorithm.

Their algorithm first segments the neural network into shards of 1.6 GB each, with each shard holding one network layer. It then loads the layers into the GPU one at a time and processes each in turn, keeping track of the intermediate results.
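A minimal sketch of that layer-by-layer idea, in plain Python with toy numbers. This is not the developers' code: the function names, the JSON shard files and the tiny "layers" are all hypothetical stand-ins, and reading a file stands in for copying a shard to the GPU.

```python
import json
import os
import tempfile


def save_shards(layers, directory):
    """Write each layer's weights to its own file: one shard per layer."""
    paths = []
    for i, weights in enumerate(layers):
        path = os.path.join(directory, f"layer_{i:03d}.json")
        with open(path, "w") as f:
            json.dump(weights, f)
        paths.append(path)
    return paths


def apply_layer(weights, activations):
    """Toy layer (assumption): matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * a for w, a in zip(row, activations)))
            for row in weights]


def sequential_forward(shard_paths, activations):
    """Run the network while holding only one layer's weights at a time."""
    for path in shard_paths:
        with open(path) as f:
            weights = json.load(f)  # stands in for "load this shard into the GPU"
        activations = apply_layer(weights, activations)  # intermediate result
        del weights  # drop the shard before the next one is loaded
    return activations


def full_forward(layers, activations):
    """Reference: the same computation with every layer resident in memory."""
    for weights in layers:
        activations = apply_layer(weights, activations)
    return activations


# Hypothetical three-layer toy network.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, -1.0]],
    [[2.0, 1.0]],
]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_shards(layers, tmp)
    result = sequential_forward(paths, [1.0, 3.0])

print(result)
```

Both variants compute the same result. The sequential one trades peak memory for one file read per layer on every pass.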

I used a similar technique in my diploma thesis 25 years ago, simulating the flow of a current through the lattice of a crystal. The crystal lattice is 3D, but I broke the problem down so that only three layers of the lattice had to live in RAM at a time to perform the complex operations.
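The three-layer trick can be sketched as a sliding window over the planes of a 3D grid. The code below is only an illustration under my own assumptions: the 7-point Laplacian stencil, the generator `plane_stream` and the test field are hypothetical and not the computation from the thesis.

```python
from collections import deque


def plane_stream(nz, ny, nx, value):
    """Hypothetical source that yields one z-plane at a time (e.g. read from disk)."""
    for z in range(nz):
        yield [[value(z, y, x) for x in range(nx)] for y in range(ny)]


def laplacian_planes(planes):
    """Yield the 7-point Laplacian of each interior plane.

    Only three consecutive planes are held in memory at any time.
    Boundary points of each plane are left at 0.0.
    """
    window = deque(maxlen=3)  # the oldest plane falls out automatically
    for plane in planes:
        window.append(plane)
        if len(window) < 3:
            continue
        below, mid, above = window
        ny, nx = len(mid), len(mid[0])
        out = [[0.0] * nx for _ in range(ny)]
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                out[y][x] = (below[y][x] + above[y][x]
                             + mid[y - 1][x] + mid[y + 1][x]
                             + mid[y][x - 1] + mid[y][x + 1]
                             - 6.0 * mid[y][x])
        yield out


def field(z, y, x):
    """Test field x^2 + y^2 + z^2, whose discrete Laplacian is 6 everywhere."""
    return float(x * x + y * y + z * z)


results = list(laplacian_planes(plane_stream(5, 4, 4, field)))
print(len(results), results[0][1][1])
```

The design point is the same as with the model shards: a stencil that only touches neighbouring planes never needs the whole lattice in RAM.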

So I was fascinated to see the same idea applied here.

Serializing an operation that could be performed without any IO into many operations that each need IO will always make things slower, simply because IO costs several orders of magnitude more time than operations inside the processing unit, be that a CPU or a GPU.
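A back-of-the-envelope estimate shows how quickly the IO adds up. Only the 1.6 GB shard size and the 20 minutes come from this post; the number of layers and the read throughput are assumptions picked for illustration, and compute time is ignored entirely.

```python
import math


def seconds_per_pass(n_layers, shard_gb, read_gb_per_s):
    """Time spent just reading every shard once, i.e. one pass through the model."""
    return n_layers * shard_gb / read_gb_per_s


def passes_in(budget_s, n_layers, shard_gb, read_gb_per_s):
    """Number of full passes that fit into a time budget if IO were the only cost."""
    return math.floor(budget_s / seconds_per_pass(n_layers, shard_gb, read_gb_per_s))


SHARD_GB = 1.6             # from the post
ANSWER_SECONDS = 20 * 60   # from the post
N_LAYERS = 80              # assumption, not stated in the post
READ_GB_PER_S = 2.0        # assumption: sustained read throughput of the disk

per_pass = seconds_per_pass(N_LAYERS, SHARD_GB, READ_GB_PER_S)
n_passes = passes_in(ANSWER_SECONDS, N_LAYERS, SHARD_GB, READ_GB_PER_S)

print(f"{per_pass:.0f} s of pure IO per pass, {n_passes} passes in 20 minutes")
```

With these assumed numbers, one pass through all shards costs about a minute of reading alone. If every generated token needs its own pass, as I would expect with this scheme, a 20 minute budget leaves room for only a handful of tokens.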

In fact, the simple question "Who is the president of the USA" took 20 minutes to produce the incorrect answer "Barack Obama".

Still, the liberation of having a "home AI", independent of all those web services (which I never tried), was a milestone.

None of the other downscaled LLMs that can run on a 4090 card come even close to the richness of the 70B LLM2 model.