I recently experimented with running llama-3.1-8b-instruct locally on consumer hardware (my Nvidia RTX 4060 with 8GB VRAM), as I wanted to try prompting PDFs with a large context, which is extremely expensive with how LLMs are priced.
I was able to fit the model, at decent speeds (30 tokens/second) and with a 20k token context, completely on the GPU.
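For anyone wondering why 20k tokens is about the limit on 8GB, here is a rough back-of-the-envelope sketch of the KV cache size. The layer count, KV head count and head dimension are the published Llama 3.1 8B values; the fp16 cache and the ~4.6 GiB figure for 4-bit quantized weights are assumptions on my part, since the quantization isn't stated above, and runtime overhead is ignored.

```python
# Rough VRAM estimate for Llama 3.1 8B with a long context.
# Architecture values are from the published model config;
# cache precision and weight size are ASSUMPTIONS (see below).

GIB = 1024 ** 3


def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens.

    bytes_per_value=2 assumes an fp16 cache (an assumption, not a given).
    The leading 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


per_token = kv_cache_bytes(1)          # 131072 bytes = 128 KiB per token
context_tokens = 20_000
kv_gib = kv_cache_bytes(context_tokens) / GIB

# ASSUMPTION: roughly 4.6 GiB for a 4-bit quantized copy of the weights.
assumed_weights_gib = 4.6
total_gib = kv_gib + assumed_weights_gib

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for {context_tokens} tokens: {kv_gib:.2f} GiB")
print(f"Weights (assumed) + KV cache: {total_gib:.2f} GiB")
```

Under those assumptions the cache alone is a bit under 2.5 GiB at 20k tokens, so weights plus cache land around 7 GiB, which leaves very little headroom on an 8GB card.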
For summarization, the performance of these models is decent enough. Unfortunately, for my use case, Gemini's free tier, with its multimodal capabilities and much better output quality, made running local LLMs not really worth it right now, at least for consumers.
Supposedly, submitting screenshots of PDFs (at a large enough zoom per tile/page) to OpenAI's GPT-4o or Google's equivalent is currently the best way of handling charts and tables.