
Open access for the next 5 hours (8GiB model, running on an RTX 3090), or until the server crashes or this spot instance gets taken away :) =>

https://ofo1j9j6qh20a8-80.proxy.runpod.net

  ./build/bin/llama-server \
   -m ../Bonsai-8B.gguf \
   -ngl 999 \
   --flash-attn on \
   --host 0.0.0.0 \
   --port 80 \
   --ctx-size 65500 \
   --batch-size 512 \
   --ubatch-size 512 \
   --parallel 5 \
   --cont-batching \
   --threads 8 \
   --threads-batch 8 \
   --cache-type-k q4_0 \
   --cache-type-v q4_0 \
   --log-colors on
The server can serve 5 parallel requests, with each request capped at around `13K` tokens...
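
For anyone wondering where the ~13K comes from: a quick sketch of the arithmetic, assuming llama-server splits `--ctx-size` evenly across the `--parallel` slots (that is my understanding of the default behavior, it may differ between builds). The variable names are just for illustration.

```shell
# Values taken from the llama-server command above.
ctx_size=65500   # --ctx-size
parallel=5       # --parallel

# Assumption: the total context is divided evenly across the slots.
per_slot=$(( ctx_size / parallel ))
echo "tokens per slot: ${per_slot}"
```

So to give each request more room, either raise `--ctx-size` or lower `--parallel`.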

A few benchmarks I ran:

1. Input: 700 tokens, ttfs: ~0 seconds, output: 1822 tokens at ~190 t/s

2. Input: 6400+ tokens, ttfs: ~2 seconds, output: 2012 tokens at ~135 t/s

VRAM usage was consistently at ~4GiB.



Better keep the KV cache in full precision


Wow.. the GOAT himself.. thank you sooo much for creating llama.cpp ... will re-deploy with full kv cache once requests stop coming.
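
A sketch of what that re-deploy might look like, assuming "full precision" here means `f16` (which, as far as I know, is llama.cpp's default KV cache type). Only the two `--cache-type-*` flags change; everything else is copied from the original command. Expect higher VRAM usage than with `q4_0`, so the context size may need to come down.

```shell
# Same launch as before, with the KV cache left unquantized.
# Assumption: f16 is what "full precision" refers to.
./build/bin/llama-server \
  -m ../Bonsai-8B.gguf \
  -ngl 999 \
  --flash-attn on \
  --host 0.0.0.0 \
  --port 80 \
  --ctx-size 65500 \
  --batch-size 512 \
  --ubatch-size 512 \
  --parallel 5 \
  --cont-batching \
  --threads 8 \
  --threads-batch 8 \
  --cache-type-k f16 \
  --cache-type-v f16 \
  --log-colors on
```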


I genuinely love talking to these models

https://ofo1j9j6qh20a8-80.proxy.runpod.net/#/chat/5554e479-0...

I'm contemplating whether I should drive or walk to the car wash (I just thought of that one HN post) and this is what it said after a few back-and-forths:

- Drive to the car (5 minutes), then park and wash.

- If you have a car wash nearby, you can walk there (2 minutes) and do the washing before driving to your car.

- If you're in a car wash location, drive to it and wash there.

Technically the last point was fine, but I like the creativity.


That was really impressive. https://pastebin.com/PmJmTLJN pretty much instantly. (Very weak models can't do this.)


Update: this has been evicted by RunPod, as it was on a spot instance.


Kind sir, may I say thanks to you for doing so! I really appreciate it :D


Thank you! I am impressed by the speed of it.



