There are several benefits to running LLMs (Large Language Models) locally, chief among them privacy and the ability to work offline.
Tools such as 🦙 Ollama or LM Studio make it easier to run models locally. I personally use LM Studio for its MLX support: with MLX, models are supposed to run faster on Apple silicon.
Both Ollama and LM Studio have a built-in chat interface, but there are alternatives such as open-webui.
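Beyond the chat interfaces, these tools also expose a local HTTP API you can script against. As a sketch, here is a minimal Python client for Ollama's `/api/generate` endpoint at its default address (`localhost:11434`); the model name used in the example is just an illustration and must already be pulled locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> dict:
    # Minimal payload for /api/generate; stream=False asks Ollama
    # to return the whole completion as a single JSON object.
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # The full completion text is in the "response" field.
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server with the model pulled):
# print(generate("qwen3:4b", "Why is the sky blue?"))
```

The same idea works for LM Studio, which serves an OpenAI-compatible API on a different port.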
## Models
Most models used by cloud providers require massive amounts of memory to run efficiently, so running LLMs locally significantly restricts which models you can use. There are two main ways to trade capability for footprint:
- Use models with a lower number of parameters
- Use a quantized model
For example, on a base Mac Mini M4 (16 GB of unified memory) you can comfortably run models up to about 12 GB, though they will be fairly slow. That corresponds to a very heavily quantized GPT-OSS 20B or Qwen3 14B. My personal preference goes to significantly smaller models, such as Qwen3 4B, as they run significantly faster. This depends a lot on the specific workload, of course.
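You can roughly estimate how much memory a model needs from its parameter count and quantization level. The sketch below uses a crude assumption of ~20% overhead on top of the raw weights (for the KV cache and activations); real usage varies with context length and runtime.

```python
def model_memory_gb(params_billions: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    # Raw weight size: parameters × bits per weight, converted to bytes.
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    # Assumed ~20% overhead for KV cache and activations (a rough guess).
    return bytes_total * overhead / 1e9


# A 14B model at 4-bit quantization:
print(round(model_memory_gb(14, 4), 1))   # ~8.4 GB
# The same model unquantized at 16-bit:
print(round(model_memory_gb(14, 16), 1))  # ~33.6 GB
```

This is why a 4-bit Qwen3 14B fits on a 16 GB machine while the 16-bit version does not even come close.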