When working with large language models (LLMs) on a MacBook Pro with Apple’s M1 or M2 chips, or on Ubuntu with dual NVIDIA 2080 Ti GPUs, it’s essential to optimize performance and utilize GPU acceleration effectively. This blog post will explore how to install and use Ollama with GPU support, as well as other similar tools that provide OpenAI API compatibility.
1. Ollama on macOS with GPU Acceleration
Ollama utilizes the llama.cpp library, which supports the Metal API for GPU acceleration on macOS. Here’s how to enable and verify GPU acceleration with Ollama:
Enabling GPU Acceleration
Ollama detects your hardware automatically and uses GPU acceleration by default, so no additional configuration is needed. On a MacBook with an M1 or M2 chip, it relies on the Metal framework to accelerate inference.
Verifying GPU Acceleration
To check whether GPU acceleration is active, run an inference task and compare its speed against a CPU-only run; you can also scan Ollama's server logs for lines indicating that the Metal backend was initialized.
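One practical check is to time a short generation through Ollama's local REST API and look at the token throughput it reports; a Metal-accelerated run on an M1/M2 is typically several times faster than a CPU-only run. This is a minimal sketch, assuming Ollama is serving on its default port (11434) and that a model such as llama2 has already been pulled.
import requests

# Assumes Ollama is running locally and the "llama2" model has been pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
).json()

# The generate endpoint reports eval_count (tokens) and eval_duration (nanoseconds).
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{resp['eval_count']} tokens at {tokens_per_second:.1f} tokens/s")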
Performance Optimization
- Use Quantized Models: Ollama supports quantized versions of models (e.g., 4-bit, 8-bit), which reduce memory use and speed up GPU inference; see the short example after this list.
- Select Smaller Models: Opt for smaller variants (such as llama-7b) to reduce hardware load.
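As an illustration, the official ollama Python package can pull a quantized variant by tag and run a prompt against it. The tag below is only an example; the quantizations actually published for each model are listed in the Ollama model library.
import ollama  # pip install ollama

# Example tag for a 4-bit quantized build; check the Ollama library for real tags.
model_tag = "llama2:7b-q4_0"
ollama.pull(model_tag)
result = ollama.generate(model=model_tag, prompt="Summarize the benefits of quantization.")
print(result["response"])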
2. Similar Tools and Platforms
If you’re looking for alternatives to Ollama that also support local execution and GPU acceleration with OpenAI API compatibility, consider the following:
(1) Text Generation Web UI
- Description: An open-source web interface for running models such as LLaMA, GPT-J, and more.
- Features:
- OpenAI API compatibility.
- macOS Metal GPU acceleration.
- Installation:
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt
python server.py
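Once the server is running with its API enabled (recent versions expose an OpenAI-compatible API when launched with the --api flag, listening on port 5000 by default), you can call it like any OpenAI endpoint. The flag and port are the project's defaults at the time of writing; adjust them to your setup.
import requests

# Assumes the web UI was started with: python server.py --api
resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "One sentence on the Metal API, please."}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])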
(2) GPT4All
- Description: A local LLM project supporting multiple models.
- Features:
- macOS Metal acceleration.
- Easy-to-use CLI and GUI.
- Installation:
brew install --cask gpt4all
open -a GPT4All
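GPT4All also ships a Python SDK (pip install gpt4all) that uses Metal on Apple Silicon. A minimal sketch; the model file name is just an example from the GPT4All catalog and is downloaded on first use.
from gpt4all import GPT4All

# Example model from the GPT4All catalog; downloaded automatically on first use.
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
with model.chat_session():
    print(model.generate("Explain GPU acceleration in one sentence.", max_tokens=100))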
(3) LocalAI
- Description: A self-hosted, OpenAI-compatible API server for running local models such as LLaMA.
- Features:
- macOS Metal acceleration.
- REST API support.
- Installation:
curl -LO https://github.com/go-skynet/LocalAI/releases/download/v1.0.0/local-ai-darwin-arm64
chmod +x local-ai-darwin-arm64
./local-ai-darwin-arm64
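Because LocalAI mirrors the OpenAI REST API, the standard openai Python client works against it. The sketch below assumes the server is listening on port 8080 and that a model has been placed in its models directory; replace the model name with whatever you configured.
from openai import OpenAI  # pip install openai

# Point the client at the local LocalAI server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

completion = client.completions.create(
    model="your-model-name",  # must match a model configured in LocalAI's models path
    prompt="Write a haiku about local inference.",
    max_tokens=64,
)
print(completion.choices[0].text)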
(4) MLC LLM
- Description: A framework optimized for macOS and mobile devices.
- Features:
- GPU acceleration via Apple Metal.
- Supports multiple models.
- Installation: Download precompiled binaries and load models to run.
(5) llama.cpp
- Description: A framework for running LLaMA models efficiently.
- Features:
- Metal API support.
- Installation:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
./main -m path/to/llama/model
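If you prefer to drive llama.cpp from Python, the separate llama-cpp-python bindings expose the same Metal backend. A minimal sketch; the model path is a placeholder for a local GGUF file, and n_gpu_layers=-1 asks it to offload every layer to the GPU.
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path to a local GGUF model; -1 offloads all layers to Metal.
llm = Llama(model_path="path/to/llama/model.gguf", n_gpu_layers=-1)

output = llm("Q: What does the Metal backend accelerate? A:", max_tokens=64)
print(output["choices"][0]["text"])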
3. Using Multiple NVIDIA GPUs on Ubuntu
To maximize the utilization of two NVIDIA 2080 Ti GPUs on Ubuntu while serving large models with OpenAI-compatible APIs, follow these steps:
System and Environment Preparation
- Install NVIDIA Driver:
sudo apt update
sudo apt install -y nvidia-driver-530
sudo reboot
- Install CUDA and cuDNN: Ensure compatibility with your driver version.
- Install Python and Dependencies:
sudo apt install -y python3 python3-pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
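Before moving on, it is worth confirming that PyTorch actually sees both 2080 Ti cards:
import torch

# Expect CUDA to be available and a device count of 2 on a dual-2080 Ti machine.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")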
Using Distributed Inference Frameworks
Frameworks such as DeepSpeed or Hugging Face Accelerate can split a large model across both GPUs:
DeepSpeed
- Install:
pip install deepspeed
- Inference Example (launch with the DeepSpeed launcher, e.g. deepspeed --num_gpus 2 infer.py):
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Shard the model across 2 GPUs (model parallelism) and run in fp16
ds_engine = deepspeed.init_inference(model=model, mp_size=2, dtype=torch.float16)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = ds_engine.module.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Hugging Face Accelerate
- Install:
pip install accelerate
- Configuration:
accelerate config
- Inference Example:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" (powered by Accelerate) spreads the layers across both GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
4. Providing OpenAI-Compatible APIs
To serve OpenAI-compatible APIs, consider using FastAPI or LocalAI:
FastAPI + Uvicorn
- Install:
pip install fastapi uvicorn
- API Service Code (save as serve.py):
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()
model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda:0")

# Request schema for the OpenAI-style completions endpoint
class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 100

@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"choices": [{"text": text}]}
- Run:
uvicorn serve:app --host 0.0.0.0 --port 8000
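With the server running, a quick smoke test against the sketch above:
import requests

# Calls the /v1/completions route defined in serve.py above.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Once upon a time", "max_tokens": 50},
)
print(resp.json()["choices"][0]["text"])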
LocalAI
- Installation follows similar steps as described previously.
- Run:
localai --models-path /models --api-port 8080
Conclusion
This guide outlines how to efficiently leverage GPU acceleration for local LLM inference using Ollama and other OpenAI-compatible tools. For MacBook users, Ollama provides a seamless experience with Metal API integration, while Ubuntu users can optimize their dual GPU setup through distributed frameworks like DeepSpeed and Hugging Face Accelerate. By following the suggested configurations, developers can maximize their hardware’s potential while providing robust API services.

