When working with large language models (LLMs) on a MacBook Pro with Apple’s M1 or M2 chips, or on Ubuntu with dual NVIDIA 2080 Ti GPUs, it’s essential to optimize performance and utilize GPU acceleration effectively. This blog post will explore how to install and use Ollama with GPU support, as well as other similar tools that provide OpenAI API compatibility.
1. Ollama on macOS with GPU Acceleration
Ollama utilizes the llama.cpp library, which supports the Metal API for GPU acceleration on macOS. Here’s how to enable and verify GPU acceleration with Ollama:
Enabling GPU Acceleration
Ollama detects your hardware automatically and uses GPU acceleration by default, so no additional configuration is needed. On a MacBook with an M1 or M2 chip, it relies on the Metal framework to accelerate inference.
Verifying GPU Acceleration
To check whether GPU acceleration is active, run an inference task and compare its speed against a CPU-only run; you can also scan Ollama's server logs for lines indicating that the Metal backend was initialized.
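One practical check is to time a short generation through Ollama's local REST API and look at the token throughput it reports; a Metal-accelerated run on an M1/M2 is typically several times faster than a CPU-only run. This is a minimal sketch, assuming Ollama is serving on its default port (11434) and that a model such as llama2 has already been pulled.
import requests

# Assumes Ollama is running locally and the "llama2" model has been pulled.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
).json()

# The generate endpoint reports eval_count (tokens) and eval_duration (nanoseconds).
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{resp['eval_count']} tokens at {tokens_per_second:.1f} tokens/s")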
Performance Optimization
- Use Quantized Models: Ollama supports quantized versions of models (e.g., 4-bit, 8-bit), which reduce memory use and speed up GPU inference; see the short example after this list.
- Select Smaller Models: Opt for smaller variants (such as llama-7b) to reduce hardware load.
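As an illustration, the official ollama Python package can pull a quantized variant by tag and run a prompt against it. The tag below is only an example; the quantizations actually published for each model are listed in the Ollama model library.
import ollama  # pip install ollama

# Example tag for a 4-bit quantized build; check the Ollama library for real tags.
model_tag = "llama2:7b-q4_0"
ollama.pull(model_tag)
result = ollama.generate(model=model_tag, prompt="Summarize the benefits of quantization.")
print(result["response"])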
2. Similar Tools and Platforms
If you’re looking for alternatives to Ollama that also support local execution and GPU acceleration with OpenAI API compatibility, consider the following:
(1) Text Generation Web UI
- Description: An open-source web interface for running models such as LLaMA, GPT-J, and more.
- Features:
- OpenAI API compatibility.
- macOS Metal GPU acceleration.
- Installation:
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt
python server.py
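Once the server is running with its API enabled (recent versions expose an OpenAI-compatible API when launched with the --api flag, listening on port 5000 by default), you can call it like any OpenAI endpoint. The flag and port are the project's defaults at the time of writing; adjust them to your setup.
import requests

# Assumes the web UI was started with: python server.py --api
resp = requests.post(
    "http://localhost:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "One sentence on the Metal API, please."}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])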
(2) GPT4All
- Description: A local LLM project supporting multiple models.
- Features:
- macOS Metal acceleration.
- Easy-to-use CLI and GUI.
- Installation:
brew install --cask gpt4all
open -a GPT4All
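GPT4All also ships a Python SDK (pip install gpt4all) that uses Metal on Apple Silicon. A minimal sketch; the model file name is just an example from the GPT4All catalog and is downloaded on first use.
from gpt4all import GPT4All

# Example model from the GPT4All catalog; downloaded automatically on first use.
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
with model.chat_session():
    print(model.generate("Explain GPU acceleration in one sentence.", max_tokens=100))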
(3) LocalAI
- Description: A self-hosted, OpenAI-compatible API server for running local models such as LLaMA.
- Features:
- macOS Metal acceleration.
- REST API support.
- Installation:
curl -LO https://github.com/go-skynet/LocalAI/releases/download/v1.0.0/local-ai-darwin-arm64
chmod +x local-ai-darwin-arm64
./local-ai-darwin-arm64
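Because LocalAI mirrors the OpenAI REST API, the standard openai Python client works against it. The sketch below assumes the server is listening on port 8080 and that a model has been placed in its models directory; replace the model name with whatever you configured.
from openai import OpenAI  # pip install openai

# Point the client at the local LocalAI server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

completion = client.completions.create(
    model="your-model-name",  # must match a model configured in LocalAI's models path
    prompt="Write a haiku about local inference.",
    max_tokens=64,
)
print(completion.choices[0].text)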
(4) MLC LLM
- Description: A framework optimized for macOS and mobile devices.
- Features:
- GPU acceleration via Apple Metal.
- Supports multiple models.
- Installation: Download precompiled binaries and load models to run.
(5) llama.cpp
- Description: A framework for running LLaMA models efficiently.
- Features:
- Metal API support.
- Installation:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
./main -m path/to/llama/model
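If you prefer to drive llama.cpp from Python, the separate llama-cpp-python bindings expose the same Metal backend. A minimal sketch; the model path is a placeholder for a local GGUF file, and n_gpu_layers=-1 asks it to offload every layer to the GPU.
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path to a local GGUF model; -1 offloads all layers to Metal.
llm = Llama(model_path="path/to/llama/model.gguf", n_gpu_layers=-1)

output = llm("Q: What does the Metal backend accelerate? A:", max_tokens=64)
print(output["choices"][0]["text"])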
3. Using Multiple NVIDIA GPUs on Ubuntu
To maximize the utilization of two NVIDIA 2080 Ti GPUs on Ubuntu while serving large models with OpenAI-compatible APIs, follow these steps:
System and Environment Preparation
- Install NVIDIA Driver:
sudo apt update
sudo apt install -y nvidia-driver-530
sudo reboot
- Install CUDA and cuDNN: Ensure compatibility with your driver version.
- Install Python and Dependencies:
sudo apt install -y python3 python3-pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
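Before moving on, it is worth confirming that PyTorch actually sees both 2080 Ti cards:
import torch

# Expect CUDA to be available and a device count of 2 on a dual-2080 Ti machine.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")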
Using Distributed Inference Frameworks
Frameworks such as DeepSpeed or Hugging Face Accelerate can split a large model across both GPUs:
DeepSpeed
- Install:
pip install deepspeed
- Inference Example (launch with the DeepSpeed launcher, e.g. deepspeed --num_gpus 2 infer.py):
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Shard the model across 2 GPUs (model parallelism) and run in fp16
ds_engine = deepspeed.init_inference(model=model, mp_size=2, dtype=torch.float16)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = ds_engine.module.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Hugging Face Accelerate
- Install:
pip install accelerate
- Configuration:
accelerate config
- Inference Example:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" (powered by Accelerate) spreads the layers across both GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
4. Providing OpenAI-Compatible APIs
To serve OpenAI-compatible APIs, consider using FastAPI or LocalAI:
FastAPI + Uvicorn
- Install:
pip install fastapi uvicorn
- API Service Code (save as serve.py):
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()
model_name = "EleutherAI/gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda:0")

# Request schema for the OpenAI-style completions endpoint
class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 100

@app.post("/v1/completions")
async def completions(request: CompletionRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"choices": [{"text": text}]}
- Run:
uvicorn serve:app --host 0.0.0.0 --port 8000
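With the server running, a quick smoke test against the sketch above:
import requests

# Calls the /v1/completions route defined in serve.py above.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Once upon a time", "max_tokens": 50},
)
print(resp.json()["choices"][0]["text"])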
LocalAI
- Installation follows similar steps as described previously.
- Run:
localai --models-path /models --api-port 8080
Conclusion
This guide outlines how to efficiently leverage GPU acceleration for local LLM inference using Ollama and other OpenAI-compatible tools. For MacBook users, Ollama provides a seamless experience with Metal API integration, while Ubuntu users can optimize their dual GPU setup through distributed frameworks like DeepSpeed and Hugging Face Accelerate. By following the suggested configurations, developers can maximize their hardware’s potential while providing robust API services.

