Artifact // OLM

OpenLanguage Model.

An open-source LLM library for everyone. It does for LLMs what PyTorch did for deep learning.

By Keshava, Tavish, Vardhaman — OpenLanguageModel Team

01 — The Problem

Why OLM Exists

Typical LLM repositories run to more than 3,000 lines of code, and the barrier to entry is enormous: domain specialization is required just to get started. Answering "Which architecture choice is best?" demands vast, scattered knowledge. No central, up-to-date resource shows beginners how to train a decent LLM, so building language models remains a niche, specialist skill.

OLM is the solution: simplified, modular, and transparent.

Two Audiences, One Library

FOR BEGINNERS

Very easy to start. Train your own LLMs with minimal setup. No domain specialization required.

FOR RESEARCHERS

Go as deep as you need. Make fine-grained architecture changes without trading ease of use for performance or customizability.

02 — Simplicity

Training in < 15 Lines

import os, torch, urllib.request
from torch.utils.data import DataLoader
from tempfile import TemporaryDirectory
from olm import Dataset, HFTokenizer, Trainer, LM

with TemporaryDirectory() as tmp:
    urllib.request.urlretrieve("https://github.com/.../input.txt",
                              os.path.join(tmp, "i.txt"))
    tokenizer, device = HFTokenizer("gpt2"), "cuda" if torch.cuda.is_available() else "cpu"

    # Define Model
    model = LM(tokenizer.vocab_size, 64, 4, 2, 33)
    optimizer = torch.optim.AdamW(model.parameters(), 3e-4)

    # Data & Training
    dataset = Dataset(tmp, tokenizer, 32)
    dataloader = DataLoader(dataset, 4)
    trainer = Trainer(model, optimizer, dataloader, device, 32, use_amp=False)

    losses = trainer.train(1, 10, 100)
    print(f"S:{losses[0]:.4f} E:{losses[-1]:.4f} OK:{losses[-1]<losses[0]}")

Models come from olm.models. Data pipelines come from olm.data. Training orchestration lives in olm.train. Start with this structure and gradually customize any part of it.

03 — Replicated Models

Configurations & Architectures

OLM ships replicated configurations for these major model families:

LLAMA 3

8B · 70B

LLAMA 2

7B · 13B · 70B

PHI-3 / PHI-4

Mini (3.8B) · Small (7B) · Medium (14B)

QWEN 1.5 / QWEN 2

0.5B · 1.8B · 4B · 7B · 14B · 32B · 72B

GEMMA 2

2B · 9B · 27B

GPT-2

Small (124M) · Medium (355M) · Large (774M) · XL (1.5B)

OPT

125M · 1.3B · 6.7B · 66B

OLMo

1B · 7B

04 — Extensibility

Standardized Building Blocks

All components—Attention, FeedForward, Norms—are modular. Want a new loss function? Simply inherit from the base class and modify forward(). No need to rewrite the trainer or data pipeline.
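As a rough sketch of that pattern: the name of OLM's loss base class is not given here, so the example below subclasses torch.nn.Module as a stand-in. ZLossCrossEntropy and z_weight are hypothetical names chosen for illustration. The loss adds a small penalty on the log-partition function to plain cross-entropy, and only forward() carries the custom logic.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZLossCrossEntropy(nn.Module):  # stand-in for OLM's loss base class (assumption)
    """Cross-entropy plus a small penalty on the squared log-partition function."""

    def __init__(self, z_weight: float = 1e-4):
        super().__init__()
        self.z_weight = z_weight  # hypothetical hyperparameter name

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (batch, seq, vocab); targets: (batch, seq)
        flat_logits = logits.reshape(-1, logits.size(-1))
        ce = F.cross_entropy(flat_logits, targets.reshape(-1))
        # log Z per position; penalizing its square discourages logit drift
        log_z = torch.logsumexp(flat_logits, dim=-1)
        return ce + self.z_weight * log_z.pow(2).mean()


torch.manual_seed(0)
logits = torch.randn(2, 5, 11)
targets = torch.randint(0, 11, (2, 5))

loss = ZLossCrossEntropy()(logits, targets)
plain = F.cross_entropy(logits.reshape(-1, 11), targets.reshape(-1))
print(f"custom: {loss.item():.4f}  plain CE: {plain.item():.4f}")
```

Because the module takes logits and targets and returns a scalar, it should slot in wherever a standard cross-entropy loss is expected, assuming the trainer calls the loss with that signature.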

Compute Optimal

The current version aligns with Chinchilla scaling laws and reaches about 60% GPU model FLOPs utilization (MFU), near the state of the art. Future versions will continue to prioritize performance.
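To make these two quantities concrete, here is a back-of-the-envelope sketch. It is not OLM code: the function names are hypothetical, the 20-tokens-per-parameter ratio is the commonly quoted Chinchilla rule of thumb, and the 6 × parameters × tokens estimate of training FLOPs ignores attention FLOPs. The throughput figure is made up for illustration, and 312 TFLOP/s is the dense BF16 peak of an NVIDIA A100.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token budget (rule of thumb, not exact)."""
    return tokens_per_param * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard 6*N*D estimate for forward plus backward passes."""
    return 6.0 * n_params * n_tokens


def mfu(n_params: float, tokens_per_second: float, peak_flops_per_second: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s divided by hardware peak."""
    return 6.0 * n_params * tokens_per_second / peak_flops_per_second


n_params = 124e6                                   # GPT-2 Small
tokens = chinchilla_optimal_tokens(n_params)       # about 2.48e9 tokens
flops = training_flops(n_params, tokens)           # about 1.85e18 FLOPs

# Hypothetical measured throughput on a single A100 (illustrative number only).
util = mfu(n_params, tokens_per_second=250_000, peak_flops_per_second=312e12)

print(f"tokens: {tokens:.3g}  train FLOPs: {flops:.3g}  MFU: {util:.1%}")
```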

Example: Custom Activation

import torch
import torch.nn.functional as F
from olm.nn.activations.base import ActivationBase

class SwiGLU(ActivationBase):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = x.chunk(2, dim=-1)
        return value * F.silu(gate)
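
A usage sketch for the activation above, assuming it behaves like an ordinary PyTorch module. Here nn.Module stands in for ActivationBase so the snippet runs without OLM installed, and the up and down projections are illustrative names. Because forward() splits the last dimension in half, the layer feeding it has to produce twice the hidden width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):  # nn.Module stands in for ActivationBase (assumption)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = x.chunk(2, dim=-1)
        return value * F.silu(gate)


embed_dim, hidden_dim = 16, 32
up = nn.Linear(embed_dim, 2 * hidden_dim, bias=False)  # emits value and gate together
down = nn.Linear(hidden_dim, embed_dim, bias=False)    # projects back to embed_dim

x = torch.randn(4, 10, embed_dim)
h = SwiGLU()(up(x))   # last dimension is halved: 2 * hidden_dim -> hidden_dim
y = down(h)
print(tuple(h.shape), tuple(y.shape))
```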

Example: Complete Llama 3 Architecture

# Block, Residual, Repeat and the layer classes are OLM building blocks.
def Llama3Block(embed_dim, intermediate_size, num_heads, num_kv_heads,
                max_seq_len, dropout, rope_theta):
    """Build one pre-norm decoder layer: attention, then feed-forward."""
    return Block([
        Residual(Block([
            RMSNorm(embed_dim, eps=1e-5),
            GroupedQueryAttention(
                embed_dim, num_heads, num_kv_heads, max_seq_len,
                dropout=dropout, rope_theta=rope_theta, use_bias=False
            )
        ])),
        Residual(Block([
            RMSNorm(embed_dim, eps=1e-5),
            SwiGLUFFN(embed_dim, hidden_dim=intermediate_size,
                      dropout=dropout, bias=False)
        ]))
    ])

Llama3Model = Block([
    Embedding(vocab_size, embed_dim),
    Repeat(lambda: Llama3Block(
        embed_dim, intermediate_size, num_heads, num_kv_heads,
        max_seq_len, dropout, rope_theta
    ), num_layers),
    RMSNorm(embed_dim, eps=1e-5),               # final norm
    Linear(embed_dim, vocab_size, bias=False)   # output projection to vocabulary
])
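
As a sanity check on the layer list above, the sketch below counts parameters for this architecture using the published Llama 3 8B hyperparameters. It is plain arithmetic rather than OLM code: llama3_param_count is a hypothetical helper, and it assumes no biases, untied input and output embeddings, and no learned parameters in the rotary position embedding.

```python
def llama3_param_count(vocab_size: int, embed_dim: int, intermediate_size: int,
                       num_heads: int, num_kv_heads: int, num_layers: int) -> int:
    head_dim = embed_dim // num_heads
    kv_dim = num_kv_heads * head_dim            # grouped-query attention shrinks K and V

    attention = 2 * embed_dim * embed_dim       # query and output projections
    attention += 2 * embed_dim * kv_dim         # key and value projections
    ffn = 3 * embed_dim * intermediate_size     # gate, up and down projections
    norms = 2 * embed_dim                       # one RMSNorm weight vector per sub-block
    per_layer = attention + ffn + norms

    embedding = vocab_size * embed_dim
    final_norm = embed_dim
    lm_head = embed_dim * vocab_size            # untied output projection
    return embedding + num_layers * per_layer + final_norm + lm_head


total = llama3_param_count(
    vocab_size=128256, embed_dim=4096, intermediate_size=14336,
    num_heads=32, num_kv_heads=8, num_layers=32,
)
print(f"{total:,}")  # 8,030,261,248
```

The result lands at roughly 8.03B, which is where the "8B" label comes from; about 1.05B of that sits in the two vocabulary-sized matrices.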

05 — Status & Roadmap

Current Status

4 LLMs trained successfully. v1.0 published. v1.1 ready.

Roadmap

v1.0: Foundation & Core Architectures.

v1.1: On-GPU Optimization (FlashAttention, torch.compile), UX & Refinement.

v2.0: Multi-GPU (DDP, FSDP), Mixture of Experts.

v2.1: Distributed Optimization. ZeRO stages, Expert Parallelism. Multiple MoE routing methods and improved training stability from our own experiments.

v3.0: Scaling Out (Multi-Node). Cluster support, Pipeline/Tensor Parallelism.

v3.1: "Open Source" Goal. Reproduce every single open-source LLM recipe (Llama 3, Mistral).

v4.0: Further Training. SFT, LoRA, RLHF.

06 — Contribute

Call for Contributors

We are looking for contributors for documentation & API reference, minor feature additions, website development & outreach, intuitive UX enhancements, and major roadmap features.

Why Contribute?

Become a Lead Developer. Current team transitioning in 2 months.

Complete the Pipeline. Work on SFT (Supervised Fine-Tuning) and RL methods (RLHF, RLVR).

Grow the Project. Contact companies for compute sponsorship.

Research Impact. Build a scaling library. Successful compute sponsorship → Scaling Laws Paper (high citation potential).

GITHUB REPOSITORY ↗