In this tutorial, we create and run a Colab workflow that shows how to use Google's Gemma 3 1B model with the Hugging Face Transformers library and an HF token in a practical, reproducible, and easy-to-follow step-by-step manner. We start by installing the required libraries, authenticating securely with our Hugging Face token, and loading the tokenizer and model onto an available device with the correct precision settings. From there, we build reusable generation utilities, format prompts in a chat-style structure, and test the model on a variety of realistic tasks, such as basic generation, structured JSON-style responses, prompt chaining, quick benchmarking, and deterministic summarization, so we not only load Gemma but actually work with it in a meaningful way.
import os
import sys
import time
import json
import getpass
import subprocess
import warnings
warnings.filterwarnings("ignore")
def pip_install(*pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])

pip_install(
    "transformers>=4.51.0",
    "accelerate",
    "sentencepiece",
    "safetensors",
    "pandas",
)
import torch
import pandas as pd
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM
print("=" * 100)
print("STEP 1 — Hugging Face authentication")
print("=" * 100)
hf_token = None
try:
    from google.colab import userdata
    try:
        hf_token = userdata.get("HF_TOKEN")
    except Exception:
        hf_token = None
except Exception:
    pass
if not hf_token:
    hf_token = getpass.getpass("Enter your Hugging Face token: ").strip()
login(token=hf_token)
os.environ["HF_TOKEN"] = hf_token
print("HF login successful.")
We have set up the necessary environment in Google Colab so that the tutorial runs smoothly. We install the required libraries, import all core dependencies, and authenticate securely with Hugging Face using our token. By the end of this section, the notebook is ready to access Gemma models and continue the workflow without manual setup issues.
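As an optional refinement (a sketch, not part of the script above; `resolve_hf_token` is a hypothetical helper name), the token lookup can be wrapped in one function that checks the `HF_TOKEN` environment variable before falling back to an interactive prompt, so repeated runs in the same session never re-ask:

```python
import os
import getpass

def resolve_hf_token(env_var="HF_TOKEN"):
    # Prefer a token already stored in the environment (e.g. exported by a
    # previous run or by Colab secrets); prompt interactively otherwise.
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    return getpass.getpass("Enter your Hugging Face token: ").strip()
```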
print("=" * 100)
print("STEP 2 — Device setup")
print("=" * 100)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
print("device:", device)
print("dtype:", dtype)
model_id = "google/gemma-3-1b-it"
print("model_id:", model_id)
print("=" * 100)
print("STEP 3 — Load tokenizer and model")
print("=" * 100)
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token=hf_token,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=hf_token,
    torch_dtype=dtype,
    device_map="auto",
)
model.eval()
print("Tokenizer and model loaded successfully.")
We configure the runtime by detecting whether we are using a GPU or CPU and select the appropriate precision to load the model efficiently. Then we define the Gemma 3 1B Instruct model path and load both the tokenizer and the model from the Hugging Face Hub. At this stage, the core model initialization is complete, making the notebook ready to generate text.
def build_chat_prompt(user_prompt: str):
    messages = [
        {"role": "user", "content": user_prompt}
    ]
    try:
        text = tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
    except Exception:
        text = f"<start_of_turn>user\n{user_prompt}<end_of_turn>\n<start_of_turn>model\n"
    return text

def generate_text(prompt, max_new_tokens=256, temperature=0.7, do_sample=True):
    chat_text = build_chat_prompt(prompt)
    inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature if do_sample else None,
            top_p=0.95 if do_sample else None,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    generated = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()
print("=" * 100)
print("STEP 4 — Basic generation")
print("=" * 100)
prompt1 = """Explain Gemma 3 in plain English.
Then give:
1. one practical use case
2. one limitation
3. one Colab tip
Keep it concise."""
resp1 = generate_text(prompt1, max_new_tokens=220, temperature=0.7, do_sample=True)
print(resp1)
We create reusable functions that format prompts into the expected chat structure and handle text generation from the model. We make the inference pipeline modular so that we can reuse the same functions across different tasks in the notebook. After that, we run the first practical generation example to confirm that the model is working correctly and producing meaningful output.
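The manual fallback inside `build_chat_prompt` follows Gemma's turn-marker format. As a standalone sketch of just that formatting step (with `format_gemma_turn` a hypothetical name, not part of the original script), it looks like this:

```python
def format_gemma_turn(user_prompt: str) -> str:
    # Gemma instruct models delimit turns with <start_of_turn>/<end_of_turn>;
    # ending the string with an open model turn cues the model to respond.
    return (
        f"<start_of_turn>user\n{user_prompt}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
```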
print("=" * 100)
print("STEP 5 — Structured output")
print("=" * 100)
prompt2 = """
Compare local open-weight model usage vs API-hosted model usage.
Return JSON with this schema:
{
  "local": {
    "pros": ["", "", ""],
    "cons": ["", "", ""]
  },
  "api": {
    "pros": ["", "", ""],
    "cons": ["", "", ""]
  },
  "best_for": {
    "local": "",
    "api": ""
  }
}
Only output JSON.
"""
resp2 = generate_text(prompt2, max_new_tokens=300, temperature=0.2, do_sample=True)
print(resp2)
print("=" * 100)
print("STEP 6 — Prompt chaining")
print("=" * 100)
task = "Draft a 5-step checklist for evaluating whether Gemma fits an internal enterprise prototype."
resp3 = generate_text(task, max_new_tokens=250, temperature=0.6, do_sample=True)
print(resp3)
followup = f"""
Here is an initial checklist:
{resp3}
Now rewrite it for a product manager audience.
"""
resp4 = generate_text(followup, max_new_tokens=250, temperature=0.6, do_sample=True)
print(resp4)
We extend the model beyond simple prompting by examining structured output generation and prompt chaining. We ask Gemma to return a response in a defined JSON-like format and then use a follow-up directive to transform the earlier response for a different audience. This helps us see how the model handles formatting constraints and multi-step refinements in a realistic workflow.
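Small models sometimes wrap their JSON in markdown fences or add stray text around it, so a defensive parse step helps before consuming an output like `resp2` downstream. Here is a minimal sketch (the `extract_json` helper is an assumption, not part of the original script):

```python
import json

def extract_json(text: str):
    # Best-effort parse of a JSON object from model output: strip optional
    # markdown fences, grab the outermost {...} span, then json.loads it.
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        # Drop an optional language tag such as "json" on the first line.
        cleaned = cleaned.split("\n", 1)[-1] if "\n" in cleaned else cleaned
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        return json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return None
```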
print("=" * 100)
print("STEP 7 — Mini benchmark")
print("=" * 100)
prompts = [
    "Explain tokenization in two lines.",
    "Give three use cases for local LLMs.",
    "What is one downside of small local models?",
    "Explain instruction tuning in one paragraph."
]
rows = []
for p in prompts:
    t0 = time.time()
    out = generate_text(p, max_new_tokens=140, temperature=0.3, do_sample=True)
    dt = time.time() - t0
    rows.append({
        "prompt": p,
        "latency_sec": round(dt, 2),
        "chars": len(out),
        "preview": out[:160].replace("\n", " ")
    })
df = pd.DataFrame(rows)
print(df)
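As an optional extension (a sketch, not part of the original script; `summarize_latency` is a hypothetical name), the per-prompt benchmark rows can be aggregated into simple latency statistics for quick comparison across runs:

```python
import statistics

def summarize_latency(rows):
    # Aggregate the per-prompt benchmark rows (each with a "latency_sec"
    # key, as built in the loop above) into simple summary statistics.
    latencies = [r["latency_sec"] for r in rows]
    return {
        "n": len(latencies),
        "mean_sec": round(statistics.mean(latencies), 2),
        "max_sec": max(latencies),
    }
```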
print("=" * 100)
print("STEP 8 — Deterministic summarization")
print("=" * 100)
long_text = """
In practical usage, teams often evaluate
trade-offs among local deployment cost, latency, privacy, controllability, and raw capability.
Smaller models can be easier to deploy, but they may struggle more on complex reasoning or domain-specific tasks.
"""
summary_prompt = f"""
Summarize the following in exactly 4 bullet points:
{long_text}
"""
summary = generate_text(summary_prompt, max_new_tokens=180, do_sample=False)
print(summary)
print("=" * 100)
print("STEP 9 — Save outputs")
print("=" * 100)
report = {
    "model_id": model_id,
    "device": str(model.device),
    "basic_generation": resp1,
    "structured_output": resp2,
    "chain_step_1": resp3,
    "chain_step_2": resp4,
    "summary": summary,
    "benchmark": rows,
}
with open("gemma3_1b_text_tutorial_report.json", "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2, ensure_ascii=False)
print("Saved gemma3_1b_text_tutorial_report.json")
print("Tutorial complete.")
We evaluate the model on a small benchmark of prompts to observe response behavior, latency, and output length in a compact experiment. We then run a deterministic summarization pass to see how the model behaves when randomness is removed. Finally, we save all key outputs to a report file, turning the notebook into a reusable experimental setup rather than just a temporary demo.
Finally, we have a complete text-generation pipeline that shows how Gemma 3 1B can be used for practical experiments and light prototyping in Colab. We generated direct responses, compared outputs across different prompt styles, measured simple latency behavior, and saved the results to a report file for later inspection. In doing so, we turned the notebook into more than a demo: we made it a reusable base for testing prompts, evaluating outputs, and integrating Gemma into larger workflows with confidence.
Check out the complete coding notebook here. Also, feel free to follow us on Twitter, and don't forget to join our 120k+ ML SubReddit and subscribe to our newsletter. Are you on Telegram? Now you can also connect with us there.