OpenAI has released a research preview of GPT-OSS-Safeguard, two open weight safety reasoning models that let developers enforce custom safety policies at inference time. The models come in two sizes, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. Both are fine-tuned from gpt-oss, both are licensed under Apache 2.0, and both are available on Hugging Face for local use.

Why does policy-conditioned safety matter?
Traditional moderation models are trained against a fixed policy. When that policy changes, the model must be retrained or replaced. gpt-oss-safeguard reverses this relationship: it takes the developer-written policy along with the user content as input, then reasons step by step about whether the content violates that policy. This turns safety classification into an inference-time reasoning task, better suited to rapidly changing or domain-specific harms such as fraud, biology, self-harm, or game-specific abuse.
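To make this concrete, here is a minimal sketch of a policy-conditioned classification call, assuming the model is served behind an OpenAI-compatible endpoint (for example via vLLM). The policy text, endpoint URL, and JSON label format below are illustrative assumptions, not part of OpenAI's documentation.

```python
# Minimal sketch: the policy travels in the prompt, so changing it is a string
# edit, not a retraining run. Assumes an OpenAI-compatible server (e.g. vLLM)
# hosting gpt-oss-safeguard; the policy wording and output schema are made up.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

POLICY = """\
Policy: in-game trade fraud
Violation (1): solicits off-platform payment for in-game items, asks for
account credentials, or promises items in exchange for gift cards.
Non-violation (0): ordinary trade offers, price discussion, jokes about scams.
Answer with JSON: {"violation": 0 or 1, "rationale": "<short reason>"}.
"""

content = "Selling my legendary skin, PayPal only, you send first and I trade after."

resp = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",
    messages=[
        {"role": "system", "content": POLICY},   # developer-written policy
        {"role": "user", "content": content},    # content under review
    ],
)
print(resp.choices[0].message.content)  # the model reasons over the policy, then labels
```

Swapping in a new taxonomy means editing the policy string and nothing else, which is the whole point of policy-conditioned classification.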
The same pattern as OpenAI's internal safety reasoner
OpenAI says that GPT-OSS-Safeguard is an open implementation of the safety reasoner used internally in systems such as GPT-5, ChatGPT Agent, and Sora 2. In production, OpenAI already runs small, high recall filters on all traffic, then escalates uncertain or sensitive items to a reasoning model, and in a recent launch up to 16 percent of total compute went to safety reasoning. The open release lets external teams reproduce this defense in depth instead of having to guess how OpenAI's stack works.
Model Size and Hardware Fit
The larger model, gpt-oss-safeguard-120b, has 117B parameters with 5.1B active parameters and is sized to fit on a single 80GB H100 class GPU. The smaller gpt-oss-safeguard-20b has 21B parameters with 3.6B active parameters and targets low latency or smaller GPUs, including 16GB setups. Both models were trained on the Harmony response format, so prompts must follow that structure or output quality will degrade. The license is Apache 2.0, the same as the original gpt-oss models, so commercial local deployment is allowed.
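As a rough sketch of local use, the snippet below loads the smaller checkpoint with Hugging Face transformers and relies on apply_chat_template to render messages into the chat format the model expects, rather than hand-building prompt strings. The model id and generation settings are assumptions to be checked against the model card.

```python
# Sketch of running gpt-oss-safeguard-20b locally with Hugging Face transformers.
# apply_chat_template renders the messages into the chat format the checkpoint
# was trained on, which is why hand-assembled prompt strings are risky.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-safeguard-20b"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # 20b targets 16GB-class GPUs per OpenAI
)

messages = [
    {"role": "system", "content": "Policy: flag content that provides self-harm instructions ..."},
    {"role": "user", "content": "Text to classify goes here."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```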

Evaluation results
OpenAI evaluated the models on internal multi-policy tests and on public datasets. In multi-policy accuracy, where the model must correctly enforce multiple policies simultaneously, GPT-OSS-Safeguard and OpenAI's internal safety reasoner outperform GPT-5-thinking and the GPT-OSS baselines. On the 2022 moderation dataset, the new models perform slightly better than both GPT-5-thinking and the internal safety reasoner, although OpenAI notes that this difference is not statistically significant, so it should not be overstated. On ToxicChat, the internal safety reasoner is still ahead, with gpt-oss-safeguard close behind. This puts the open models in competitive range for real moderation tasks.
Recommended deployment pattern
OpenAI is clear that pure reasoning is expensive on every request. The recommended setup is to run small, fast, high recall classifiers on all traffic, send only uncertain or sensitive content to gpt-oss-safeguard, and run the reasoner asynchronously when the user experience requires fast responses. This mirrors OpenAI's own production guidance and reflects the fact that dedicated task-specific classifiers can still win when a large, high quality labeled dataset is available.
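A minimal sketch of that routing logic follows, assuming a cheap first-pass classifier already exists; the function names and thresholds are placeholders, not an OpenAI API.

```python
# Layered moderation sketch: a cheap, high recall classifier screens everything,
# and only the uncertain band is escalated to gpt-oss-safeguard, ideally off the
# hot path. fast_toxicity_score and escalate_to_safeguard are placeholders.

def fast_toxicity_score(text: str) -> float:
    """Cheap first-pass classifier (e.g. a small fine-tuned encoder). Placeholder."""
    return 0.5

def escalate_to_safeguard(text: str, policy: str) -> None:
    """Enqueue an asynchronous policy-conditioned check with gpt-oss-safeguard."""
    print(f"queued for reasoning model: {text[:40]!r}")

def moderate(text: str, policy: str, low: float = 0.2, high: float = 0.9) -> str:
    score = fast_toxicity_score(text)
    if score >= high:
        return "block"                       # confidently violating: act immediately
    if score <= low:
        return "allow"                       # confidently fine: reasoner never runs
    escalate_to_safeguard(text, policy)      # uncertain band only, keeps reasoner cost low
    return "allow_pending_review"

print(moderate("borderline trade offer", "in-game fraud policy text"))
```

Keeping the reasoner off the synchronous path is what makes per-request chain-of-thought classification affordable at scale.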
Key takeaways
- GPT-OSS-Safeguard is a research preview of two open weight safety reasoning models, 120b and 20b, that classify content using developer-provided policies at inference time, so policy changes do not require retraining.
- The models implement the same safety reasoner pattern that OpenAI uses internally in GPT-5, ChatGPT Agent, and Sora 2, where fast first-pass filters route only risky or ambiguous content to the slower reasoning model.
- Both models are fine-tuned from gpt-oss, keep the Harmony response format, and are sized for real deployment: the 120b model fits on a single H100 class GPU, the 20b model targets 16GB level hardware, and both are available under Apache 2.0 on Hugging Face.
- On the internal multi-policy evaluation and the 2022 moderation dataset, the safeguard models outperform the GPT-5-thinking and GPT-OSS baselines, but OpenAI notes that the small margin over the internal safety reasoner is not statistically significant.
- OpenAI recommends using these models in a layered moderation pipeline together with community resources like ROOST, so platforms can express custom taxonomies, audit chains of thought, and update policies without touching the weights.
OpenAI is taking an internal safety pattern and making it reproducible, which is the most important part of this launch. The models are open weight, policy conditioned, and Apache 2.0 licensed, so platforms can finally apply their own taxonomies rather than accepting fixed labels. The fact that gpt-oss-safeguard matches and sometimes slightly exceeds the internal safety reasoner on the 2022 moderation dataset, even if that margin is not statistically significant, while outperforming GPT-5-thinking on multi-policy accuracy, shows that the approach is already usable. The recommended tiered deployment is realistic for production.

Michael Sutter is a data science professional and holds a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michael excels in transforming complex datasets into actionable insights.