autogen - 💡(How to fix) Fix Proposal: Semantic Intent Classification Safety Extension [2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
microsoft/autogen#7242Fetched 2026-04-08 00:40:07
View on GitHub
Comments
2
Participants
2
Timeline
4
Reactions
0
Timeline (top)
commented ×2closed ×1cross-referenced ×1

Code Example

from autogen import ConversableAgent
from autogen_safety import SafetyGuard, GovernancePolicy

policy = GovernancePolicy.load("policy.yaml")
guard = SafetyGuard(policy=policy)

agent = ConversableAgent(
    name="coder",
    system_message="You write Python code.",
)
guard.protect(agent)  # Wraps message handling with safety checks

# Now all agent actions are classified and policy-checked
# Dangerous actions (exfiltration, privilege escalation) are blocked
# All actions are logged to tamper-evident audit chain
RAW_BUFFERClick to expand / collapse

Proposal: Semantic Intent Classification Safety Extension for AutoGen

Problem

AutoGen enables powerful multi-agent conversations with flexible orchestration. However, as agents gain autonomy to execute code, call tools, and delegate tasks, there's a growing need for fine-grained action-level safety classification - understanding what an agent is trying to do before allowing it.

Current approaches (token limits, blocklists) are too coarse. Teams need:

  • Semantic intent classification - Classify each action into threat categories before execution
  • Trust-scored agent interactions - Track and decay trust across multi-turn conversations
  • Policy enforcement - Declarative governance policies with event hooks for violations
  • Tamper-evident audit trails - Cryptographic proof of what happened during execution

What we've built (Apache-2.0)

Agent-OS includes a production-grade semantic intent classifier:

  1. 9 threat categories - destructive, exfiltration, privilege_escalation, resource_abuse, persistence, lateral_movement, reconnaissance, social_engineering, benign
  2. No LLM dependency - Fast, deterministic classification using pattern matching and heuristics
  3. GovernancePolicy - YAML-based policies with blocked patterns (regex/glob), token limits, tool call limits
  4. Event hooks - on(POLICY_VIOLATION, callback) for real-time alerting
  5. Trust scoring - 5-dimension trust model with decay (via AgentMesh)

Proposed integration

An autogen-safety extension that hooks into AutoGen's message handling:

from autogen import ConversableAgent
from autogen_safety import SafetyGuard, GovernancePolicy

policy = GovernancePolicy.load("policy.yaml")
guard = SafetyGuard(policy=policy)

agent = ConversableAgent(
    name="coder",
    system_message="You write Python code.",
)
guard.protect(agent)  # Wraps message handling with safety checks

# Now all agent actions are classified and policy-checked
# Dangerous actions (exfiltration, privilege escalation) are blocked
# All actions are logged to tamper-evident audit chain

Why this matters for AutoGen

  • Enterprise readiness - Organizations need safety guarantees before deploying autonomous agents
  • Code execution risk - AutoGen's code executor is powerful but needs guardrails against malicious patterns
  • Composable - Works with existing AutoGen patterns (group chat, nested chat, tool use)
  • Deterministic - No LLM-in-the-loop for safety checks; fast and predictable
  • Standards-aligned - Implements CSA's Agentic Trust Framework zero-trust governance model

Ask

Is there interest in this kind of contribution? Options:

  1. Standalone autogen-safety package using AutoGen's extensibility hooks
  2. PR to AutoGen core adding optional safety middleware
  3. Example/cookbook showing the integration pattern

Happy to discuss the best approach with maintainers.

extent analysis

Fix Plan

To integrate the semantic intent classification safety extension into AutoGen, follow these steps:

  • Step 1: Install the autogen-safety package
    • Run pip install autogen-safety to install the package
  • Step 2: Create a Governance Policy
    • Create a policy.yaml file with the desired governance policies, including blocked patterns, token limits, and tool call limits
    • Example policy file:
      blocked_patterns:
        - "rm -rf *"
        - "sudo *"
      token_limits:
        max_tokens: 100
      tool_call_limits:
        max_calls: 5
  • Step 3: Initialize the Safety Guard
    • Import the SafetyGuard class from autogen_safety
    • Load the governance policy using GovernancePolicy.load("policy.yaml")
    • Initialize the SafetyGuard instance with the loaded policy
    • Example code:
      from autogen_safety import SafetyGuard, GovernancePolicy
      
      policy = GovernancePolicy.load("policy.yaml")
      guard = SafetyGuard(policy=policy)
  • Step 4: Protect the AutoGen Agent
    • Create an instance of the ConversableAgent class from autogen
    • Pass the agent instance to the protect method of the SafetyGuard instance
    • Example code:
      from autogen import ConversableAgent
      
      agent = ConversableAgent(
          name="coder",
          system_message="You write Python code.",
      )
      guard.protect(agent)
  • Step 5: Verify the Integration
    • Test the integration by sending messages to the protected agent
    • Verify that the safety checks are working as expected

Example Code

Here's an example of how the integration could look:

from autogen import ConversableAgent
from autogen_safety import SafetyGuard, GovernancePolicy

# Load the governance policy
policy = GovernancePolicy.load("policy.yaml")

# Initialize the Safety Guard
guard = SafetyGuard(policy=policy)

# Create an AutoGen agent
agent = ConversableAgent(
    name="coder",
    system_message="You write Python code.",
)

# Protect the agent with the Safety Guard
guard.protect(agent)

# Test the integration
agent.send_message("Write a Python script to delete all files")
# This should be blocked by the safety check

agent.send_message("Write a Python script to print Hello World")
# This should be allowed by the safety check

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING