Create Evaluation

POST /v1/evaluate

Creates a new evaluation job.

Authentication

  • Type: Bearer Token

Request Body

| Field | Type | Required | Description |
|---|---|---|---|
| `data` | string | Yes | Evaluation input, formatted as: `<CONTEXT>...</CONTEXT>\n<USER INPUT>...</USER INPUT>\n<MODEL OUTPUT>...</MODEL OUTPUT>` |
| `pass_criteria` | string | Yes | The criteria that determine whether the output passes. |
| `rubric` | string | Yes | The rubric or scoring guidelines for the evaluation. |
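Since the `data` field must follow the tagged format above, a small helper can assemble it. This is a hypothetical convenience function, not part of any SDK; it simply concatenates the three sections in the documented order:

```python
def build_eval_data(context: str, user_input: str, model_output: str) -> str:
    """Format the three evaluation sections into the `data` string the
    /v1/evaluate endpoint expects."""
    return (
        f"<CONTEXT>\n{context}\n</CONTEXT>\n"
        f"<USER INPUT>\n{user_input}\n</USER INPUT>\n"
        f"<MODEL OUTPUT>\n{model_output}\n</MODEL OUTPUT>"
    )

data = build_eval_data(
    "Acme Q2 earnings: $1.2B",
    "What were Acme's Q2 earnings?",
    "Acme reported $1.2 billion in Q2 earnings.",
)
```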

Example Request

```bash
curl -X POST https://sandbox-api.pegasi.ai/v1/evaluate \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "data": "<CONTEXT>\nAcme Q2 earnings: $1.2B\n</CONTEXT>\n<USER INPUT>\nWhat were Acme’s Q2 earnings?\n</USER INPUT>\n<MODEL OUTPUT>\nAcme reported $1.2 billion in Q2 earnings.\n</MODEL OUTPUT>",
    "pass_criteria": "Must mention the correct Q2 earnings figure.",
    "rubric": "The answer should be factually correct and reference the provided context."
  }'
```

Example (Python, OpenAI SDK style)

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PEGASI_API_KEY",
    base_url="https://sandbox-api.pegasi.ai/v1",
)

data = """<CONTEXT>
Acme Q2 earnings: $1.2B
</CONTEXT>

<USER INPUT>
What were Acme's Q2 earnings?
</USER INPUT>

<MODEL OUTPUT>
Acme reported $1.2 billion in Q2 earnings.
</MODEL OUTPUT>
"""

resp = client.evaluations.create(
    data=data,
    pass_criteria="Must mention the correct Q2 earnings figure.",
    rubric="The answer should be factually correct and reference the provided context.",
)
```

Example pass_criteria

The pass_criteria parameter defines the conditions that determine whether an evaluation passes or fails. This can range from simple to complex requirements.

Simple Criteria

Must mention the correct Q2 earnings figure of $1.2B.

Multi-point Criteria

- The answer must reference the exact earnings figure from the context ($1.2B).
- The answer must be directly responsive to the user's question.
- No additional, unsupported claims should be present.
- The response must use professional, clear language.

Conditional Criteria

The response passes if it:
1. Correctly states Acme's Q2 earnings as $1.2B
2. AND either:
   a) Provides context about how this compares to previous quarters, OR
   b) Explains what factors contributed to this result (if present in context)
3. AND does not include any factual errors
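Whatever their complexity, criteria are sent as a single string. A sketch of the conditional example above as a triple-quoted Python string, ready to pass as the `pass_criteria` argument:

```python
# Multi-line, conditional pass criteria are still just one string;
# a triple-quoted literal preserves the numbered structure.
pass_criteria = """The response passes if it:
1. Correctly states Acme's Q2 earnings as $1.2B
2. AND either:
   a) Provides context about how this compares to previous quarters, OR
   b) Explains what factors contributed to this result (if present in context)
3. AND does not include any factual errors"""
```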

Example rubric

The rubric parameter provides scoring guidelines for evaluating responses. A well-structured rubric helps ensure consistent evaluation.

Basic Rubric

- **Factual Accuracy (Required):** The answer must be factually correct and reference the provided context.
- **Relevance (Required):** The answer must directly address the user's question.
- **Completeness:** The answer should provide all relevant details from the context.
- **No Hallucination:** The answer should not invent facts not present in the context.

Detailed Scoring Rubric

Score responses on a scale of 1-5 for each criterion:

1. **Factual Accuracy** (40% weight)
   - 5: Perfect accuracy, with precise figures from context
   - 3: Mostly accurate with minor imprecisions
   - 1: Contains significant factual errors

2. **Relevance** (30% weight)
   - 5: Directly addresses the specific question asked
   - 3: Somewhat relevant but includes tangential information
   - 1: Fails to address the question

3. **Completeness** (20% weight)
   - 5: Includes all relevant information from context
   - 3: Includes most key points but misses some details
   - 1: Missing critical information

4. **Clarity** (10% weight)
   - 5: Clear, concise, well-structured response
   - 3: Somewhat clear but could be better organized
   - 1: Confusing or poorly structured
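To make the weighting concrete, here is a local illustration of how the four weights combine per-criterion scores (1-5) into an overall score. This is only a sketch; how the evaluation service actually aggregates rubric scores is not documented here:

```python
# Rubric weights from the detailed scoring rubric above.
WEIGHTS = {
    "factual_accuracy": 0.40,
    "relevance": 0.30,
    "completeness": 0.20,
    "clarity": 0.10,
}

def weighted_score(scores: dict) -> float:
    """Combine per-criterion scores (1-5) into a weighted overall score."""
    return sum(WEIGHTS[name] * score for name, score in scores.items())

overall = weighted_score({
    "factual_accuracy": 5,
    "relevance": 5,
    "completeness": 3,
    "clarity": 4,
})  # 0.4*5 + 0.3*5 + 0.2*3 + 0.1*4 = 4.5
```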

Domain-Specific Example: Financial Reports

For earnings report questions:

- **Numerical Accuracy:** Must state the exact figures as presented in context
- **Temporal Precision:** Must correctly identify the time period (Q2, 1H25, etc.)
- **Company Identification:** Must correctly attribute figures to the right entity
- **Contextual Comparison:** Should include YoY or QoQ comparisons if present in context
- **Jargon Usage:** Should use appropriate financial terminology

Create Slopsquatting Evaluation

POST /v1/evaluate/security/slopsquatting

Creates a security evaluation to detect hallucinated dependencies and slopsquatting vulnerabilities in AI-generated code.

**Security Alert:** Approximately 20% of AI-generated code contains hallucinated dependencies that don't exist in official repositories. Attackers are now registering these phantom packages on PyPI, npm, and other repositories to exploit organizations using AI coding assistants. This creates attack surfaces that bypass traditional security scanning tools.

Authentication

  • Type: Bearer Token

Implementation Options

1. MCP Integration

For code generation tools like Windsurf and Cursor, use the Model Context Protocol (MCP). Example MCP configuration:

```json
{
  "mcp_version": "1.0",
  "components": [
    {
      "name": "pegasi_slopsquatting_detection",
      "type": "security_check",
      "endpoint": "https://sandbox-api.pegasi.ai/v1/evaluate/security/slopsquatting",
      "trigger": "on_code_generation",
      "settings": {
        "sensitivity": "high",
        "inline_feedback": true,
        "block_high_risk": true
      },
      "auth": {
        "type": "bearer",
        "env_var": "PEGASI_API_KEY"
      }
    }
  ]
}
```

2. Direct API Call

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PEGASI_API_KEY",
    base_url="https://sandbox-api.pegasi.ai/v1",
)

response = client.evaluations.create(
    type="security/slopsquatting",
    content={
        "code": "import tenserflow as tf",  # misspelled "tensorflow"
        "language": "python",
    },
    settings={"sensitivity": "high"},
)
```

For complete documentation and additional integration options, see Supply Chain Security Evaluations.

Request Body

| Field | Type | Required | Description |
|---|---|---|---|
| `content.code` | string | Yes | The code snippet to analyze for slopsquatting vulnerabilities |
| `content.language` | string | Yes | Programming language of the code (e.g., "python", "javascript", "java") |
| `content.context` | string | No | Additional context about the code's purpose |
| `settings.sensitivity` | string | No | Detection sensitivity: "low", "medium", "high" (default: "medium") |
| `settings.check_imports` | boolean | No | Whether to check import statements (default: true) |
| `settings.check_package_versions` | boolean | No | Whether to check package versions (default: true) |
| `settings.known_packages_db` | string | No | Package database to use: "standard", "extended", "custom" (default: "standard") |

Example Response

```json
{
  "id": "eval_sec_7a9b3c2d1e",
  "created": 1686571085,
  "results": {
    "issues": [
      {
        "type": "slopsquatting",
        "package": "tenserflow",
        "suggestion": "tensorflow",
        "confidence": 0.98,
        "risk_level": "high",
        "description": "Package 'tenserflow' appears to be a typosquat of 'tensorflow', a popular machine learning library",
        "line": 4,
        "column": 8
      },
      {
        "type": "slopsquatting",
        "package": "matpotlib",
        "suggestion": "matplotlib",
        "confidence": 0.96,
        "risk_level": "high",
        "description": "Package 'matpotlib' appears to be a typosquat of 'matplotlib', a popular plotting library",
        "line": 5,
        "column": 8
      }
    ],
    "summary": {
      "total_issues": 2,
      "high_risk": 2,
      "medium_risk": 0,
      "low_risk": 0
    }
  }
}
```
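A response like the one above can gate a build or CI pipeline on high-risk findings. This is a hypothetical downstream sketch, not part of the API: the field names follow the example response, and `gate` plus its blocking policy are illustrative only:

```python
def gate(response: dict) -> int:
    """Print each high-risk slopsquatting issue from a parsed evaluation
    response and return a nonzero exit code if any were found."""
    issues = response["results"]["issues"]
    high_risk = [i for i in issues if i["risk_level"] == "high"]
    for issue in high_risk:
        print(
            f"{issue['type']}: '{issue['package']}' looks like a typosquat "
            f"of '{issue['suggestion']}' (confidence {issue['confidence']:.0%})"
        )
    return 1 if high_risk else 0
```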

For more detailed information about slopsquatting detection and MCP integration for IDE tools, see Supply Chain Security Evaluations.