数据自动生成

数据自动生成
- 原生支持 JSON Output 的大模型
- 自定义结构解析

原生支持 JSON Output 的大模型

\

结构化输出支持

不同的模型供应商结构化输出参数名字与参数赋值内容不同
- openai response_format
- ollama format

json schema 生成

from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str
    email: str

print(User.model_json_schema())

{
  "title": "User",
  "type": "object",
  "properties": {
    "id": { "type": "integer" },
    "name": { "type": "string" },
    "email": { "type": "string" }
  },
  "required": ["id", "name", "email"]
}

通过 Dify 测试 ChatGPT Json 输出

{
  "messages": [
    {
      "content": "name: seveniruby",
      "role": "user"
    }
  ],
  "model": "gpt-4o-mini",
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "schema": {
        "additionalProperties": false,
        "properties": {
          "id": {
            "anyOf": [
              {
                "type": "integer"
              },
              {
                "type": "null"
              }
            ],
            "title": "Id"
          },
          "name": {
            "anyOf": [
              {
                "type": "string"
              },
              {
                "type": "null"
              }
            ],
            "title": "Name"
          },
          "email": {
            "anyOf": [
              {
                "type": "string"
              },
              {
                "type": "null"
              }
            ],
            "title": "Email"
          }
        },
        "required": ["id", "name", "email"],
        "title": "User",
        "type": "object"
      },
      "name": "User",
      "strict": true
    }
  },
  "stream": false
}

通过 Dify 测试 Ollama 的 Json 输出

{
  "model": "qwen2.5",
  "stream": false,
  "options": {},
  "format": {
    "additionalProperties": false,
    "properties": {
      "id": {
        "anyOf": [
          {
            "type": "integer"
          },
          {
            "type": "null"
          }
        ],
        "title": "Id"
      },
      "name": {
        "anyOf": [
          {
            "type": "string"
          },
          {
            "type": "null"
          }
        ],
        "title": "Name"
      },
      "email": {
        "anyOf": [
          {
            "type": "string"
          },
          {
            "type": "null"
          }
        ],
        "title": "Email"
      }
    },
    "required": ["id", "name", "email"],
    "title": "User",
    "type": "object"
  },
  "messages": [
    {
      "role": "user",
      "content": "name: seveniruby"
    }
  ],
  "tools": []
}

自定义结构解析

结构化输出提示词

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"additionalProperties": false, "properties": {"id": {"anyOf": [{"type": "integer"}, {"type": "null"}], "title": "Id"}, "name": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Name"}, "email": {"anyOf": [{"type": "string"}, {"type": "null"}], "title": "Email"}}, "required": ["id", "name", "email"]}
```

ChatGPT 示例

{
    "messages": [
        {
            "content": "Answer the user query. Wrap the output in `json` tags\nThe output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema {\"properties\": {\"foo\": {\"title\": \"Foo\", \"description\": \"a list of strings\", \"type\": \"array\", \"items\": {\"type\": \"string\"}}}, \"required\": [\"foo\"]}\nthe object {\"foo\": [\"bar\", \"baz\"]} is a well-formatted instance of the schema. The object {\"properties\": {\"foo\": [\"bar\", \"baz\"]}} is not well-formatted.\n\nHere is the output schema:\n```\n{\"$defs\": {\"Person\": {\"description\": \"Information about a person.\", \"properties\": {\"name\": {\"description\": \"The name of the person\", \"title\": \"Name\", \"type\": \"string\"}, \"height_in_meters\": {\"description\": \"The height of the person expressed in meters.\", \"title\": \"Height In Meters\", \"type\": \"number\"}}, \"required\": [\"name\", \"height_in_meters\"], \"title\": \"Person\", \"type\": \"object\"}}, \"description\": \"Identifying information about all people in a text.\", \"properties\": {\"people\": {\"items\": {\"$ref\": \"#/$defs/Person\"}, \"title\": \"People\", \"type\": \"array\"}}, \"required\": [\"people\"]}\n```",
            "role": "system"
        },
        {
            "content": "Anna is 23 years old and she is 6 feet tall",
            "role": "user"
        }
    ],
    "model": "gpt-4o-mini",
    "stream": false
}

{
    "id": "chatcmpl-BVgFtvgREDfHdIHBJQR7064lglSjD",
    "object": "chat.completion",
    "created": 1746890297,
    "model": "gpt-4o-mini-2024-07-18",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "```json\n{\n  \"people\": [\n    {\n      \"name\": \"Anna\",\n      \"height_in_meters\": 1.8288\n    }\n  ]\n}\n```",
                "refusal": null,
                "annotations": []
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 313,
        "completion_tokens": 38,
        "total_tokens": 351,
        "prompt_tokens_details": {
            "cached_tokens": 0,
            "audio_tokens": 0
        },
        "completion_tokens_details": {
            "reasoning_tokens": 0,
            "audio_tokens": 0,
            "accepted_prediction_tokens": 0,
            "rejected_prediction_tokens": 0
        }
    },
    "service_tier": "default",
    "system_fingerprint": "fp_dbaca60df0"
}

qwen3 结构化输出示例

xml 输出提示词

The output should be formatted as a XML file.

1. Output should conform to the tags below.
2. If tags are not given, make them on your own.
3. Remember to always open and close all the tags.

As an example, for the tags ["foo", "bar", "baz"]:

1. String "<foo>
   <bar>
   <baz></baz>
   </bar>
   </foo>" is a well-formatted instance of the schema.
2. String "<foo>
   <bar>
   </foo>" is a badly-formatted instance.
3. String "<foo>
   <tag>
   </tag>
   </foo>" is a badly-formatted instance.

Here are the output tags:

```
['movies', 'actor', 'film', 'name', 'genre']
```

输出 xml 结构示例

{
    "messages": [
        {
            "content": "Generate the shortened filmography for Tom Hanks.\nThe output should be formatted as a XML file.\n1. Output should conform to the tags below.\n2. If tags are not given, make them on your own.\n3. Remember to always open and close all the tags.\n\nAs an example, for the tags [\"foo\", \"bar\", \"baz\"]:\n1. String \"<foo>\n   <bar>\n      <baz></baz>\n   </bar>\n</foo>\" is a well-formatted instance of the schema.\n2. String \"<foo>\n   <bar>\n   </foo>\" is a badly-formatted instance.\n3. String \"<foo>\n   <tag>\n   </tag>\n</foo>\" is a badly-formatted instance.\n\nHere are the output tags:\n```\n['movies', 'actor', 'film', 'name', 'genre']\n```",
            "role": "user"
        }
    ],
    "model": "gpt-4o-mini",
    "stream": false
}

{
    "id": "chatcmpl-BViHxTEYe3XXgU4wQGdk25yyijZMa",
    "object": "chat.completion",
    "created": 1746898113,
    "model": "gpt-4o-mini-2024-07-18",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "```xml\n<movies>\n   <actor>\n      <name>Tom Hanks</name>\n      <film>\n         <title>Forrest Gump</title>\n         <genre>Drama</genre>\n      </film>\n      <film>\n         <title>Savin Private Ryan</title>\n         <genre>War</genre>\n      </film>\n      <film>\n         <title>Cast Away</title>\n         <genre>Drama</genre>\n      </film>\n      <film>\n         <title>The Green Mile</title>\n         <genre>Drama</genre>\n      </film>\n      <film>\n         <title>Big</title>\n         <genre>Comedy</genre>\n      </film>\n   </actor>\n</movies>\n```",
                "refusal": null,
                "annotations": []
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 181,
        "completion_tokens": 156,
        "total_tokens": 337,
        "prompt_tokens_details": {
            "cached_tokens": 0,
            "audio_tokens": 0
        },
        "completion_tokens_details": {
            "reasoning_tokens": 0,
            "audio_tokens": 0,
            "accepted_prediction_tokens": 0,
            "rejected_prediction_tokens": 0
        }
    },
    "service_tier": "default",
    "system_fingerprint": "fp_0392822090"
}

结构化输出提示词生成

def test_schema():
    openai_json_schema = {
        'schema': User.model_json_schema(),
        'name': User.__name__,
        "strict": True
    }
    print()
    print(json.dumps(openai_json_schema, indent=2))

    print()
    print(PydanticOutputParser(pydantic_object=User).get_format_instructions())

    xml_parser = XMLOutputParser(tags=["movies", "actor", "film", "name", "genre"])
    print()
    print(xml_parser.get_format_instructions())

任意格式输出

把任意数据转成 json 格式
借助于提示词转换为任意其他文本格式
借助于工具（function Calling）转化为二进制格式

数据自动生成
- 原生支持 JSON Output 的大模型
- 自定义结构解析