kafm' blog

Notes from the InternLM (书生大模型) formula recognition leaderboard competition

2026-03-05T06:43:54.000Z

Competition notes

This post records parts of my experience in a competition run by the InternLM community.

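About the competition

This competition is a community event of the sixth InternLM hands-on camp (书生大模型实战营), organized by Shanghai AI Lab. Participants use MetaX (沐曦) compute on the d.run platform and apply methods such as fine-tuning a VLM to recognize input formula images and output the corresponding LaTeX text. Only the InternVL3.5-1B model is allowed.

A similar competition was held before: the fifth camp ran a paper classification leaderboard competition, in which an LLM was fine-tuned to classify paper abstracts by discipline.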

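Task

Specifically, every input is an image of a LaTeX formula rendered with TeX Live, and the output should be the corresponding LaTeX source.
For example, for the input:

[formula image]

the expected output is: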

\sum_{i=1}^{\infty} \frac{1}{i^2} = \frac{\pi^2}{6} \quad \text{and} \quad \left\| \mathbf{A} \right\| = \sqrt{\lambda_{\max}(\mathbf{A}^T\mathbf{A})} \quad \text{where} \quad \mathbf{A} = \begin{bmatrix} \int_{0}^{1} x^2 dx & \frac{1}{2} \\ 2 & \int_{0}^{2} e^{-x} dx \end{bmatrix}

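Evaluation

Hash match rate: the share of samples for which the image generated from the model output has exactly the same hash as the reference image.
Similarity match rate: the share of samples whose image similarity to the reference image (a weighted combination of histogram similarity, SSIM, MSE, and feature-point similarity) exceeds a threshold.
Final composite score: a weighted average of the two rates above, intended to reflect overall model performance.

Hint: the test images may have been augmented.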
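The scoring described above can be sketched roughly as follows. This is only an illustration, not the organizers' scoring script: the function name `composite_score`, the 0.5/0.5 weights, and the 0.9 similarity threshold are assumptions made up for the example.

```python
def composite_score(hash_equal, similarities, sim_threshold=0.9,
                    w_hash=0.5, w_sim=0.5):
    """Combine the two success rates into one weighted score.

    hash_equal   -- list of bools, True when the rendered image hash
                    equals the reference image hash
    similarities -- list of floats, one combined similarity per sample
    The threshold and weights are placeholders, not the real values.
    """
    if not hash_equal or len(hash_equal) != len(similarities):
        raise ValueError("need two non-empty lists of equal length")
    n = len(hash_equal)
    # Share of samples with an exact hash match.
    hash_rate = sum(hash_equal) / n
    # Share of samples whose similarity is above the threshold.
    sim_rate = sum(s > sim_threshold for s in similarities) / n
    # Weighted average of the two rates.
    return (w_hash * hash_rate + w_sim * sim_rate) / (w_hash + w_sim)
```

With equal weights, a submission that matches hashes on half of the samples and clears the similarity threshold on three quarters of them would land between the two rates.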


LLM structured output: principles, implementation approaches, and practice

2026-02-03T02:43:54.000Z

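Why structured output matters

An LLM is a probabilistic model whose output cannot be fully controlled; structured output makes that output more controllable.
With this controllability, an LLM gains a programmatic interface and can be connected reliably to upstream and downstream programs, for example to call tools.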

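Principles and implementation approaches

Formatted output is usually achieved in one of the following ways:

Prompt engineering
Steer the model toward formatted output with prompting techniques such as the instruction "output JSON only" or few-shot examples.
Post-processing
Libraries such as instruct post-process the model output, extracting or repairing it. This includes parsing the output and feeding errors back to the model for regeneration.
Constrained decoding
When choosing a token, a mask directly excludes candidate tokens that would violate the output format, so tokens are selected from a restricted output space.
For {"key": value}, constrained decoding can guarantee reliable formatted output at every level, from letting the model predict only the value part of the format string to constraining the format or type of value.
Going further, constrained decoding driven by an equivalent state machine can produce arbitrary output formats. Constrained decoding is the main engineering technique behind structured output.
SFT / RL
Post-training shapes an output preference so that the model follows a given pattern. It is highly reliable, and it is the theoretical underpinning and the source of capability for structured output.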
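The masking step of constrained decoding can be illustrated with a toy example. This is a minimal sketch over a four-token vocabulary, not how any particular inference engine implements it; `constrained_argmax`, the token set, and the scores are all invented for illustration.

```python
import math


def constrained_argmax(logits, allowed):
    """Pick the highest-scoring token among the allowed ones.

    logits  -- dict mapping a token string to its raw score
    allowed -- set of tokens that keep the output inside the format
    """
    # Mask: tokens outside the allowed set get a score of -inf.
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token in the vocabulary")
    return best


# Toy step: the prefix so far is '{"yes_or_no": "' and the format
# only permits "yes" or "no" at this position.
logits = {"maybe": 3.1, "yes": 2.4, "no": 1.7, "}": 0.2}
choice = constrained_argmax(logits, allowed={"yes", "no"})
```

Without the mask the model would emit "maybe", its top-scoring token; with the mask it has to choose within the permitted set. A state machine over the target grammar would supply a different `allowed` set at each position.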

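Many model vendors now strengthen formatted-output ability through SFT/RL and implement constrained decoding on the inference side (Ollama, for example), and in this way offer reliable structured-output APIs.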

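Tool calling and structured output: chicken or egg?

Only with structured output can we handle tool calls reliably.
Yet if a model can call a given tool, any output format can be wrapped as a tool: the LLM fills in the call arguments, and the required format is produced.
The two appear to enable each other, which makes this a chicken-and-egg problem.
Many vendors emphasize tool use / function calling but say little about structured output. Perhaps tool use is the core capability behind structured output, while structured output combines many engineering techniques to deliver reliability.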
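The idea of wrapping an output format as a tool can be sketched as follows. The dictionary layout follows the OpenAI-style function tool format; the helper names `schema_as_tool` and `parse_tool_arguments`, the tool name `answer_format`, and the sample arguments string are hypothetical and exist only for this illustration.

```python
import json


def schema_as_tool(name, description, json_schema):
    """Wrap a JSON Schema as an OpenAI-style function tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": json_schema,
        },
    }


def parse_tool_arguments(arguments_json, required):
    """Decode a tool call's arguments string and check required keys."""
    args = json.loads(arguments_json)
    missing = [key for key in required if key not in args]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return args


# The desired output format, expressed as a JSON Schema.
answer_schema = {
    "type": "object",
    "properties": {
        "yes_or_no": {"type": "string", "enum": ["yes", "no"]},
        "explanation": {"type": "string"},
    },
    "required": ["yes_or_no", "explanation"],
}
tool = schema_as_tool("answer_format", "Return the final answer.", answer_schema)

# A tool call carries its arguments as a JSON string; this one is made up.
result = parse_tool_arguments(
    '{"yes_or_no": "yes", "explanation": "ok"}', answer_schema["required"]
)
```

The model never has to "know" about structured output here: it only fills in tool arguments, and the caller reads the structured result out of them.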

Structured output with local models in langchain

The discussion below is based on the following langchain package versions (as of 2026-02-02):

langchain-core            1.2.7
langchain-openai          1.1.7
langchain-ollama          1.0.1

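Using the JSON format of the YesNoJudge class, I tested structured-output behavior with the different output methods of with_structured_output() (the method parameter).
The method parameter accepts json_schema, function_calling, and json_mode. json_mode cannot guarantee JSON in a fixed format and requires post-processing, so it is not recommended. (Source: https://platform.openai.com/docs/guides/structured-outputs#json-mode)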

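from typing import Literal

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class YesNoJudge(BaseModel):
    """return judge result from the context with this function."""

    yes_or_no: Literal["yes", "no"] = Field(
        description="The answer should be either 'yes' or 'no'"
    )
    explanation: str


# model_name and fast_judge_config are not defined in this excerpt.
judge = ChatOpenAI(
    model=model_name, temperature=0.6, streaming=True, **fast_judge_config
)
# include_raw=True returns the raw message alongside the parsed result.
judge = judge.with_structured_output(
    schema=YesNoJudge, method="function_calling", strict=True, include_raw=True
)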

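Figure: with the `json_schema` method and a prompt asking for JSON output, gpt-oss:20b still calls a nonexistent tool to produce the formatted output.
Figure: with the `function_calling` method and a prompt asking for JSON output, qwen3:4b writes the JSON in the message content and does not call the tool.
Figure: with the `function_calling` method and the JSON prompt removed, qwen3:4b still tends to follow the instruction and writes the result in the content without calling the tool.
Figure: with the `function_calling` method and the instruction-style prompt removed, qwen3:4b sometimes calls the tool to return the result.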

This suggests that Qwen models give more weight to instruction following, while GPT leans toward using tools.

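A brief look at how with_structured_output() is implemented in langchain

Each backend integration package implements with_structured_output() differently.
For langchain-openai:

function_calling passes the JSON schema in as a tool and sets the tool_choice parameter, forcing the LLM to use the tool and emit output that matches the tool-call schema.
In effect, this turns tool-calling ability into structured-output ability.
json_schema, by contrast, parses the output content; this parameter ...
json_mode should be paired with a prompt and depends entirely on the model.
The response_format option of the create_agent API likewise calls the LLM backend's structured-output capability directly.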
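Because json_mode leaves the exact shape to the model, some post-processing is needed before the result can be trusted. The following is a minimal sketch of such a check for the two fields of YesNoJudge. It uses only the standard library; `parse_yes_no` is a hypothetical helper, not part of langchain.

```python
import json


def parse_yes_no(raw_text):
    """Extract and validate a yes/no JSON object from model text.

    Returns (parsed_dict, None) on success or (None, error_message) on
    failure, so the caller can feed the error back and retry.
    """
    # Take the outermost {...} span; models often wrap JSON in prose.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return None, "no JSON object found"
    try:
        data = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    # Validate the two expected fields.
    if data.get("yes_or_no") not in ("yes", "no"):
        return None, "yes_or_no must be 'yes' or 'no'"
    if not isinstance(data.get("explanation"), str):
        return None, "explanation must be a string"
    return data, None
```

A caller would typically loop a few times: invoke the model, run the check, and on failure append the error message to the prompt before trying again.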

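Pitfalls with structured output using langchain + langchain-openai + an Ollama backend

Why langchain-openai rather than langchain-ollama? Because I assumed OpenAI API compatibility would work better.
On top of that, langchain-ollama + Ollama kept returning 502 errors, while openai and curl both worked fine.

Compatible in appearance, incompatible in practice
For example, the with_structured_output() implementations of the chat_models in langchain[openai, ollama] are inconsistent.
And when Ollama serves as the backend for langchain-openai, the behavior of with_structured_output depends heavily on the model, so the method parameter has to be tuned per model.
Supporting multiple backends still requires adaptation work in practice.
Multiple implementations, uneven results
For example, with_structured_output() on the chat_models in langchain[openai, ollama] and response_format in from langchain.agents import create_agent both serve structured output, yet their reliability differs.
Over-encapsulation
The source code is full of # type: ignore comments that silence type-mismatch warnings.
Passing the messages of MessageState into the messages of AgentState triggers a type-mismatch hint, which is a little funny.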

The geese and ducks in Longtan Park live a carefree life

2025-12-21T12:43:54.000Z

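The geese and ducks in Longtan Park sure live a carefree life.

Most of the pond's surface is frozen. Walking a full circle around the water, I found people feeding the birds at every bank where geese and ducks gather: some with steamed buns, some with feed I could not name, and one guy in glasses had even brought a big bunch of grapes.
The bun crumbs float on the water, and a duck drifts over, stretches its neck, and swallows them in one bite, while the grapes all sink.
Once a duck stood on the ice, grapes came rolling down around it. Surely it could eat them now!
But the grapes are round, and this dark-green-headed duck, with its narrow flat bill, kept picking them up and dropping them; only a very few grapes fulfilled their mission.
The guy in glasses did not mind at all. On the ice or in the water, eaten or not, reachable or not, none of it mattered: feeding is about the process.

Further on there was another large flock of ducks, and an auntie carrying a big bag of steamed buns. These ducks, too, only snatched a bite or two between paddles, but swimming to the bank in formation still gave the auntie plenty of face.
A bystander remarked that someone had already fed them first thing in the morning!
No wonder the ducks ate so half-heartedly: they were just having a taste to see whether it suited them.

In Longtan Park, the geese and ducks sure live a carefree life.

December 21, 2025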

Notes from the InternLM paper classification fine-tuning competition

2025-08-06T06:43:54.000Z

Preface

https://aicarrier.feishu.cn/wiki/Gr7Iw6vhTiniMUkBIPvcfBiAnkg

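A bit of a pity: I could not grab a GPU later on. I felt I had found the right way to pile on data but never scaled it up, and in the end I slid from around 10th place to 28th.

I looked at the report and code from one of the top-ranked experts: very heavy on AI, with both the code and the report written by AI.

We lovers of old-fashioned hand-written code still lost to AI 🤡

No, I have not lost yet. I just could not get the cards and never scaled up.

I entered the competition with these tenets: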

Random is a strong baseline.

DL is data-driven ML.

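Fine-tuning strategy

After a quick survey of fine-tuning methods, I came back to LoRA; since the LLM is quantized, it should count as QLoRA.

I adjusted the batch size and gradient accumulation and left the learning rate and LoRA parameters alone. For LoRA's rank and alpha, it may be worth trying values that are multiples of each other.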
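One way to explore that rank/alpha relationship is a small grid in which alpha is a fixed multiple of the rank. In LoRA the update is scaled by alpha / r, so keeping the multiple fixed keeps the scaling fixed while the rank changes. The values below are arbitrary examples, not settings used in this competition, and `lora_grid` is a hypothetical helper.

```python
def lora_grid(ranks=(8, 16, 32), multiples=(1, 2)):
    """List (rank, alpha, scaling) combinations with alpha = multiple * rank."""
    grid = []
    for r in ranks:
        for m in multiples:
            alpha = m * r
            # LoRA scales its low-rank update by alpha / r.
            grid.append({"r": r, "lora_alpha": alpha, "scaling": alpha / r})
    return grid


for cfg in lora_grid():
    print(cfg)
```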

But I think that for this competition, data is what wins.

I went straight to SFT without any pretraining (even though the tutorial teaches pretraining…

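Data processing

One category name, math.MP, needs correcting (in the data it appears as math-ph), or math-ph has to be mapped back to it.

Richer templates

I generated templates with DeepSeek and ChatGPT and ended up with 136 dialogue templates and 22 system instructions.
I tried changing the order of the options and the question and adding Chinese names to the options, but the score dropped, so I went back to templates in the original format.
If I had set up a test split, I would certainly have tried many template variations, such as shuffling the option order.
(Convergence might need some thought, though it should be fine for LoRA.)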

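Scaling up

The Kaggle dataset is quite recent and differs little from arXiv itself, so the gain from crawling arXiv seemed negligible; a tail category like cs.OS apparently gained only a few dozen records from the last two or three months.
After finding the right direction (a certain number of simple templates plus correct paper extraction and sample generation), I gradually tried 1k, 2k, 3k, and 6k samples per category.
I did not oversample the tail categories either: a small model may overfit, and an imbalance of less than 10x is not a severe long tail. Besides, all tail-category data was already included, so the model was effectively memorizing the answers.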

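Data filtering

Single category => primary category => filter out cross-category samples
At first, for data quality, I kept only single-category samples, but there were too few; later I widened this to samples whose primary category is the target category.
Later, on inspecting the data, I found that some categories overlap heavily. cs.MM, for example, is inherently a research area that overlaps heavily with CV/AI, and it has a limited number of samples. I tried removing cross-category samples and saw no obvious difference in training.
Finally I saw that the competition document had been updated...

So I filtered out all cross-category samples with peace of mind.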
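The three filtering stages can be sketched as predicates over the categories field. This sketch assumes that field is a space-separated string with the primary category listed first, as in the Kaggle arXiv metadata snapshot. The function names and the four-label `TARGETS` subset are made up for illustration, and unlike the script at the end of this post the sketch matches whole category tokens rather than substrings.

```python
# Illustrative subset of the 26 competition labels.
TARGETS = {"cs.OS", "cs.MM", "cs.CV", "cs.AI"}


def single_category(categories, target):
    """Stage 1: the paper lists exactly one category, the target."""
    return categories.split() == [target]


def primary_category(categories, target):
    """Stage 2: the first listed (primary) category is the target."""
    cats = categories.split()
    return bool(cats) and cats[0] == target


def no_cross_target(categories, target, targets=TARGETS):
    """Stage 3: the target is listed and no other target label is."""
    cats = set(categories.split())
    return target in cats and not (cats & (targets - {target}))


print(primary_category("cs.MM cs.CV", "cs.MM"))
print(no_cross_target("cs.MM cs.CV", "cs.MM"))
```

Stage 3 still accepts papers cross-listed into categories outside the label set, which is why it keeps more samples than stage 1.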

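Shortcomings

I neither considered nor tried full-parameter training (would full fine-tuning necessarily beat LoRA on this task? I doubt it).
I did not widen the scope of the LoRA parameters, for example by adding o_proj.
I did not hold on to compute, so I never scaled up.

Data processing code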

import os
import json
import random

random.seed(42)

output_base_dir = 'internlm/dataset'
os.makedirs(output_base_dir, exist_ok=True)

input_templates = [
    "Based on the title '{title}', authors '{authors}', and abstract '{abstract}', please determine the scientific category of this paper.",

    "Classification Request: Given the title '{title}', authored by '{authors}', and abstract '{abstract}', identify the research field of this paper.",

    "Field Determination: Analyze the title '{title}', authors '{authors}', and abstract '{abstract}' to assign a discipline category.",

    "Academic Categorization: Based on '{title}' (authors: '{authors}') and abstract content '{abstract}', classify this paper into a scientific domain.",

    "Domain Assignment: Using the title '{title}', author list {authors}, and abstract text '{abstract}', determine the most relevant academic field.",

    "Research Area Identification: From the paper titled '{title}' (by {authors}) and abstract '{abstract}', infer its primary research area.",

    "Paper Taxonomy: Categorize the paper with title '{title}', authors {authors}, and abstract '{abstract}' into a specific scientific discipline.",

    "Subject Labeling: With the metadata: Title '{title}', Authors {authors}, Abstract '{abstract}', generate a subject classification.",

    "Knowledge Domain Inference: Based on '{title}' (by {authors}) and abstract '{abstract}', predict the broad field of study.",

    "Scientific Field Prediction: Analyze the title '{title}', authors {authors}, and abstract '{abstract}' to output a single discipline label.",

    "Multi-Metadata Classification: Integrate the paper\'s title '{title}', author affiliations '{authors}', and abstract '{abstract}' to assign a research category.",
    
    "分类请求：根据标题“{title}”、作者“{authors}”和摘要“{abstract}”，请确定该论文的研究领域。",

    "领域判定：结合标题“{title}”、作者{authors}及摘要内容“{abstract}”，判断此论文所属学科类别。",

    "学术分类：基于论文标题“{title}”（作者：{authors}）和摘要“{abstract}”，将其划分到具体的科学领域。",

    "学科标注：根据标题“{title}”、作者列表{authors}和摘要文本“{abstract}”，确定最相关的学术领域。",

    "研究方向识别：从标题为“{title}”（作者{authors}）及摘要“{abstract}”中推断其主要研究方向。",

    "文献归类：将标题“{title}”、作者{authors}、摘要“{abstract}”的论文归类至特定学科门类。",

    "主题分类：根据元数据：标题“{title}”、作者{authors}、摘要“{abstract}”，生成一个学科分类标签。",

    "知识领域推断：基于标题“{title}”（作者{authors}）及摘要“{abstract}”，预测其所属广泛研究领域。",

    "科学领域预测：分析标题“{title}”、作者{authors}和摘要“{abstract}”，输出单一学科标签。",

    "多维度分类：综合论文标题“{title}”、作者信息{authors}和摘要“{abstract}”，划分研究类别。",

    "Label the research domain of this paper by analyzing:\nTitle: {title}\nAuthors: {authors}\nKey findings: {abstract}",
    
    "Q: Which academic field does the paper '{title}' by {authors} belong to, given its abstract: '{abstract}'?\nA: The field is:",
    
    "This paper [{title}] authored by {authors} primarily focuses on ______ (fill in the field), as evidenced by the abstract: '{abstract}'.",
    
    "Reviewer Task: Based on the title '{title}', author affiliations {authors}, and abstract summary '{abstract}', assign a discipline category from the taxonomy codes.",
    
    "If the paper '{title}' by {authors} were a book in a library, which section would it shelve in? Abstract clues: '{abstract}'.",

    "Step 1: Extract keywords from '{title}' and abstract: '{abstract}'.\nStep 2: Cross-reference with author '{authors}' expertise.\nStep 3: Output the dominant field.",

    "Compare these metadata to classify the paper:\nTitle focus: {title}\nAuthor expertise: {authors}\nAbstract emphasis: {abstract}\nConclusion: The paper belongs to _____ field.",

    "Can you accurately categorize {title} by {authors} just from this abstract? Prove it: '{abstract}'.",

    "Inputs:\nMetadata: Title={title}, Authors={authors}\nContent: Abstract={abstract}\nProcessing: Apply field codes.\nOutput: Field=?",

    "The DNA of this paper ({title} by {authors}) reveals its academic species. Abstract strand: '{abstract}'. Species identification:",

    "Research Area Identification: From the paper titled '{title}' (by {authors}) and abstract '{abstract}', infer its primary research area.",

    "Paper Taxonomy: Categorize the paper with title '{title}', authors {authors}, and abstract '{abstract}' into a specific scientific discipline.",

    "Subject Labeling: With the metadata: Title '{title}', Authors {authors}, Abstract '{abstract}', generate a subject classification.",

    "Knowledge Domain Inference: Based on '{title}' (by {authors}) and abstract '{abstract}', predict the broad field of study.",

    "Discipline Prediction: Analyze the abstract '{abstract}' of the paper '{title}' authored by {authors} and suggest the academic domain.",

    "Field Classification Task: Use the title '{title}', authors {authors}, and abstract '{abstract}' to assign a research category.",

    "Scientific Area Determination: Given the information — Title: '{title}', Authors: {authors}, Abstract: '{abstract}' — identify the scientific domain.",

    "Area Tagging: From the context of the paper '{title}' and its abstract '{abstract}', assign a field label.",

    "Disciplinary Mapping: With the title '{title}', the author(s) {authors}, and the abstract '{abstract}', map this paper to a discipline.",

    "Research Field Suggestion: Based on the content in the title '{title}' and abstract '{abstract}', recommend the research field.",

    "Topic Classification: Classify the following paper by title '{title}', authors {authors}, and abstract '{abstract}'.",

    "Academic Field Categorization: Given the title '{title}' and abstract '{abstract}', determine which academic field this paper falls into.",

    "Scientific Discipline Inference: Determine the scientific discipline of the paper titled '{title}' (authors: {authors}) based on the abstract '{abstract}'.",

    "Field Assignment Task: Use the provided paper metadata to assign the appropriate research area. Title: '{title}', Authors: {authors}, Abstract: '{abstract}'.",

    "Content-Based Field Classification: Determine the field of study using the paper's title '{title}', authors {authors}, and abstract '{abstract}'.",

    "Scholarly Classification Prompt: Use the paper title '{title}', author list {authors}, and abstract '{abstract}' to classify the research area.",

    "Discipline Deduction: From the title '{title}', author list {authors},  and abstract '{abstract}', deduce the primary academic discipline.",

    "Study Area Determination: Determine the core area of study of the paper titled '{title}' authored by {authors} from the abstract '{abstract}'.",

    "Category Prediction Task: Predict the research category using the paper title '{title}' and abstract '{abstract}'.",

    "Field Analysis Instruction: Based on metadata (title: '{title}', authors: {authors}, abstract: '{abstract}'), identify the study field.",

    "**分类请求：**根据标题“{title}”、作者“{authors}”和摘要“{abstract}”，请确定该论文的研究领域。",

    "**领域判定：**结合标题“{title}”、作者{authors}及摘要内容“{abstract}”，判断此论文所属学科类别。",

    "**学术分类：**基于论文标题“{title}”（作者：{authors}）和摘要“{abstract}”，将其划分到具体的科学领域。",

    "**主题标签生成：**请依据论文的标题“{title}”、作者“{authors}”及摘要“{abstract}”，为其生成对应的学科标签。",

    "**领域识别任务：**请根据以下论文信息（标题：“{title}”，作者：{authors}，摘要：“{abstract}”）识别其研究领域。",

    "**学科归类请求：**请将题为“{title}”、作者为{authors}的论文，基于摘要“{abstract}”进行学科归类。",

    "**研究领域预测：**请根据论文摘要“{abstract}”内容，预测标题为“{title}”的论文的研究领域。",

    "**论文领域自动识别：**输入信息包括标题“{title}”、作者{authors}、摘要“{abstract}”，请自动判断其学科领域。",

    "**学术方向分类任务：**请根据以下论文元数据，判断其研究方向。标题：{title}，作者：{authors}，摘要：{abstract}。",

    "**科学领域分类：**根据论文题目“{title}”和作者“{authors}”、摘要“{abstract}”，将其归类到相应的科学领域。",

    "**领域推理任务：**利用标题“{title}”、作者“{authors}”及摘要“{abstract}”对论文进行研究方向推理。",

    "**领域划分：**请根据“{title}”和“{abstract}”信息，作者为“{authors}”，判断其归属的学术领域。",

    "**分类辅助：**请依据标题“{title}”和作者{authors}的摘要“{abstract}”内容，推荐一个合适的研究分类。",

    "**领域归属分析：**根据论文内容判断其属于哪个研究领域。信息如下：标题：{title}；作者：{authors}；摘要：{abstract}。",

    "**学科方向识别：**请根据摘要“{abstract}”和标题“{title}”，作者是“{authors}”，识别该论文的学科方向。",

    "**论文归类任务：**依据论文元数据“{title}”、“{authors}”、“{abstract}”，请将其归类为某一学科类别。",

    "Which academic field does this paper belong to? Based on its title '{title}', authors {authors}, and abstract '{abstract}', determine the most suitable classification.",

    "Assign a scientific category to the paper below, using its metadata: Title: '{title}', Authors: {authors}, Abstract: '{abstract}'.",

    "Summarize the domain of study that best fits the research described in '{title}' by {authors}. Consider the abstract: '{abstract}'.",

    "Field estimation challenge: Based on the content of this scholarly work (Title: '{title}', by {authors}. Abstract: '{abstract}'), which field is it most aligned with?",

    "Discipline tagging assistant: Help identify the most relevant field for the paper titled '{title}' by {authors}, summarized as: '{abstract}'.",

    "Knowledge scope detection: Use the following metadata to detect the academic scope: Title - '{title}'; Authors - {authors}; Abstract - '{abstract}'.",

    "Contextual paper classification: Examine the title and abstract provided, and place the research in an appropriate scientific taxonomy.",

    "Suggest a domain label for the paper titled '{title}' with abstract '{abstract}'. Focus on broad scientific or technical fields.",

    "Research domain detection: This paper (title: '{title}'; abstract: '{abstract}') was written by {authors}. What is its academic category?",

    "Infer the scholarly classification from the semantic cues in the abstract '{abstract}', title '{title}', and authorship {authors}.",

    "**请问这篇论文属于哪个研究领域？**以下是其基本信息：标题“{title}”，作者{authors}，摘要“{abstract}”。",
    
    "**基于内容的领域分类：**请分析论文标题“{title}”和摘要“{abstract}”，判断其所属的科学门类。",
    
    "请对以下论文信息进行分类，包括标题“{title}”、作者{authors}和摘要“{abstract}”。",
    
    "**根据语义内容判断类别：**请从摘要“{abstract}”和标题“{title}”中提取关键信息，为论文分配一个学术领域。",
    
    "**帮我标注该论文的研究方向：**信息如下：{title}，作者：{authors}，摘要内容：“{abstract}”。",
    
    "**该研究更偏向哪个学科？**结合论文标题与摘要信息，请给出一个合理的分类建议。",
    
    "**从专业角度判断：**基于论文“{title}”与其研究摘要“{abstract}”，其应属于哪个专业领域？",
    
    "**请推荐一个学术标签，**用于表示这篇由{authors}撰写、标题为“{title}”的论文所属领域。",
    
    "**摘要分析分类：**请从该摘要“{abstract}”推测研究方向，并结合论文标题“{title}”做出归属判断。",

    "**内容归类任务提示：**请使用该论文的元数据（{title}、{authors}、{abstract}）对其进行领域标签的生成。",

    "Classify this paper into a research field. Title: '{title}', Authors: ({authors}), Abstract: '{abstract}'.",

    "Given: title '{title}', authors '{authors}', abstract '{abstract}'. Determine the academic domain.",

    "Use the abstract to assign a research category. Title: '{title}', Authors: '{authors}',  Abstract: '{abstract}'.",

    "Input: '{title}' by '{authors}'. Abstract: '{abstract}'. Output: scientific field.",

    "From the title and abstract, categorize this paper. Title: '{title}'. Abstract: '{abstract}', Authors: ({authors}).",

    "Can you help me figure out what field this paper belongs to? Here's the info: title '{title}', authors {authors}, abstract '{abstract}'.",

    "I\'m trying to organize some papers. What category should this one go into? Title: '{title}', Authors: {authors}, Abstract: '{abstract}'.",

    "I read this paper, but I'm unsure about its domain. Can you classify it? Title: '{title}', Abstract: '{abstract}', Authors: '{authors}'.",

    "Which research area would you assign to this work based on its abstract and title? Title: '{title}', Authors: '{authors}',  Abstract: '{abstract}'.",

    "You are an academic journal editor. Based on the title '{title}', authors {authors}, and abstract '{abstract}', assign this paper to a suitable discipline.",

    "As a librarian building a research taxonomy, determine the subject area for the paper: '{title}' by {authors} and abstract: '{abstract}'.",

    "Act as a scientific reviewer. Categorize this manuscript by domain using: Title: '{title}', Abstract: '{abstract}', Author List: '{authors}'.",

    "From the abstract '{abstract}' and title '{title}', (authors {authors}), what can you infer about the research domain of the paper?",

    "What clues in the abstract '{abstract}' and title '{title}', (authors {authors}) suggest the field of study?",

    "Analyze the keywords and topics in '{abstract}' and classify accordingly. And title '{title}', (authors {authors}).",

    "[System] Input received. Paper Title: '{title}', Authors: {authors}, Abstract: '{abstract}'. Proceed to classify by domain.",

    "[AI_Tagger] Please assign subject label based on: Title = '{title}', Abstract = '{abstract}', Author List: '{authors}'.",

    "[MetadataAnalyzer] Classify this entry using embedded text: '{abstract}' (title: '{title}'), (authors {authors}).",

    "Title = '{title}', Abstract = '{abstract}', Author List: '{authors}'. This paper was submitted for classification. Use the metadata to determine the category.",

    "Title = '{title}', Abstract = '{abstract}', Author List: '{authors}'. Generate a domain label based on the core ideas from the abstract and title provided.",

    "请根据标题“{title}”、作者{authors}和摘要“{abstract}”，对该论文进行学科分类。",

    "任务：对以下论文分类。标题：{title}；摘要：{abstract}； 作者列表“{authors}”。",

    "输入元信息：标题“{title}”，摘要“{abstract}”，作者列表“{authors}”。输出：研究领域。",

    "Title = '{title}' Author List: '{authors}', Abstract = '{abstract}',. 分类需求：根据论文摘要和标题内容，为其指定一个研究类别。",

    "给出以下论文信息，请判断所属学科门类。 Title = '{title}', Abstract = '{abstract}', Author List: '{authors}'.",

    "请问这篇文章属于哪个领域？标题是“{title}”，摘要如下：“{abstract}”。作者列表“{authors}”。",

    "我正在整理文献，不确定这篇论文的研究方向。你能帮我分类吗？信息如下。标题是“{title}”，摘要如下：“{abstract}”。作者列表“{authors}”。",

    "根据摘要“{abstract}”的内容，这篇题为“{title}” （作者列表“{authors}”）的论文应该归属哪个研究领域？",

    "我不太确定这篇文章的学科归属，可以请你判断一下吗？标题是“{title}”，摘要如下：“{abstract}”。作者列表“{authors}”。",

    "你是一位资深学术期刊编辑，请根据标题“{title}”、作者{authors}、摘要“{abstract}”为其确定研究方向。",

    "作为图书馆分类员，你需要为这篇论文分配一个学科分类。标题“{title}”、作者{authors}、摘要“{abstract}”。",

    "请模拟审稿人角色，为该论文选择一个最合适的研究领域。标题“{title}”、作者{authors}、摘要“{abstract}”。",

    "标题“{title}”、作者{authors}、摘要“{abstract}”。 请模拟审稿人角色，为该论文选择一个最合适的研究领域。",

    "从摘要“{abstract}”中的关键词判断，该论文属于哪一类学科？额外的信息：标题“{title}”、作者{authors}。",

    "从研究目标和方法出发，请为该论文做出领域归属判断。标题“{title}”、作者{authors}、摘要“{abstract}”。",

    "通过标题“{title}”及其对应的研究内容“{abstract}”，推断其最可能的研究方向。作者列表：“{authors}”。",

    "[系统请求] 输入论文元信息：标题“{title}”、作者{authors}、摘要“{abstract}”。请进行自动分类。",

    "[分类助手] 请为该论文分配一个领域标签。标题“{title}”、作者{authors}、摘要“{abstract}”。",

    "[AI 分类引擎] 任务输入：{title}，摘要：{abstract}。请输出所属学科。作者列表：“{authors}”。",

    "如果你只读了以下论文摘要“{abstract}”和标题“{title}”，（作者列表你可能不关心：“{authors}”）你会认为它属于哪个领域？",

    "假设你是一个“论文归类机器人”，你的任务是为这篇论文打上一个准确的学科标签。标题“{title}”、作者{authors}、摘要“{abstract}”。",

    "[System Instruction] Paper classification task initiated. Input: title '{title}', authors {authors}, abstract '{abstract}'. Please assign an appropriate research domain label.",

    "[MetadataClassifier::Invoke] -> Analyze the paper with metadata {title}, {authors}, and {abstract}. Output: scientific discipline.",

    "[Task: ResearchFieldDetection] Paper metadata received. Begin classification using the abstract and title.\n> Title: '{title}'\n> Authors: {authors}\n> Abstract: '{abstract}'",

    "[CLASSIFY_PAPER] Inputs:\n- TITLE = '{title}'\n- AUTHORS = {authors}\n- ABSTRACT = '{abstract}'\n→ RETURN: FIELD_LABEL",

    "[System Input] A new research paper has been submitted. Please determine the academic category based on:\n• Title: '{title}'\n• Authors: {authors}\n• Abstract: '{abstract}'",

    "【系统指令】已接收到论文元数据。请根据标题“{title}”、作者{authors}和摘要“{abstract}”，判定所属学科领域。",

    "【研究领域分类模块】接收到一篇新论文，请根据摘要与标题内容进行自动归类。\n→ 论文信息：{title}，{authors}，{abstract}",

    "[调用接口：学科分类] 参数如下：标题：{title}作者：{authors}摘要：{abstract}→ 返回值：学术领域标签"

]

options = "A. quant-ph\nB. physics.chem-ph\nC. physics.atom-ph\nD. cond-mat.soft\nE. cs.RO\nF. cs.CL\nG. cs.SE\nH. cs.IR\nI. hep-th\nJ. hep-ph\nK. physics.optics\nL. cs.AI\nM. cs.CV\nN. nucl-th\nO. astro-ph\nP. math.PR\nQ. cs.OS\nR. eess.SP\nS. math.OC\nT. math.DS\nU. math.DG\nV. math.MP\nW. cs.MM\nX. stat.ME\nY. math.CO\nZ. cs.NE"
# options = "A. 量子物理 quant-ph\nB. 化学物理 physics.chem-ph\nC. 原子物理 physics.atom-ph\nD. 软凝聚态物理 cond-mat.soft\nE. 机器人学 cs.RO\nF. 计算语言学 cs.CL\nG. 软件工程 cs.SE\nH. 信息检索 cs.IR\nI. 高能理论物理 hep-th\nJ. 高能现象学 hep-ph\nK. 光学 physics.optics\nL. 人工智能 cs.AI\nM. 计算机视觉 cs.CV\nN. 核理论 nucl-th\nO. 天体物理 astro-ph\nP. 概率论 math.PR\nQ. 操作系统 cs.OS\nR. 信号处理 eess.SP\nS. 最优化与控制 math.OC\nT. 动力系统 math.DS\nU. 微分几何 math.DG\nV. 数学物理 math.MP\nW. 多媒体 cs.MM\nX. 统计方法 stat.ME\nY. 组合数学 math.CO\nZ. 神经与进化计算 cs.NE"

# author_template = []

instruction_templates = [
    "You are an AI academic librarian trained to classify research papers with 99% accuracy.",
    "[SYSTEM ROLE] Domain Classification Officer\n Mission: Categorize the paper",
    "你是个优秀的论文分类师",
    "As a meta-reviewer AI, you must:\n1. Identify 4 key terms from title of the paper\n2. Cross-check with authors publication history\n3. Map abstract to the most related subfields",
    "By academic protocol GPT-2025, you are required to\n1. Disclose uncertainty if abstract is ambiguous\n2. Prioritize author-specified keywords in title\n3. Identify the most related subfields",
    "Task: Teach a graduate student how to classify title.\nSteps:\na) Highlight disciplinary cues in abstract\nb) Explain why authors affiliations suggest _____ field\nc) Conclude with the option [A-Z Arxiv field code]",
    "[AI CLASSIFIER v3.1 INPUT]\nTitle: title\nAuthors: authors\nAbstract: abstract\nPROCESSING...\nOUTPUT: [A-Z Arxiv code]",
    "As an ethical AI classifier, you MUST:\nAvoid overgeneralization (e.g., 'Engineering' is too broad)\nCite classification rationale from abstract\nExample output: [Arxiv field code]",
    "[URGENT PEER REVIEW REQUEST] Deadline: 10s to classify title (authors) for conference track assignment. Abstract snapshot: abstract. Respond ONLY with track option from provided list.",
    "你是一名学术档案管理员，需根据《Arxiv图书馆分类法》根据题目、作者和摘要内容对论文进行精准分类。并输出Arxiv分类代码",
    "[系统指令] 国家自然科学基金委AI评审员  任务：依据标题、作者及摘要，从申请代码A-Z中选择最匹配的子领域",
    "作为学术审计AI，你必须：\n① 从摘要提取方法论关键词\n② 核对authors在Scopus的研究主题\n③ 对照 A-Z 的《学科分类与代码》\n最终输出分类代码：",
    "根据《AI科研分类规范》2024版：\n标题中的'研究'/'分析'等词不得作为分类依据\n需明确摘要中的3处领域特征\n输出包含 A-Z 的分类代码",
    "假设你是一个“论文归类机器人”，你的任务是为这篇论文打上一个准确的学科标签。",
    "[分类助手] 请为该论文分配一个领域标签。",
    "从研究目标和方法出发，请为该论文做出领域归属判断。",
    "从摘要和题目中的关键词判断，该论文属于哪一类学科？",
    "作为图书馆分类员，你需要从摘要和题目中的关键词判断为这篇论文分配一个学科分类。",
    "请模拟审稿人角色，为该论文选择一个最合适的研究领域，可以从摘要和题目进行判断。",
    "作为论文资深读者，你可以通过论文元信息判断所属学科门类。",
    "This paper was submitted for classification. Use the metadata to determine the category.",
    "Generate a domain label based on the core ideas from the abstract and title provided.",
]

option_map = {"A": "quant-ph", "B": "physics.chem-ph", "C": "physics.atom-ph", "D": "cond-mat.soft", "E": "cs.RO", "F": "cs.CL",
            "G": "cs.SE", "H": "cs.IR", "I": "hep-th", "J": "hep-ph", "K": "physics.optics", "L": "cs.AI", "M": "cs.CV", "N": "nucl-th",
            "O": "astro-ph", "P": "math.PR", "Q": "cs.OS", "R": "eess.SP", "S": "math.OC", "T": "math.DS", "U": "math.DG", "V": "math.MP",
            "W": "cs.MM", "X": "stat.ME", "Y": "math.CO", "Z": "cs.NE"}
get_options = dict(zip(option_map.values(), option_map.keys()))

other_option_map = {}
for category in get_options.keys():
    other_categories = set(option_map.values())
    other_categories.remove(category)
    other_option_map[category] = other_categories

def preprocess_arxiv_json(input_jsonl_file, output_jsonl_file):
    """
    Preprocess the arXiv JSONL file: extract 'title', 'authors', 'abstract'
    and 'categories', build SFT samples, and write one JSONL file per
    category under output_base_dir.

    Args:
        input_jsonl_file (str): Path to the input JSONL file.
        output_jsonl_file (str): Currently unused; output goes to output_base_dir.
    """
    papers = dict(zip(option_map.values(), [list() for _ in option_map.values()]))
    with open(input_jsonl_file, 'r', encoding='utf-8') as f:
        for line in f:
            item = json.loads(line)
            title = item.get('title', '')
            authors: str = item.get('authors', '')
            abstract: str = item.get('abstract', '')
            categories: str = item.get('categories', '')
            # if categories.startswith(category):
            for category in papers.keys():
                if category in categories and not any(c in categories for c in other_option_map[category]): # skip samples that also belong to another target category
                    instruction = random.choice(instruction_templates)
                    input_text = random.choice(input_templates).format(title=json.dumps(title), authors=json.dumps(authors), abstract=json.dumps(abstract))
                    # if random.randint(0, 1) == 0:
                    #     input_text = input_text + '\n\n' + options
                    # else:
                    #     input_text = options + '\n\n' + input_text
                    input_text = input_text + '\n\n' + options
                    output = get_options[category]
                    papers[category].append({"instruction": instruction, "input": input_text, "output": output})
                    break

    for category in papers.keys():
        cnt = 0
        output_file = os.path.join(output_base_dir, f"{category}.jsonl")
        with open(output_file, 'w', encoding='utf-8') as out_f:
            for item in papers[category]:
                out_f.write(json.dumps(item, ensure_ascii=False))
                out_f.write('\n')
                out_f.flush()
                cnt += 1
        print(f'{category}: {cnt}')

def fix_category(input_jsonl_file, output_jsonl_file, category, repeat_to=0, judge_rule=lambda x, y: x.startswith(y), open_mode='w', exclude_multi=True):
    """Rebuild the samples of one category, repeating passes until repeat_to samples exist."""
    # arXiv stores this category under the alias math-ph; the competition label is math.MP.
    label = 'math.MP' if category == 'math-ph' else category

    cnt = 0
    if open_mode != 'w' and os.path.exists(output_jsonl_file):
        with open(output_jsonl_file, 'r', encoding='utf-8') as out_f:
            cnt = len(out_f.readlines())

    def _fix_category(mode):
        nonlocal cnt

        # Supplement categories Q, V, and W
        with open(input_jsonl_file, 'r', encoding='utf-8') as f, open(output_jsonl_file, mode, encoding='utf-8') as out_f:
            data = []
            for line in f:
                item = json.loads(line)
                title = item.get('title', '')
                authors: str = item.get('authors', '')
                abstract: str = item.get('abstract', '')
                categories: str = item.get('categories', '')
                # Look up by label: other_option_map has no 'math-ph' key.
                if exclude_multi and any(c in categories for c in other_option_map[label]): # skip samples that also belong to another target category
                    continue
                # if categories.startswith(category):
                if judge_rule(categories, category):
                    instruction = random.choice(instruction_templates)
                    input_text = random.choice(input_templates).format(title=json.dumps(title), authors=json.dumps(authors), abstract=json.dumps(abstract))
                    input_text = input_text + '\n\n' + options
                    output = get_options[label]
                    item = json.dumps({"instruction": instruction, "input": input_text, "output": output}, ensure_ascii=False)
                    data.append(item)
            for item in data:
                out_f.write(item)
                out_f.write('\n')
                cnt += 1
        return len(data)

    added = _fix_category(open_mode)
    # Later passes always append; reopening with 'w' would discard earlier samples.
    # Stop if a pass adds nothing, otherwise the loop would never end.
    while added and cnt < repeat_to:
        added = _fix_category('a')
    print(f'after fix, {category}: {cnt}')
    

def cnt_in_filename(basedir: str):
    for fname in os.listdir(basedir):
        if fname.endswith('jsonl'):
            num = 0
            # Read from basedir, the directory that was listed, not the global output_base_dir.
            with open(os.path.join(basedir, fname), 'r', encoding='utf-8') as f:
                num = len(f.readlines())
            # os.rename(os.path.join(basedir, fname), os.path.join(basedir, f"{fname.removesuffix('.jsonl')}_{num}.jsonl"))
            os.rename(os.path.join(basedir, fname), os.path.join(basedir, f"{fname.replace('.jsonl.jsonl', '.jsonl')}"))

def gather(basedir: str, sample_num_class):
    cnt = 0
    # The per-category files are written as UTF-8, so read and write them as UTF-8 too.
    with open('arxiv_20k_rich.jsonl', 'w', encoding='utf-8') as out_f:
        for fname in os.listdir(basedir):
            if fname.endswith('jsonl'):
                data = []
                with open(os.path.join(basedir, fname), 'r', encoding='utf-8') as f:
                    data = f.readlines()
                # Cap each category at sample_num_class samples.
                data = random.sample(data, min(sample_num_class, len(data)))
                for line in data:
                    out_f.write(line)
                out_f.flush()
                print(f'{fname}: {len(data)}')
                cnt += len(data)
    print(f'total {cnt}')
            

if __name__ == "__main__":

    print(f"""
        Number of system prompt templates: {len(instruction_templates)}
        Number of user prompt templates: {len(input_templates)}
          """)

    arxiv_json_file = 'd:/data/arxiv-metadata-oai-snapshot.json'
    # arxiv_json_file = './test.jsonl'
    output_json_file = './arxiv_sftdata.jsonl'

    # Group single-category papers
    # preprocess_arxiv_json(arxiv_json_file, output_json_file)

    """
    quant-ph: 75119
    physics.chem-ph: 5999
    physics.atom-ph: 6848
    cond-mat.soft: 14530
    cs.RO: 15943
    cs.CL: 32125
    cs.SE: 10743
    cs.IR: 5137
    hep-th: 60558
    hep-ph: 83572
    physics.optics: 17736
    cs.AI: 12987
    cs.CV: 74045
    nucl-th: 19846
    astro-ph: 86911
    math.PR: 25289
    cs.OS: 347          issue: scarce; cs.OS only (347); cs.OS contains (565); new primary (1060); total (2174)
    eess.SP: 11193
    math.OC: 19764
    math.DS: 14277
    math.DG: 17389
    math.MP: 0          issue: stored under the alias math-ph; many cross-listed topics ()
    cs.MM: 1087         issue: many cross-listed topics (cs.CV, cs.AI)               # consider picking few pre-2018 papers to avoid cross-listing  primary (32848)
    stat.ME: 12315
    math.CO: 32513
    cs.NE: 2904         issue: few samples; many cross-listed topics (cs.CV, cs.AI)
    """

    # Handle the special categories
    output_Q_json_file = os.path.join(output_base_dir, "cs.OS.jsonl")
    output_V_json_file = os.path.join(output_base_dir, "math.MP.jsonl")
    output_W_json_file = os.path.join(output_base_dir, "cs.MM.jsonl")
    # fix_category(arxiv_json_file, output_Q_json_file, 'cs.OS', judge_rule=lambda categories, category: category in categories, open_mode='w')                     # Q 1114
    # fix_category('arxiv_cs.OS_1120.jsonl', output_Q_json_file, 'cs.OS', judge_rule=lambda categories, category: True, open_mode='a', exclude_multi=True)            # Q 2174    contains duplicates
    # fix_category(arxiv_json_file, output_V_json_file, 'math-ph', judge_rule=lambda categories, category: categories.startswith(category), open_mode='w')          # V 32848
    # fix_category(arxiv_json_file, output_W_json_file, 'cs.MM', judge_rule=lambda categories, category: categories.startswith(category), open_mode='w')            # W 2469 cross-listed samples kept
    # fix_category('arxiv_cs.MM_5012.jsonl', output_W_json_file, 'cs.MM', judge_rule=lambda categories, category: True, open_mode='a', exclude_multi=True)            # W 1726 cross-listed samples removed; not used

    # Gather
    gather(basedir=output_base_dir, sample_num_class=20000)

    # Count
    # cnt_in_filename(basedir=output_base_dir)