<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>kafm&#39; blog</title>
  
  <subtitle>kafm&#39; blog</subtitle>
  <link href="https://www.kafm.eu.org/atom.xml" rel="self"/>
  
  <link href="https://www.kafm.eu.org/"/>
  <updated>2026-04-06T06:27:03.000Z</updated>
  <id>https://www.kafm.eu.org/</id>
  
  <author>
    <name>kafm</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>Notes on the Intern (书生) LLM Formula Recognition Leaderboard Competition</title>
    <link href="https://www.kafm.eu.org/record/LLM/%E4%B9%A6%E7%94%9F%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%85%AC%E5%BC%8F%E8%AF%86%E5%88%AB%E6%89%93%E6%A6%9C%E8%B5%9B%E5%8F%82%E8%B5%9B%E8%AE%B0%E5%BD%95/"/>
    <id>https://www.kafm.eu.org/record/LLM/%E4%B9%A6%E7%94%9F%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%85%AC%E5%BC%8F%E8%AF%86%E5%88%AB%E6%89%93%E6%A6%9C%E8%B5%9B%E5%8F%82%E8%B5%9B%E8%AE%B0%E5%BD%95/</id>
    <published>2026-03-05T06:43:54.000Z</published>
    <updated>2026-04-06T06:27:03.000Z</updated>
    
    <content type="html"><![CDATA[<h1 id="参赛记录"><a class="markdownIt-Anchor" href="#参赛记录"></a> Competition Notes</h1><p>This post records my experience taking part in a competition run by the Intern (书生) LLM community.</p><h2 id="比赛简介"><a class="markdownIt-Anchor" href="#比赛简介"></a> About the Competition</h2><p>This <a class="link"   href="https://aicarrier.feishu.cn/wiki/JSA8wmd7liiQ72kBf8dcLnDxnHb" >competition<i class="fas fa-external-link-alt"></i></a> is a community event of the Intern LLM hands-on camp (6th session) organized by Shanghai AI Lab. It runs on MetaX (沐曦) compute on the <a class="link"   href="https://d.run/" >d.run<i class="fas fa-external-link-alt"></i></a> platform. The task is to recognize input formula images and output the corresponding LaTeX text, by fine-tuning a VLM or by other means. Only the InternVL3.5-1B model may be used.</p><p>A similar competition was held before: the 5th session of the camp ran a paper-classification leaderboard competition, in which an LLM was fine-tuned to classify paper abstracts by discipline.</p><p><strong>Task</strong></p><p>Specifically, every input is an image of a LaTeX formula rendered with TeX Live, and the output should be the corresponding LaTeX source.<br />For example, given the input:<br /><img lazyload data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/LaTex示例图片.png" alt="Example LaTeX formula image" style="clear:both;display:block;" width="70%"><br />the expected output is:</p><figure class="highlight latex"><table><tr><td class="code"><pre><span class="line"><span class="keyword">\sum</span><span class="built_in">_</span>&#123;i=1&#125;<span class="built_in">^</span>&#123;<span class="keyword">\infty</span>&#125; <span class="keyword">\frac</span>&#123;1&#125;&#123;i<span class="built_in">^</span>2&#125; = <span class="keyword">\frac</span>&#123;<span class="keyword">\pi</span><span class="built_in">^</span>2&#125;&#123;6&#125; <span class="keyword">\quad</span> <span class="keyword">\text</span>&#123;and&#125; <span class="keyword">\quad</span> <span class="keyword">\left</span><span class="keyword">\|</span> <span class="keyword">\mathbf</span>&#123;A&#125; <span class="keyword">\right</span><span class="keyword">\|</span> = <span class="keyword">\sqrt</span>&#123;<span class="keyword">\lambda</span><span class="built_in">_</span>&#123;<span class="keyword">\max</span>&#125;(<span 
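class="keyword">\mathbf</span>&#123;A&#125;<span class="built_in">^</span>T<span class="keyword">\mathbf</span>&#123;A&#125;)&#125; <span class="keyword">\quad</span> <span class="keyword">\text</span>&#123;where&#125; <span class="keyword">\quad</span> <span class="keyword">\mathbf</span>&#123;A&#125; = <span class="keyword">\begin</span>&#123;bmatrix&#125; <span class="keyword">\int</span><span class="built_in">_</span>&#123;0&#125;<span class="built_in">^</span>&#123;1&#125; x<span class="built_in">^</span>2 dx <span class="built_in">&amp;</span> <span class="keyword">\frac</span>&#123;1&#125;&#123;2&#125; <span class="keyword">\\</span> 2 <span class="built_in">&amp;</span> <span class="keyword">\int</span><span class="built_in">_</span>&#123;0&#125;<span class="built_in">^</span>&#123;2&#125; e<span class="built_in">^</span>&#123;-x&#125; dx <span class="keyword">\end</span>&#123;bmatrix&#125;</span><br></pre></td></tr></table></figure><p><strong>Evaluation</strong></p><blockquote><p>Hash-match success rate: the fraction of samples whose model-generated image has <strong>exactly the same</strong> hash as the reference image.<br />Similarity success rate: the fraction of samples whose <strong>image similarity</strong> to the reference image (a weighted combination of histogram similarity, SSIM, MSE and keypoint similarity) exceeds a threshold.<br />Final overall score: the <strong>weighted average</strong> of the two success rates above, reflecting the model's overall performance.</p></blockquote><p>Note: the test images may have been augmented.</p><details>   <summary>Click to expand details</summary><blockquote><p>This competition may involve some basic image augmentation transforms, including but not limited to:</p><ol><li>Geometric and spatial transforms</li></ol><ul><li>Rotation</li><li>Perspective transform</li><li>Affine transform (including scaling and shear)</li><li>Lens distortion</li><li>Canvas expansion and border trimming</li></ul><ol start="2"><li>Color and lighting adjustments</li></ol><ul><li>Color jitter (including brightness and contrast)</li><li>RGB shift</li><li>Color temperature adjustment</li><li>Gamma correction</li><li>Channel dropout</li></ul><ol start="3"><li>Noise</li></ol><ul><li>Gaussian noise</li><li>Salt-and-pepper noise</li><li>Poisson noise</li><li>Speckle noise</li></ul><ol start="4"><li>Blur and quality degradation</li></ol><ul><li>Gaussian blur</li><li>Motion blur</li><li>JPEG compression artifacts</li></ul></blockquote></details>&emsp;<p>The evaluation data comes in two versions, leaderboard A and leaderboard B, analogous to a validation set and a test set; neither set of samples is public.</p><p><strong>Resources</strong></p><ul><li><p>Code: two LoRA SFT baselines are provided, one 
<a class="link"   href="https://swift.readthedocs.io/zh-cn/v3.12/Instruction/Command-line-parameters.html" >ms-swift<i class="fas fa-external-link-alt"></i></a>-based <a class="link"   href="https://aicarrier.feishu.cn/wiki/JBiWwLG4cishC6kJ4LucAfbonoc" >demo<i class="fas fa-external-link-alt"></i></a> and one <a class="link"   href="https://github.com/InternLM/xtuner" >XTuner<i class="fas fa-external-link-alt"></i></a>-based <a class="link"   href="https://aicarrier.feishu.cn/wiki/C4Z5wnvD8iOQPhk3aZhcWVRhndh" >demo<i class="fas fa-external-link-alt"></i></a>; both include the evaluation framework and fine-tuning code.<br />In my test, the first demo scored 56.50 on leaderboard A and 48.60 on leaderboard B. (That score alone is actually enough to land in the top 60.)</p></li><li><p>Data: a training set of 3,000 image-text pairs is provided; both demos above use it.</p></li><li><p><strong>Prize money</strong>: 1st, 2nd, 3rd, 4th to 20th and 21st to 60th place receive 6k/4k/3k/2k/1k RMB respectively. It is real, I can vouch for it: I got 1k in the previous session.</p></li></ul><h2 id="分析"><a class="markdownIt-Anchor" href="#分析"></a> Analysis</h2><p>InternVL3.5-1B is a LLaVA-style VLM. Such a VLM is roughly a vision backbone + MLP + LLM, where the MLP bridges vision and language by mapping visual tokens into a feature space shared by visual and linguistic semantics, which then serves as the LLM's visual input.<br />After running the baseline, the possible improvements were, in short:</p><ul><li>Data: ① more training data (found or synthesized); ② image data augmentation</li><li>Training: ③ try different fine-tuning schemes, such as full fine-tuning, and vary which parts of the model's weights are tuned</li><li>Going for bigger gains: ④ since there is a clear reward signal, RL is a good fit for optimizing the model's preferences and aligning its output with valid LaTeX; ⑤ use post-processing to constrain the output to valid LaTeX</li></ul><h2 id="过程"><a class="markdownIt-Anchor" href="#过程"></a> Process</h2><p><strong>Scaling up the number of samples</strong><br />I first tried a larger open-source dataset, and it flopped outright: 1 point on leaderboard A and 0.2 on leaderboard B. Counting command frequencies showed that the data distributions were worlds apart, so I guessed that the initial training data was strongly correlated with leaderboards A and B. (The organizers later stated explicitly that the initial training data and both leaderboards were generated in the same way. Well, if it were me, I would certainly have built some distribution shift into the leaderboard B data.) Looking at the initial data, it consists entirely of synthetic formulas containing differentials and matrices, essentially a high-frequency subset of LaTeX: syntactically correct, but not guaranteed to be meaningful. So I expanded the data following the distribution of the initial data.</p><p>The first data generator was fairly crude. Expanding the data to 9k improved the score, but expanding to 20k lowered the leaderboard A score instead. By then evaluation was taking so long that leaderboard B scores lagged badly, so I dropped the 20k version based on its leaderboard A performance.<br />The second data generator followed the command distribution and text length strictly, yet 90k samples still performed poorly, even worse than the 20k version. I gave up on it and shifted focus to the directions below.<br />Two possibilities were left unexplored at this point: whether to switch to full fine-tuning once the data volume grew, and the fact that I had only constrained the mean of the sample text length, not its distribution.</p><p><strong>Data cleaning</strong><br />While inspecting the command distribution of the text, I found many irregular spaces and line breaks, which seemed likely to add noise to the prediction space. After discussing it with an AI, 
I normalized the text by removing all line breaks and redundant spaces. The score improved.</p><p><strong>Data augmentation</strong><br />Following the hint, I implemented RandAugment-style image augmentation and kept the strength of certain transforms, such as the rotation angle, within a reasonable range.</p><p><strong>Applying a few more patches</strong><br />LoRA seems to learn a residual of the weights (rather than of the features), and it can be merged back into the original parameters, like applying an upgrade patch to the pretrained parameters.<br />So it is natural to apply several patches to strengthen task adaptation. I fine-tuned on four configurations in turn, LLM, MLP, Vision and LLM+MLP, and each one improved the results.<br />I also tried full fine-tuning of LLM + MLP. On leaderboard A it matched LoRA fine-tuning or was even slightly lower, which is a bit counterintuitive; perhaps too many parameters were being adjusted at once. Still, this fits the stereotype that a "residual" (of the weights) is easier to learn.</p><p><strong>A small blunder</strong><br />Submissions allow a custom prompt. What? Is prompt engineering still worth anything for a fine-tuned 1B VLM? Fine, it might be; I should at least tell the model that the output formula must be syntactically correct. So I designed the following prompt (in Chinese; roughly, "Generate the corresponding syntactically correct LaTeX formula text for the formula in the image", where the original prompt lacks "syntactically correct"):</p><figure class="highlight text"><table><tr><td class="code"><pre><span class="line">请根据图片中的公式生成对应的 latex 语法正确的公式文本。</span><br><span class="line">*Original prompt: 请根据图片中的公式生成对应的 latex 公式文本。</span><br></pre></td></tr></table></figure><p><del>Was this Chinese sentence written by a native speaker? Even an LLM might find it odd. I should write and express myself more from now on.</del><br />As it turned out, only after eight or nine submissions did I notice that, from the second version on, the training data was still using the default template. Ridiculous. This rookie mistake made me resubmit the same weights several more times to measure the effect of the prompt. The conclusion: it does not seem to matter much; leaderboard B fluctuates little, leaderboard A fluctuates a lot.</p><h2 id="更进一步"><a class="markdownIt-Anchor" href="#更进一步"></a> Going Further</h2><p>Blindly training and submitting to leaderboard A like buying lottery tickets was not much fun (mostly because there was little room left to improve), so I took the chance to try some other things.</p><p><strong>RL</strong><br />I had always kept my distance from RL, feeling that it needs a lot of time and space, and that proved true. I set up a GRPO pipeline based on trl and, with an LLM's help, designed three reward functions that evaluate, respectively, the syntactic correctness of the recognition result, its edit distance to the label, and the similarity between the rendered result and the input image (reusing the computation from the evaluation framework directly).</p><p>A small gripe: I suspect the organizers never reviewed the vibe-coded evaluation framework. Every similarity metric repeats the preprocessing, and batch inference is implemented but never used. I wonder whether that is one reason leaderboard B scores were so slow to come out.</p><p>GRPO was finally running, with less than 48 hours left before the deadline. The estimated run time was 40+ hours, and in practice it was slower. It also crashed twice along the way; MetaX cards do not feel fully stable yet, and domestic compute still has a long way to go. I could not find a suitable environment for vLLM on MetaX, the torch version was too old, and I could not get swift sample working either, so inference and training ran interleaved at a crawl.</p><p>In the log, clip_ratio=0 and the entropy metric was falling too; Claude said the model had not learned much. After all, the examples I had seen trained for hundreds of epochs, whereas I could not even finish one. After 600 steps the LoRA weights had barely changed, as the figure shows:<br /><img lazyload data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/600 steps 后 Lora 参数的变化比较.png" alt="Comparison of LoRA parameter changes after 600 steps" style="clear:both;display:block;" width="50%"><br />RL failed; enough said. By now less than 30 hours remained before the deadline.</p><p><strong>Constrained decoding / post-processing</strong><br 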
/>Actually, when I first read the task, one question came to mind: <strong>is similarity-based evaluation really effective?</strong><br />With the same rendering and the same preprocessing, can black text on a white/transparent background really differ that much from image to image?<br />I computed the mean pairwise similarity over 10 test images: under the default weighting, the similarity was mostly a little above 0.5 to 0.6. If leaderboard B used a low similarity threshold, or gave points for anything that renders successfully, then giving the LLM a fallback output that renders correctly would raise the score reliably. This is where trust_remote_code of the transformers Auto classes comes in: in short, it allows code from the model repository to be loaded and used for inference, which provides a way to inject post-processing at evaluation time. In a real test, returning one sample from the initial training set, sample19996.png, by default scored 0 on leaderboard A. A dead end; I infer that the similarity threshold is at least 0.7, or even above 0.75.</p><p>Following the idea of constrained decoding, one would constrain token sampling with a full LaTeX grammar. But LaTeX syntax is rather complex, so I started with a simple version driven by error samples:<br />Inference on 100 test images produced 6 rendering errors. I inspected the failing samples, categorized the causes, and set out to fix the known errors with a logits processor. So I asked Claude:</p><blockquote><p>The errors observed in the generated formulas are:<br />1. &amp; generated outside any matrix environment<br />2. Mismatched brackets, e.g. (]<br />3. Made-up commands, e.g. \vbeta<br />4. An underscore after a digit, e.g. b_6_4<br />5. Mismatched nesting, e.g. \end{array} appearing first after \begin{array} \begin{pmatrix}<br />6. …<br />Please suppress invalid output in a constrained-decoding-like way, generalizing from these examples. Each error type should be checked by its own function, so that they are easy to inspect and use.</p></blockquote><p>Claude created a class, LatexState, that records the various states of the generation process:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> dataclasses <span class="keyword">import</span> dataclass, field</span><br><span class="line"></span><br><span class="line"><span class="meta">@dataclass</span></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">LatexState</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;Track the syntactic state of the sequence generated so far.&quot;&quot;&quot;</span></span><br><span class="line">    <span class="comment"># Bracket stack: tracks unclosed brackets</span></span><br><span class="line">    
bracket_stack: <span class="built_in">list</span> = field(default_factory=<span class="built_in">list</span>)</span><br><span class="line">    <span class="comment"># Environment stack: tracks \begin&#123;&#125; \end&#123;&#125; nesting</span></span><br><span class="line">    env_stack: <span class="built_in">list</span> = field(default_factory=<span class="built_in">list</span>)</span><br><span class="line">    <span class="comment"># Whether we are currently inside a matrix environment</span></span><br><span class="line">    in_matrix_env: <span class="built_in">bool</span> = <span class="literal">False</span></span><br><span class="line">    <span class="comment"># Text of the last valid token</span></span><br><span class="line">    last_token_text: <span class="built_in">str</span> = <span class="string">&quot;&quot;</span></span><br><span class="line">    <span class="comment"># Full text generated so far</span></span><br><span class="line">    generated_text: <span class="built_in">str</span> = <span class="string">&quot;&quot;</span></span><br><span class="line"></span><br><span class="line">    <span class="comment"># Matrix-like environments (the environments in which &amp; is allowed)</span></span><br><span class="line">    MATRIX_ENVS = &#123;</span><br><span class="line">        <span class="string">&quot;matrix&quot;</span>, <span class="string">&quot;pmatrix&quot;</span>, <span class="string">&quot;bmatrix&quot;</span>, <span class="string">&quot;vmatrix&quot;</span>, <span class="string">&quot;Vmatrix&quot;</span>,</span><br><span class="line">        <span class="string">&quot;smallmatrix&quot;</span>, <span class="string">&quot;array&quot;</span>, <span class="string">&quot;aligned&quot;</span>, <span class="string">&quot;align&quot;</span>, <span class="string">&quot;cases&quot;</span>,</span><br><span class="line">        <span class="string">&quot;gather&quot;</span>, <span class="string">&quot;eqnarray&quot;</span>,</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment"># ...</span></span><br></pre></td></tr></table></figure><p>Based on the state in LatexState, 
checker functions were defined for the various error types, for example:</p><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">def</span> <span class="title function_">check_ampersand_outside_matrix</span>(<span class="params">candidate: <span class="built_in">str</span>, state: LatexState</span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Error type 1: &amp; generated outside a matrix environment</span></span><br><span class="line"><span class="string">    Returns True if the token is illegal</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    <span class="keyword">if</span> <span class="string">&#x27;&amp;&#x27;</span> <span class="keyword">in</span> candidate <span class="keyword">and</span> <span class="keyword">not</span> state.in_matrix_env:</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">True</span></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">False</span></span><br></pre></td></tr></table></figure><details>    <summary>Click to expand the other checker functions</summary><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> re</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">check_mismatched_brackets</span>(<span class="params">candidate: <span class="built_in">str</span>, state: LatexState</span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Error type 2: mismatched brackets, e.g. (]</span></span><br><span class="line"><span class="string">    Simulates pushing candidate onto the current bracket stack and checks for a mismatch</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    PAIRS = &#123;<span class="string">&#x27;)&#x27;</span>: <span class="string">&#x27;(&#x27;</span>, <span class="string">&#x27;]&#x27;</span>: <span class="string">&#x27;[&#x27;</span>, <span class="string">&#x27;&#125;&#x27;</span>: <span class="string">&#x27;&#123;&#x27;</span>&#125;</span><br><span class="line">    OPEN = <span class="built_in">set</span>(<span class="string">&#x27;([&#123;&#x27;</span>)</span><br><span class="line">    CLOSE = <span class="built_in">set</span>(<span class="string">&#x27;)]&#125;&#x27;</span>)</span><br><span 
class="line"></span><br><span class="line">    <span class="comment"># Simulate on a copy of the current stack</span></span><br><span class="line">    simulated_stack = <span class="built_in">list</span>(state.bracket_stack)</span><br><span class="line">    <span class="keyword">for</span> ch <span class="keyword">in</span> candidate:</span><br><span class="line">        <span class="keyword">if</span> ch <span class="keyword">in</span> OPEN:</span><br><span class="line">            simulated_stack.append(ch)</span><br><span class="line">        <span class="keyword">elif</span> ch <span class="keyword">in</span> CLOSE:</span><br><span class="line">            <span class="keyword">if</span> <span class="keyword">not</span> simulated_stack:</span><br><span class="line">                <span class="keyword">return</span> <span class="literal">True</span>  <span class="comment"># no matching opening bracket</span></span><br><span class="line">            <span class="keyword">if</span> simulated_stack[-<span class="number">1</span>] != PAIRS[ch]:</span><br><span class="line">                <span class="keyword">return</span> <span class="literal">True</span>  <span class="comment"># bracket types do not match</span></span><br><span class="line">            simulated_stack.pop()</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">False</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">check_invalid_command</span>(<span class="params">candidate: <span class="built_in">str</span>, state: LatexState</span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Error type 3: made-up commands, e.g. \vbeta, \invalidcmd</span></span><br><span class="line"><span class="string">    Checks whether candidate contains a LaTeX command that is not on the whitelist</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    
commands = re.findall(<span class="string">r&#x27;\\([a-zA-Z]+)&#x27;</span>, candidate)</span><br><span class="line">    <span class="keyword">for</span> cmd <span class="keyword">in</span> commands:</span><br><span class="line">        <span class="keyword">if</span> cmd <span class="keyword">not</span> <span class="keyword">in</span> VALID_COMMANDS:</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">True</span></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">False</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">check_consecutive_subscripts</span>(<span class="params">candidate: <span class="built_in">str</span>, state: LatexState</span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Error type 4: consecutive underscores after a digit or letter, e.g. b_6_4</span></span><br><span class="line"><span class="string">    Checks the combined text for an illegal consecutive subscript of the form x_y_z (not wrapped in &#123;&#125;)</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    combined = state.generated_text + candidate</span><br><span class="line">    <span class="comment"># x_y_z form: an unbraced single-character subscript immediately followed by another underscore</span></span><br><span class="line">    <span class="keyword">if</span> re.search(<span class="string">r&#x27;_\s*[A-Za-z0-9]\s*_&#x27;</span>, combined):</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">True</span></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">False</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title 
function_">check_env_order_mismatch</span>(<span class="params">candidate: <span class="built_in">str</span>, state: LatexState</span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Error type 5: \end&#123;env&#125; does not match the \begin&#123;env&#125; on top of the stack</span></span><br><span class="line"><span class="string">    e.g. \end&#123;array&#125; appearing after \begin&#123;array&#125;\begin&#123;pmatrix&#125;</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    ends = re.findall(<span class="string">r&#x27;\\end\&#123;(\w+)\&#125;&#x27;</span>, candidate)</span><br><span class="line">    simulated_stack = <span class="built_in">list</span>(state.env_stack)</span><br><span class="line">    <span class="keyword">for</span> env <span class="keyword">in</span> ends:</span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> simulated_stack:</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">True</span>  <span class="comment"># no matching \begin</span></span><br><span class="line">        <span class="keyword">if</span> simulated_stack[-<span class="number">1</span>] != env:</span><br><span class="line">            <span class="keyword">return</span> <span class="literal">True</span>  <span class="comment"># does not match the top of the stack</span></span><br><span class="line">        simulated_stack.pop()</span><br><span class="line">    <span class="keyword">return</span> <span class="literal">False</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">check_illegal_characters</span>(<span class="params">candidate: <span class="built_in">str</span>, state: LatexState</span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">    
<span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Error type 6: illegal Unicode characters (Chinese, full-width, etc.)</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    ILLEGAL_RANGES = [</span><br><span class="line">        (<span class="number">0x4E00</span>, <span class="number">0x9FFF</span>),   <span class="comment"># CJK Unified Ideographs (basic block)</span></span><br><span class="line">        (<span class="number">0x3400</span>, <span class="number">0x4DBF</span>),   <span class="comment"># CJK Extension A</span></span><br><span class="line">        (<span class="number">0xFF00</span>, <span class="number">0xFFEF</span>),   <span class="comment"># full-width forms</span></span><br><span class="line">        (<span class="number">0x3000</span>, <span class="number">0x303F</span>),   <span class="comment"># CJK punctuation</span></span><br><span class="line">        (<span class="number">0x0080</span>, <span class="number">0x009F</span>),   <span class="comment"># C1 control characters</span></span><br><span class="line">    ]</span><br><span class="line">    <span class="keyword">for</span> char <span class="keyword">in</span> candidate:</span><br><span class="line">        cp = <span class="built_in">ord</span>(char)</span><br><span class="line">        <span class="keyword">for</span> start, end <span class="keyword">in</span> ILLEGAL_RANGES:</span><br><span class="line">            <span class="keyword">if</span> start &lt;= cp &lt;= end:</span><br><span class="line">                <span class="keyword">return</span> <span class="literal">True</span></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">False</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">check_double_superscript</span>(<span class="params">candidate: <span class="built_in">str</span>, state: LatexState</span>) -&gt; <span 
class="built_in">bool</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Error type 7: consecutive superscripts, e.g. x^2^3 (not wrapped in &#123;&#125;)</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    combined = state.generated_text + candidate</span><br><span class="line">    <span class="keyword">if</span> re.search(<span class="string">r&#x27;\^\s*[A-Za-z0-9]\s*\^&#x27;</span>, combined):</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">True</span></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">False</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">check_empty_group</span>(<span class="params">candidate: <span class="built_in">str</span>, state: LatexState</span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Error type 8: empty brace group &#123;&#125; (usually meaningless; causes rendering errors in key positions)</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    combined = state.generated_text + candidate</span><br><span class="line">    <span class="comment"># Empty group right after a key command such as \frac&#123;&#125;&#123;&#125; or \sqrt&#123;&#125;</span></span><br><span class="line">    <span class="keyword">if</span> re.search(<span class="string">r&#x27;\\(frac|sqrt|hat|bar|vec|dot)\&#123;\&#125;&#x27;</span>, combined):</span><br><span class="line">        <span class="keyword">return</span> <span class="literal">True</span></span><br><span class="line">    <span class="keyword">return</span> <span class="literal">False</span></span><br></pre></td></tr></table></figure></details>&emsp;<p>Wonderful. Claude then provided the implementation of the 
logits processor:</p><details>    <summary>Click to expand details</summary><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> typing <span class="keyword">import</span> <span class="type">Optional</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">import</span> torch</span><br><span class="line"><span class="keyword">from</span> transformers <span class="keyword">import</span> LogitsProcessor</span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">LatexConstraintProcessor</span>(<span class="title class_ inherited__">LogitsProcessor</span>):</span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params"></span></span><br><span class="line"><span class="params">        self,</span></span><br><span class="line"><span class="params">        tokenizer,</span></span><br><span class="line"><span class="params">        enabled_checkers: <span class="type">Optional</span>[<span class="built_in">set</span>] = <span class="literal">None</span>,  <span class="comment"># None enables all checkers</span></span></span><br><span class="line"><span class="params">        penalty: <span class="built_in">float</span> = -<span class="built_in">float</span>(<span class="params"><span class="string">&#x27;inf&#x27;</span></span>),           <span class="comment"># penalty applied to illegal tokens</span></span></span><br><span class="line"><span class="params">        verbose: <span class="built_in">bool</span> = <span class="literal">False</span>,</span></span><br><span class="line"><span class="params">    </span>):</span><br><span class="line">        <span class="variable language_">self</span>.tokenizer = tokenizer</span><br><span class="line">        <span class="variable language_">self</span>.penalty = penalty</span><br><span class="line">        <span class="variable language_">self</span>.verbose = verbose</span><br><span class="line">        <span class="variable language_">self</span>.state = LatexState()</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Select the enabled checkers</span></span><br><span class="line">        <span class="keyword">if</span> enabled_checkers <span class="keyword">is</span> <span 
class="literal">None</span>:</span><br><span class="line">            <span class="variable language_">self</span>.checkers = CHECKERS</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            <span class="variable language_">self</span>.checkers = &#123;k: v <span class="keyword">for</span> k, v <span class="keyword">in</span> CHECKERS.items() <span class="keyword">if</span> k <span class="keyword">in</span> enabled_checkers&#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Precompute the decoded text of every vocabulary token (avoids re-decoding at each step)</span></span><br><span class="line">        <span class="variable language_">self</span>._vocab_decoded = <span class="variable language_">self</span>._precompute_vocab(tokenizer)</span><br><span class="line">        <span class="built_in">print</span>(<span class="string">f&quot;[LatexConstraintProcessor] enabled checkers: <span class="subst">&#123;<span class="built_in">list</span>(self.checkers.keys())&#125;</span>&quot;</span>)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">_precompute_vocab</span>(<span class="params">self, tokenizer</span>) -&gt; <span class="built_in">dict</span>:</span><br><span class="line">        vocab = tokenizer.get_vocab()</span><br><span class="line">        decoded = &#123;&#125;</span><br><span class="line">        <span class="keyword">for</span> token, idx <span class="keyword">in</span> vocab.items():</span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                decoded[idx] = tokenizer.convert_tokens_to_string([token])</span><br><span class="line">            <span class="keyword">except</span> Exception:</span><br><span class="line">                decoded[idx] = <span class="string">&quot;&quot;</span></span><br><span class="line">        <span class="keyword">return</span> 
decoded</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__call__</span>(<span class="params"></span></span><br><span class="line"><span class="params">        self,</span></span><br><span class="line"><span class="params">        input_ids: torch.LongTensor,</span></span><br><span class="line"><span class="params">        scores: torch.FloatTensor,</span></span><br><span class="line"><span class="params">    </span>) -&gt; torch.FloatTensor:</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Update state: decode the most recently generated token</span></span><br><span class="line">        <span class="keyword">if</span> input_ids.shape[<span class="number">1</span>] &gt; <span class="number">0</span>:</span><br><span class="line">            last_token_id = input_ids[<span class="number">0</span>, -<span class="number">1</span>].item()</span><br><span class="line">            last_text = <span class="variable language_">self</span>._vocab_decoded.get(last_token_id, <span class="string">&quot;&quot;</span>)</span><br><span class="line">            <span class="variable language_">self</span>.state.update(last_text)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Check every candidate token one by one</span></span><br><span class="line">        banned_ids = []</span><br><span class="line">        <span class="keyword">for</span> token_id, candidate_text <span class="keyword">in</span> <span class="variable language_">self</span>._vocab_decoded.items():</span><br><span class="line">            <span class="keyword">for</span> checker_name, checker_fn <span class="keyword">in</span> <span class="variable language_">self</span>.checkers.items():</span><br><span class="line">                <span class="keyword">if</span> checker_fn(candidate_text, <span class="variable language_">self</span>.state):</span><br><span class="line">                    banned_ids.append(token_id)</span><br><span class="line">                    <span class="keyword">if</span> <span class="variable language_">self</span>.verbose:</span><br><span class="line">                        <span class="built_in">print</span>(<span class="string">f&quot;[banned] id=<span class="subst">&#123;token_id&#125;</span> text=<span class="subst">&#123;candidate_text!r&#125;</span> reason=<span class="subst">&#123;checker_name&#125;</span>&quot;</span>)</span><br><span class="line">                    <span class="keyword">break</span>  <span class="comment"># one failed check is enough to ban; skip the rest</span></span><br><span class="line"></span><br><span class="line">        <span class="keyword">if</span> banned_ids:</span><br><span class="line">            scores[<span class="number">0</span>, banned_ids] = <span class="variable language_">self</span>.penalty</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> scores</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">reset</span>(<span class="params">self</span>):</span><br><span class="line">        <span class="string">&quot;&quot;&quot;Reset state before each new generation&quot;&quot;&quot;</span></span><br><span class="line">        <span class="variable language_">self</span>.state = LatexState()</span><br></pre></td></tr></table></figure></details>&emsp;
<p>Iterating over the entire vocabulary at every step, really? Sure enough, it was painfully slow when I tried it. Worse, the model started emitting tokens endlessly. My guess was that the EOS token had been masked as well,<br />but after removing the special tokens from the mask list, inference still did not terminate properly. By then the competition was already over, and I submitted the version that returns fixed samples.</p><p>On closer inspection, everything after the EOS token was a pad token, which suggests that the other samples in the batch had simply not finished generating yet. That led me to the batch-handling bug in the version above: input_ids[0, -1] and scores[0, banned_ids] only process the first sample in the batch. So I kept prompting Claude:</p><blockquote><p>Clean up the code above and give complete versions of LatexState and the LogitsProcessor.<br />1. Dynamically compute the tokens to mask from the generation state recorded in LatexState, e.g. brackets, commands, _, etc.;<br />2. Find tokens that contain non-standard LaTeX characters, such as Chinese characters, and mask them statically;<br />3. Support batched inference.</p></blockquote><p>After a few rounds of prompting and edits, I ended up with a barely usable version. Its weakness is that complex sequences involving environments are also handled at the level of individual tokens, when they should really be treated from the perspective of the token stream. That would introduce too much complexity, though, so I stopped here.</p>
<details>    <summary>Click to expand</summary><figure class="highlight python"><table><tr><td class="code"><pre><span class="line"><span class="keyword">class</span> <span class="title class_">LatexConstraintProcessor</span>(<span class="title class_ inherited__">LogitsProcessor</span>):</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__init__</span>(<span class="params"></span></span><br><span class="line"><span class="params">        self,</span></span><br><span class="line"><span class="params">        tokenizer=<span class="literal">None</span>,</span></span><br><span class="line"><span class="params">        model=<span class="literal">None</span>,</span></span><br><span class="line"><span class="params">        penalty: <span class="built_in">float</span> = -<span class="built_in">float</span>(<span class="params"><span class="string">&#x27;inf&#x27;</span></span>),</span></span><br><span class="line"><span class="params">        verbose: <span class="built_in">bool</span> = <span class="literal">False</span>,</span></span><br><span class="line"><span class="params">    </span>):</span><br><span class="line">        <span class="comment"># Supports loading the tokenizer automatically from model.name_or_path</span></span><br><span class="line">        <span class="keyword">if</span> tokenizer <span class="keyword">is</span> <span class="literal">None</span>:</span><br><span class="line">            <span class="keyword">assert</span> model <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span>, (</span><br><span class="line">         
       <span class="string">&quot;Please pass either tokenizer or model:\n&quot;</span></span><br><span class="line">                <span class="string">&quot;  LatexConstraintProcessor(tokenizer=tokenizer)\n&quot;</span></span><br><span class="line">                <span class="string">&quot;  LatexConstraintProcessor(model=model)&quot;</span></span><br><span class="line">            )</span><br><span class="line">            <span class="keyword">from</span> transformers <span class="keyword">import</span> AutoTokenizer</span><br><span class="line">            tokenizer = AutoTokenizer.from_pretrained(</span><br><span class="line">                model.name_or_path, trust_remote_code=<span class="literal">True</span></span><br><span class="line">            )</span><br><span class="line"></span><br><span class="line">        <span class="variable language_">self</span>.tokenizer = tokenizer</span><br><span class="line">        <span class="variable language_">self</span>.penalty = penalty</span><br><span class="line">        <span class="variable language_">self</span>.verbose = verbose</span><br><span class="line"></span><br><span class="line">        <span class="comment"># EOS / special tokens</span></span><br><span class="line">        <span class="variable language_">self</span>._eos_token_id = tokenizer.eos_token_id</span><br><span class="line">        special_ids = <span class="built_in">set</span>(tokenizer.all_special_ids)</span><br><span class="line">        <span class="keyword">for</span> attr <span class="keyword">in</span> (<span class="string">&#x27;eos_token_id&#x27;</span>, <span class="string">&#x27;bos_token_id&#x27;</span>, <span class="string">&#x27;pad_token_id&#x27;</span>, <span class="string">&#x27;unk_token_id&#x27;</span>):</span><br><span class="line">            val = <span class="built_in">getattr</span>(tokenizer, attr, <span class="literal">None</span>)</span><br><span class="line">            <span class="keyword">if</span> val <span 
class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span>:</span><br><span class="line">                special_ids.add(val)</span><br><span class="line">        <span class="variable language_">self</span>._special_token_ids = <span class="built_in">list</span>(special_ids)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Precompute the vocabulary</span></span><br><span class="line">        <span class="variable language_">self</span>._vocab_decoded, <span class="variable language_">self</span>._static_banned = <span class="variable language_">self</span>._precompute_vocab(tokenizer, special_ids)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Precompute the indices used by the dynamic checks</span></span><br><span class="line">        <span class="variable language_">self</span>._precompute_token_indices()</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Per-sample batch states (lazily initialized)</span></span><br><span class="line">        <span class="variable language_">self</span>.states: <span class="built_in">list</span>[LatexState] = []</span><br><span class="line"></span><br><span class="line">        <span class="built_in">print</span>(</span><br><span class="line">            <span class="string">f&quot;[LatexConstraintProcessor] &quot;</span></span><br><span class="line">            <span class="string">f&quot;special tokens: <span class="subst">&#123;<span class="built_in">len</span>(special_ids)&#125;</span>  &quot;</span></span><br><span class="line">            <span class="string">f&quot;statically banned: <span class="subst">&#123;<span class="built_in">len</span>(self._static_banned)&#125;</span>  &quot;</span></span><br><span class="line">            <span class="string">f&quot;vocab for dynamic checks: <span class="subst">&#123;<span class="built_in">len</span>(self._vocab_decoded)&#125;</span>&quot;</span></span><br><span class="line">        )</span><br><span class="line"></span><br><span class="line">  
  <span class="comment"># ── Vocabulary precomputation ───────────────────────────────</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">_is_legal_latex_char</span>(<span class="params">self, text: <span class="built_in">str</span></span>) -&gt; <span class="built_in">bool</span>:</span><br><span class="line">        <span class="string">&quot;&quot;&quot;Return whether the text contains only legal LaTeX characters&quot;&quot;&quot;</span></span><br><span class="line">        LEGAL_PATTERN = re.<span class="built_in">compile</span>(</span><br><span class="line">            <span class="string">r&#x27;^[&#x27;</span></span><br><span class="line">            <span class="string">r&#x27;a-zA-Z0-9&#x27;</span>            <span class="comment"># English letters and digits</span></span><br><span class="line">            <span class="string">r&#x27;\s&#x27;</span>                   <span class="comment"># whitespace (space, newline, tab)</span></span><br><span class="line">            <span class="string">r&#x27;\\&#123;&#125;()\[\]&#x27;</span>           <span class="comment"># core LaTeX symbols</span></span><br><span class="line">            <span class="string">r&#x27;\+\-\*/=&lt;&gt;!&amp;|^~&#x27;</span>      <span class="comment"># operators</span></span><br><span class="line">            <span class="string">r&#x27;_\^&#x27;</span>                  <span class="comment"># subscript and superscript markers</span></span><br><span class="line">            <span class="string">r&#x27;.,;:\&#x27;&quot;`&#x27;</span>             <span class="comment"># punctuation</span></span><br><span class="line">            <span class="string">r&#x27;#%&#x27;</span>                   <span class="comment"># other common characters; originally r&#x27;@#%&#x27;, changed to r&#x27;#%&#x27; to match the dataset</span></span><br><span class="line">            <span class="string">r&#x27;]+$&#x27;</span></span><br><span class="line">        )</span><br><span class="line">        ILLEGAL_RANGES = [</span><br><span class="line">            (<span class="number">0x4E00</span>, <span 
class="number">0x9FFF</span>),       <span class="comment"># CJK Unified Ideographs (basic Chinese characters)</span></span><br><span class="line">            (<span class="number">0x3400</span>, <span class="number">0x4DBF</span>),       <span class="comment"># CJK Extension A</span></span><br><span class="line">            (<span class="number">0x20000</span>, <span class="number">0x2A6DF</span>),     <span class="comment"># CJK Extension B</span></span><br><span class="line">            (<span class="number">0xFF00</span>, <span class="number">0xFFEF</span>),       <span class="comment"># fullwidth forms</span></span><br><span class="line">            (<span class="number">0x3000</span>, <span class="number">0x303F</span>),       <span class="comment"># CJK symbols and punctuation</span></span><br><span class="line">            (<span class="number">0x0080</span>, <span class="number">0x009F</span>),       <span class="comment"># invisible control characters</span></span><br><span class="line">            (<span class="number">0xD800</span>, <span class="number">0xDFFF</span>),       <span class="comment"># UTF-16 surrogates</span></span><br><span class="line">            (<span class="number">0xE000</span>, <span class="number">0xF8FF</span>),       <span class="comment"># Private Use Area</span></span><br><span class="line">        ]</span><br><span class="line">        <span class="keyword">for</span> char <span class="keyword">in</span> text:</span><br><span class="line">            cp = <span class="built_in">ord</span>(char)</span><br><span class="line">            <span class="keyword">for</span> start, end <span class="keyword">in</span> ILLEGAL_RANGES:</span><br><span class="line">                <span class="keyword">if</span> start &lt;= cp &lt;= end:</span><br><span class="line">                    <span class="keyword">return</span> <span class="literal">False</span></span><br><span class="line">        <span class="keyword">return</span> <span class="built_in">bool</span>(LEGAL_PATTERN.<span class="keyword">match</span>(text))</span><br><span class="line"></span><br><span 
class="line">    <span class="keyword">def</span> <span class="title function_">_precompute_vocab</span>(<span class="params"></span></span><br><span class="line"><span class="params">        self,</span></span><br><span class="line"><span class="params">        tokenizer,</span></span><br><span class="line"><span class="params">        special_ids: <span class="built_in">set</span>,</span></span><br><span class="line"><span class="params">    </span>) -&gt; <span class="built_in">tuple</span>[<span class="built_in">dict</span>[<span class="built_in">int</span>, <span class="built_in">str</span>], torch.Tensor]:</span><br><span class="line">        <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">        Returns:</span></span><br><span class="line"><span class="string">          vocab_decoded: id → text for legal tokens (used by the dynamic checks)</span></span><br><span class="line"><span class="string">          static_banned: tensor of ids of tokens with illegal characters (masked statically)</span></span><br><span class="line"><span class="string">        &quot;&quot;&quot;</span></span><br><span class="line">        vocab = tokenizer.get_vocab()</span><br><span class="line">        static_banned = []</span><br><span class="line">        vocab_decoded = &#123;&#125;</span><br><span class="line"></span><br><span class="line">        <span class="keyword">for</span> token, idx <span class="keyword">in</span> vocab.items():</span><br><span class="line">            <span class="comment"># Keep special tokens as-is with empty text (they skip every rule check)</span></span><br><span class="line">            <span class="keyword">if</span> idx <span class="keyword">in</span> special_ids:</span><br><span class="line">                vocab_decoded[idx] = <span class="string">&quot;&quot;</span></span><br><span class="line">                <span class="keyword">continue</span></span><br><span class="line">            <span class="keyword">try</span>:</span><br><span class="line">                text = 
tokenizer.convert_tokens_to_string([token])</span><br><span class="line">            <span class="keyword">except</span> Exception:</span><br><span class="line">                vocab_decoded[idx] = <span class="string">&quot;&quot;</span></span><br><span class="line">                <span class="keyword">continue</span></span><br><span class="line"></span><br><span class="line">            <span class="keyword">if</span> <span class="keyword">not</span> <span class="variable language_">self</span>._is_legal_latex_char(text):</span><br><span class="line">                static_banned.append(idx)</span><br><span class="line">            <span class="keyword">else</span>:</span><br><span class="line">                vocab_decoded[idx] = text</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> vocab_decoded, torch.tensor(static_banned, dtype=torch.long)</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">_precompute_token_indices</span>(<span class="params">self</span>):</span><br><span class="line">        <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">        Pre-group tokens by content so that dynamic masking is an O(1) lookup:</span></span><br><span class="line"><span class="string">          _tokens_startwith[ch] → ids of all tokens that start with character ch</span></span><br><span class="line"><span class="string">          _end_env_ids[env]     → ids of tokens containing \\end&#123;env&#125;</span></span><br><span class="line"><span class="string">          invalid-command ids   → tokens containing unknown commands, merged into _static_banned</span></span><br><span class="line"><span class="string">        &quot;&quot;&quot;</span></span><br><span class="line">        <span class="comment"># Groups keyed by leading character</span></span><br><span class="line">        KEY_CHARS = (<span class="string">&#x27;^&#x27;</span>, <span class="string">&#x27;_&#x27;</span>, <span class="string">&#x27;&amp;&#x27;</span>, <span 
class="string">&#x27;]&#x27;</span>, <span class="string">&#x27;)&#x27;</span>, <span class="string">&#x27;&#125;&#x27;</span>)</span><br><span class="line">        <span class="variable language_">self</span>._tokens_startwith: <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="built_in">list</span>[<span class="built_in">int</span>]] = &#123;ch: [] <span class="keyword">for</span> ch <span class="keyword">in</span> KEY_CHARS&#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment"># \end&#123;env&#125; groups</span></span><br><span class="line">        <span class="variable language_">self</span>._end_env_ids: <span class="built_in">dict</span>[<span class="built_in">str</span>, <span class="built_in">list</span>[<span class="built_in">int</span>]] = &#123;&#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Invalid commands (static; merged into static_banned)</span></span><br><span class="line">        invalid_cmd_ids = []</span><br><span class="line"></span><br><span class="line">        <span class="keyword">for</span> token_id, text <span class="keyword">in</span> <span class="variable language_">self</span>._vocab_decoded.items():</span><br><span class="line">            <span class="keyword">if</span> <span class="keyword">not</span> text:</span><br><span class="line">                <span class="keyword">continue</span></span><br><span class="line"></span><br><span class="line">            <span class="comment"># Group by leading character (leading whitespace is ignored)</span></span><br><span class="line">            <span class="keyword">for</span> ch <span class="keyword">in</span> KEY_CHARS:</span><br><span class="line">                <span class="keyword">if</span> text.lstrip().startswith(ch):</span><br><span class="line">                    <span class="variable language_">self</span>._tokens_startwith[ch].append(token_id)</span><br><span class="line"></span><br><span class="line">            <span class="comment"># 
\end&#123;env&#125; groups</span></span><br><span class="line">            m = re.search(<span class="string">r&#x27;\\end\&#123;(\w+)\&#125;&#x27;</span>, text)</span><br><span class="line">            <span class="keyword">if</span> m:</span><br><span class="line">                env = m.group(<span class="number">1</span>)</span><br><span class="line">                <span class="variable language_">self</span>._end_env_ids.setdefault(env, []).append(token_id)</span><br><span class="line"></span><br><span class="line">            <span class="comment"># Invalid-command check</span></span><br><span class="line">            commands = re.findall(<span class="string">r&#x27;\\([a-zA-Z]+)&#x27;</span>, text)</span><br><span class="line">            <span class="keyword">if</span> <span class="built_in">any</span>(cmd <span class="keyword">not</span> <span class="keyword">in</span> VALID_COMMANDS <span class="keyword">for</span> cmd <span class="keyword">in</span> commands):</span><br><span class="line">                invalid_cmd_ids.append(token_id)</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Merge invalid commands into the static ban list</span></span><br><span class="line">        <span class="keyword">if</span> invalid_cmd_ids:</span><br><span class="line">            extra = torch.tensor(invalid_cmd_ids, dtype=torch.long)</span><br><span class="line">            <span class="variable language_">self</span>._static_banned = torch.cat([<span class="variable language_">self</span>._static_banned, extra]).unique()</span><br><span class="line"></span><br><span class="line">    <span class="comment"># ── Dynamic masking ─────────────────────────────────────────</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">_get_context_banned</span>(<span class="params">self, state: LatexState</span>) -&gt; <span class="built_in">list</span>[<span class="built_in">int</span>]:</span><br><span class="line">        <span class="string">&quot;&quot;&quot;Return the token ids to mask dynamically, given the current state&quot;&quot;&quot;</span></span><br><span class="line">        banned = []</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Rule 1: after ^ or _ with no complete argument yet, forbid another ^ or _</span></span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> state.script_has_arg:</span><br><span class="line">            banned.extend(<span class="variable language_">self</span>._tokens_startwith[<span class="string">&#x27;^&#x27;</span>])</span><br><span class="line">            banned.extend(<span class="variable language_">self</span>._tokens_startwith[<span class="string">&#x27;_&#x27;</span>])</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Rule 2: right after a script argument completes (e.g. at the end of a^1 or a^&#123;12&#125;),</span></span><br><span class="line">        <span class="comment"># forbid an immediately following script of the same kind (prevents a^1^ or a_1_2)</span></span><br><span class="line">        <span class="comment"># Note: the other kind is still allowed; a^1_2 is legal</span></span><br><span class="line">        <span class="keyword">elif</span> state.script_just_completed <span class="keyword">and</span> state.last_script_char:</span><br><span class="line">            banned.extend(<span class="variable language_">self</span>._tokens_startwith[state.last_script_char])</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Rule 3: outside a matrix environment, forbid &amp;</span></span><br><span class="line">        <span class="keyword">if</span> <span class="keyword">not</span> state.in_matrix_env:</span><br><span class="line">            banned.extend(<span class="variable language_">self</span>._tokens_startwith[<span class="string">&#x27;&amp;&#x27;</span>])</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Rule 4: forbid closing brackets that do not match the top of the bracket stack</span></span><br><span class="line">        <span class="keyword">if</span> state.bracket_stack:</span><br><span class="line">            top = state.bracket_stack[-<span class="number">1</span>]</span><br><span class="line">            <span class="keyword">for</span> illegal_close <span class="keyword">in</span> BRACKET_MISMATCH.get(top, ()):</span><br><span class="line">                banned.extend(<span class="variable language_">self</span>._tokens_startwith[illegal_close])</span><br><span class="line">        <span class="keyword">else</span>:</span><br><span class="line">            <span class="comment"># The stack is empty: forbid every closing bracket</span></span><br><span class="line">            <span class="keyword">for</span> illegal_close <span class="keyword">in</span> CLOSE_BRACKETS:</span><br><span class="line">                banned.extend(<span class="variable language_">self</span>._tokens_startwith[illegal_close])</span><br><span class="line"></span><br><span class="line">        <span class="comment"># Rule 5: \end&#123;env&#125; must match the top of the environment stack</span></span><br><span class="line">        <span class="keyword">if</span> state.env_stack:</span><br><span class="line">            top_env = state.env_stack[-<span class="number">1</span>]</span><br><span class="line">            <span class="keyword">for</span> env, ids <span class="keyword">in</span> <span class="variable language_">self</span>._end_env_ids.items():</span><br><span class="line">                <span class="keyword">if</span> env != top_env:</span><br><span class="line">                    banned.extend(ids)</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> banned</span><br><span class="line"></span><br><span class="line">    <span class="comment"># ── __call__ ────────────────────────────────────────────────</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">__call__</span>(<span class="params"></span></span><br><span class="line"><span class="params">        self,</span></span><br><span class="line"><span class="params">       
 input_ids: torch.LongTensor,   <span class="comment"># [batch, seq_len]</span></span></span><br><span class="line"><span class="params">        scores: torch.FloatTensor,     <span class="comment"># [batch, vocab_size]</span></span></span><br><span class="line"><span class="params">    </span>) -&gt; torch.FloatTensor:</span><br><span class="line"></span><br><span class="line">        batch_size = input_ids.shape[<span class="number">0</span>]</span><br><span class="line">        device = scores.device</span><br><span class="line"></span><br><span class="line">        <span class="comment"># 延迟初始化 / batch size 变化时重置状态</span></span><br><span class="line">        <span class="keyword">if</span> <span class="built_in">len</span>(<span class="variable language_">self</span>.states) != batch_size:</span><br><span class="line">            <span class="variable language_">self</span>.states = [LatexState() <span class="keyword">for</span> _ <span class="keyword">in</span> <span class="built_in">range</span>(batch_size)]</span><br><span class="line"></span><br><span class="line">        <span class="comment"># 设备对齐（只在首次或设备变化时迁移）</span></span><br><span class="line">        <span class="keyword">if</span> <span class="variable language_">self</span>._static_banned.device != device:</span><br><span class="line">            <span class="variable language_">self</span>._static_banned = <span class="variable language_">self</span>._static_banned.to(device)</span><br><span class="line"></span><br><span class="line">        <span class="keyword">for</span> i <span class="keyword">in</span> <span class="built_in">range</span>(batch_size):</span><br><span class="line">            <span class="comment"># 已生成 EOS 的样本跳过</span></span><br><span class="line">            <span class="keyword">if</span> <span class="variable language_">self</span>._eos_token_id <span class="keyword">is</span> <span class="keyword">not</span> <span class="literal">None</span>:</span><br><span class="line">  
              <span class="keyword">if</span> (input_ids[i] == <span class="variable language_">self</span>._eos_token_id).<span class="built_in">any</span>():</span><br><span class="line">                    <span class="keyword">continue</span></span><br><span class="line"></span><br><span class="line">            <span class="comment"># 更新第 i 个样本的状态</span></span><br><span class="line">            <span class="keyword">if</span> input_ids.shape[<span class="number">1</span>] &gt; <span class="number">0</span>:</span><br><span class="line">                last_id = input_ids[i, -<span class="number">1</span>].item()</span><br><span class="line">                last_text = <span class="variable language_">self</span>._vocab_decoded.get(last_id, <span class="string">&quot;&quot;</span>)</span><br><span class="line">                <span class="variable language_">self</span>.states[i].update(last_text)</span><br><span class="line"></span><br><span class="line">                <span class="keyword">if</span> <span class="variable language_">self</span>.verbose:</span><br><span class="line">                    <span class="built_in">print</span>(</span><br><span class="line">                        <span class="string">f&quot;[batch=<span class="subst">&#123;i&#125;</span>] last_token=<span class="subst">&#123;last_text!r&#125;</span>  &quot;</span></span><br><span class="line">                        <span class="string">f&quot;env_stack=<span class="subst">&#123;self.states[i].env_stack&#125;</span>  &quot;</span></span><br><span class="line">                        <span class="string">f&quot;bracket_stack=<span class="subst">&#123;self.states[i].bracket_stack&#125;</span>  &quot;</span></span><br><span class="line">                        <span class="string">f&quot;in_matrix=<span class="subst">&#123;self.states[i].in_matrix_env&#125;</span>&quot;</span></span><br><span class="line">                    )</span><br><span class="line"></span><br><span class="line"> 
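           <span class="comment"># Save the special-token scores first, so that masking never changes them (e.g. EOS)</span></span><br><span class="line">            special_ids = <span class="built_in">list</span>(<span class="variable language_">self</span>._special_token_ids)</span><br><span class="line">            special_scores = scores[i, special_ids].clone()</span><br><span class="line">            <span class="comment"># Static masking</span></span><br><span class="line">            scores[i, <span class="variable language_">self</span>._static_banned] = <span class="variable language_">self</span>.penalty</span><br><span class="line">            <span class="comment"># Dynamic masking</span></span><br><span class="line">            context_banned = <span class="variable language_">self</span>._get_context_banned(<span class="variable language_">self</span>.states[i])</span><br><span class="line">            <span class="keyword">if</span> context_banned:</span><br><span class="line">                scores[i, context_banned] = <span class="variable language_">self</span>.penalty</span><br><span class="line">            <span class="comment"># Restore the special tokens to their unmasked scores</span></span><br><span class="line">            scores[i, special_ids] = special_scores</span><br><span class="line"></span><br><span class="line">        <span class="keyword">return</span> scores</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">reset</span>(<span class="params">self</span>):</span><br><span class="line">        <span class="string">&quot;&quot;&quot;Reset all states before each new generate call.&quot;&quot;&quot;</span></span><br><span class="line">        <span class="variable language_">self</span>.states = []</span><br></pre></td></tr></table></figure></details>&emsp;<p><strong>Results</strong><br />The simple constrained decoding I implemented gave rather odd results: the generated formulas became syntactically valid, but their correctness dropped.<br />On a random sample of 100 synthetic formulas, the score fell sharply while the number of render failures went from 3 down to 1; on a different subset there were even more render failures.<br />The likely cause is that constraints at the level of single tokens disrupt the model's ability to represent LaTeX. That ability is built on syntactic units,<br />which span several tokens (or, conversely, a single token may cover more than one basic unit), so token-level constraints do not line up with syntactic units.</p><img                           lazyload                       alt="image"                       data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/随机 100 个合成公式的测试结果.png"                         alt="Test results on 100 random synthetic formulas" style="clear:both;display:block;" width="50%"                 >
<h2 id="插曲"><a class="markdownIt-Anchor" href="#插曲"></a> An Interlude</h2><p>While running the evaluation I found that some cases raised errors: the image rendered successfully, yet every similarity score was 0. A closer look showed that the problem was in the image-loading code.<br />When dvipng converts the compiled LaTeX output to PNG, it may optimize RGBA into a colormap (palette) image, which cv2.imread then reads incorrectly. This probably affects the hash as well.<br />I reproduced the bug and filed an issue and a PR. In the WeChat group I saw that scores on leaderboard B seemed to fluctuate; I do not know whether this was the cause.<br />Similarity scores for the same image, differing only because of the loading problem:</p><div style="display: flex; align-items: center; gap: 20px; max-width: 100%">  <img                           lazyload                       alt="image"                       data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/评分结果1.png"                                style="width: auto; height: auto; max-height: 100%; max-width: 49%;" alt="Score with the bug (result 1)"                 >  <img                           lazyload                       alt="image"                       data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/评分结果2.png"                                style="width: auto; height: auto; max-height: 100%; max-width: 49%;" alt="Score after the fix (result 2)"                 ></div>
<h2 id="经验-教训与感受"><a class="markdownIt-Anchor" href="#经验-教训与感受"></a> Lessons Learned and Reflections</h2><p>Looking back, my approach to the competition lacked method and came close to alchemy.</p><ul><li>Build a validation set and a local evaluation pipeline early, and inspect the failing samples.<br />That would have let me start on constrained decoding and post-processing sooner, and use local evaluation to guide data synthesis.</li><li>Alchemy, but scientific: draw up an experiment plan and run a trial training for every promising direction.<br />For example, I explored full fine-tuning too little; staged fine-tuning and fine-tuning on a cleaned synthetic dataset were both worth trying.</li><li>Start on the competition earlier. Well, in truth I had nobody to invite, so I did not have that much compute.</li><li>I did not embrace AI enough (I still could not bring myself to pay for Claude, or to hand things over completely to a coding agent)<br />Last time I failed to grab compute at the end, never got to use the final version of my data, and was pushed straight out of the top 20. Afterwards I studied a published top-three solution, and the AI-written code and AI-written report made my vision go dark. No doubt I was up against Claude Opus again this time. On the principle of "if you can't beat them, join them", I used a lot of AI-written code, which I reviewed and verified, but the efficiency gain was limited. This time I will hand my prize money over to <del>Zhipu</del>. (Fine, having Codex is already enough.)</li><li>Progress since the first time<br />The first time, I did not understand at all how a model repository in GitHub format differs from one in HF format. This time I finally have some grasp of it, and a clearer picture of how an LLM/VLM gets from input to output. While trying the ms-swift framework I also got RL code running for the first time (albeit with trl). Reinforcement learning is not some lofty, unreachable thing!</li><li>Distill the LLM's knowledge into myself<br />Accept AI, embrace AI, and work to reach the frontier of what AI can do, without mistaking AI's abilities for my own.<br />But must we master every detail in order to get something done? At this moment I have my doubts. Must a leader's abilities cover or exceed those of their subordinates? One should know both the what and the why, yet learning has no end. I can only hope that "the stronger the user, the stronger the Stand"; otherwise all that matters is raising the human-machine synchronization rate, and what would be the point of working out? Perhaps I am still lingering in the dopamine comfort zone that "learning technology" provides.</li><li>For reliable systems, vibe coding requires review.</li></ul>
<h2 id="仓库"><a class="markdownIt-Anchor" href="#仓库"></a> Repository</h2><p>Apart from the first version of the data generator, which was lost, the assorted unpolished code from this competition is in kafmws/VLM-formula-recognition-dataset.git</p><h2 id="未尽探索"><a class="markdownIt-Anchor" href="#未尽探索"></a> Unexplored Directions</h2><p>Increasing the number of visual tokens and raising the input image resolution are also obvious directions. I worried they would damage the pretrained model's visual feature extraction, so I never tried these passing thoughts, which is a bit of a pity. I also came to RL and constrained decoding too late to get them into a submission.<br />See you next time!</p><p>2026/03/02</p>]]></content>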
    
    
      
      
    <summary type="html">&lt;h1 id=&quot;参赛记录&quot;&gt;&lt;a class=&quot;markdownIt-Anchor&quot; href=&quot;#参赛记录&quot;&gt;&lt;/a&gt; Competition Notes&lt;/h1&gt;
&lt;p&gt;This post records parts of my experience taking part in a 书生 (Intern) large-model community competition.&lt;/p&gt;
&lt;h2 id=&quot;比赛简介&quot;&gt;&lt;a class=&quot;markdownIt-Ancho</summary>
      
    
    
    
    <category term="record" scheme="https://www.kafm.eu.org/categories/record/"/>
    
    
    <category term="LLM杂七杂八" scheme="https://www.kafm.eu.org/tags/LLM%E6%9D%82%E4%B8%83%E6%9D%82%E5%85%AB/"/>
    
    <category term="实践" scheme="https://www.kafm.eu.org/tags/%E5%AE%9E%E8%B7%B5/"/>
    
  </entry>
  
  <entry>
    <title>LLM Structured Output: Principles, Implementations, and Practice</title>
    <link href="https://www.kafm.eu.org/note/LLM/LLM%20%E7%BB%93%E6%9E%84%E5%8C%96%E8%BE%93%E5%87%BA%E5%8E%9F%E7%90%86%E3%80%81%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F%E5%8F%8A%E5%AE%9E%E8%B7%B5/"/>
    <id>https://www.kafm.eu.org/note/LLM/LLM%20%E7%BB%93%E6%9E%84%E5%8C%96%E8%BE%93%E5%87%BA%E5%8E%9F%E7%90%86%E3%80%81%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F%E5%8F%8A%E5%AE%9E%E8%B7%B5/</id>
    <published>2026-02-03T02:43:54.000Z</published>
    <updated>2026-03-05T08:20:24.000Z</updated>
    
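    <content type="html"><![CDATA[<h2 id="结构化输出的意义"><a class="markdownIt-Anchor" href="#结构化输出的意义"></a> Why Structured Output Matters</h2><p>An LLM is a probabilistic model whose output cannot be fully controlled; structured output makes that output more controllable.<br />With this controllability, an LLM gains a programmatic interface and can be connected reliably to upstream and downstream programs, for example to call tools.</p>
<h2 id="结构化输出的原理及实现方式"><a class="markdownIt-Anchor" href="#结构化输出的原理及实现方式"></a> How Structured Output Works and How It Is Implemented</h2><p>Formatted output is usually achieved in one of the following ways:</p><ul><li>Prompt engineering<br />Steer the model toward formatted output with prompting techniques such as the instruction "output JSON only" or few-shot examples.</li><li>Post-processing<br />Libraries such as instructor post-process the model output to extract or repair it, including feeding parse errors back to the model and regenerating.</li><li>Constrained decoding<br />When a token is chosen, a mask directly excludes candidate tokens that would violate the output format, so the token is picked from a restricted output space.<br />Take <code>{&quot;key&quot;: value}</code>: from letting the model predict only the <code>value</code> part of the format string to constraining the format or type of <code>value</code>, constrained decoding guarantees reliably formatted output.<br />Going a step further, decoding constrained by a state machine equivalent to the target format makes arbitrary formatted output possible. Constrained decoding is the main engineering technique behind structured output.</li><li>SFT / RL<br />Post-training shapes the model's output preferences so that it follows a given pattern. It is highly reliable, and is the foundation and source of the structured-output capability.</li></ul><p>Many model vendors now strengthen formatted-output ability through SFT/RL and implement constrained decoding on the inference side (Ollama, for example), and so offer a reliable structured-output API.</p>
<h2 id="工具调用与结构化输出先有鸡还是先有蛋"><a class="markdownIt-Anchor" href="#工具调用与结构化输出先有鸡还是先有蛋"></a> Tool Calling and Structured Output: Chicken or Egg?</h2><p>Only with structured output can we handle tool calls reliably.<br />Yet if a model can call a given tool, any output format can be wrapped as a tool; the LLM fills in the call arguments, and output in the specified format follows.<br />Each seems to enable the other, which makes it a chicken-and-egg question.<br />Many vendors stress that their models support <code>tool use/ function call</code> but say little about structured output. Perhaps tool use is the core of structured output, and structured output layers a range of engineering techniques on top to provide reliability.</p>
<h2 id="本地模型在-langchain-中的结构化输出"><a class="markdownIt-Anchor" href="#本地模型在-langchain-中的结构化输出"></a> Structured Output with Local Models in langchain</h2><p>This post discusses the following langchain package versions (as of 2026-02-02):</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">langchain-core            1.2.7</span><br><span class="line">langchain-openai          1.1.7</span><br><span class="line">langchain-ollama          1.0.1</span><br></pre></td></tr></table></figure><p>Using the JSON format of the <code>YesNoJudge</code> class, I tested structured output with the different formatting methods of <code>with_structured_output()</code> (the <code>method</code> parameter).<br />The <code>method</code> parameter accepts <code>json_schema</code>, <code>function_calling</code> and <code>json_mode</code>. <code>json_mode</code> cannot guarantee JSON in a fixed format and needs post-processing, so it is not recommended. (Source: <a class="link"   href="https://platform.openai.com/docs/guides/structured-outputs#json-mode" >https://platform.openai.com/docs/guides/structured-outputs#json-mode<i class="fas fa-external-link-alt"></i></a>)</p>
<figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">from</span> typing <span class="keyword">import</span> <span class="type">Literal</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">from</span> langchain_openai <span class="keyword">import</span> ChatOpenAI</span><br><span class="line"><span class="keyword">from</span> pydantic <span class="keyword">import</span> BaseModel, Field</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="keyword">class</span> <span class="title class_">YesNoJudge</span>(<span class="title class_ inherited__">BaseModel</span>):</span><br><span class="line">    <span class="string">&quot;&quot;&quot;return judge result from the context with this function.&quot;&quot;&quot;</span></span><br><span class="line"></span><br><span class="line">    yes_or_no: <span class="type">Literal</span>[<span class="string">&quot;yes&quot;</span>, <span class="string">&quot;no&quot;</span>] = Field(</span><br><span class="line">        description=<span class="string">&quot;The answer should be either &#x27;yes&#x27; or &#x27;no&#x27;&quot;</span></span><br><span class="line">    )</span><br><span class="line">    explanation: <span class="built_in">str</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># model_name and fast_judge_config are defined elsewhere</span></span><br><span class="line">judge = ChatOpenAI(</span><br><span class="line">    model=model_name, temperature=<span class="number">0.6</span>, streaming=<span class="literal">True</span>, **fast_judge_config</span><br><span class="line">)</span><br><span class="line">judge = judge.with_structured_output(</span><br><span class="line">    schema=YesNoJudge, method=<span class="string">&quot;function_calling&quot;</span>, strict=<span class="literal">True</span>, include_raw=<span class="literal">True</span></span><br><span class="line">)</span><br></pre></td></tr></table></figure>
<img                           lazyload                       alt="image"                       data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/20260201223737.png"                         alt="With the json_schema method and a prompt asking for JSON output, gpt-oss:20b still calls a nonexistent tool to produce formatted output"                 >Figure: with the `json_schema` method and a prompt asking for JSON output, gpt-oss:20b still calls a nonexistent tool to produce formatted output<img                           lazyload                       alt="image"                       data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/20260202030229.png"                         alt="With the function_calling method and a prompt asking for JSON output, qwen3:4b writes the JSON into the message content and does not call the tool"                 >Figure: with the `function_calling` method and a prompt asking for JSON output, qwen3:4b writes the JSON into the message content and does not call the tool<img                           lazyload                       alt="image"                       data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/20260202030504.png"                         alt="With the function_calling method and the JSON prompt removed, qwen3:4b still tends to follow the instruction and write the result into the content, without calling the tool"                 >Figure: with the `function_calling` method and the JSON prompt removed, qwen3:4b still tends to follow the instruction and write the result into the content, without calling the tool<img                           lazyload                       alt="image"                       data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/20260202031010.png"                         alt="20260202031010"                 >Figure: with the `function_calling` method and the instruction-style prompt removed, qwen3:4b sometimes calls the tool to return the result<p><strong>This suggests that the Qwen model puts more weight on instruction following, while GPT leans toward using tools</strong></p>
<h3 id="langchain-中-with_structured_output-的实现简析"><a class="markdownIt-Anchor" href="#langchain-中-with_structured_output-的实现简析"></a> A Brief Look at How with_structured_output() Is Implemented in langchain</h3><p>Each backend integration package implements with_structured_output() differently.<br />For langchain-openai:</p><ul><li>function_calling passes the JSON schema in as a tool and sets the tool_choice parameter, forcing the LLM to call the tool and so emit arguments that follow the tool's schema.<br />In effect this turns tool-calling ability into structured-output ability.</li><li>json_schema instead parses the content of the model's output.</li><li>json_mode is best paired with a prompt, and depends entirely on the model.<br />The response_format of the create_agent API likewise calls the LLM backend's structured-output capability directly.</li></ul>
<h2 id="langchain-langchain-openai-ollama-后端结构化输出踩的坑"><a class="markdownIt-Anchor" href="#langchain-langchain-openai-ollama-后端结构化输出踩的坑"></a> Pitfalls of Structured Output with langchain + langchain-openai + an Ollama Backend</h2><p>Why langchain-openai rather than langchain-ollama? Because I assumed that staying compatible with the OpenAI API would work out better.<br />On top of that, langchain-ollama + Ollama kept returning 502 errors, while openai and curl both worked fine.</p><ul><li><p>Compatible in appearance, not in practice<br />For example, the with_structured_output() implementations of chat_models in langchain[openai, ollama] are inconsistent.<br />Moreover, when langchain-openai is used with Ollama as the backend, the behavior of with_structured_output depends heavily on the model, and the method parameter has to be tuned per model.<br />Supporting several backends in practice still requires adaptation work.</p></li><li><p>Several implementations, uneven results<br />For example, <code>with_structured_output()</code> of <code>chat_models</code> in <code>langchain[openai, ollama]</code> and the <code>response_format</code> of <code>from langchain.agents import create_agent</code> both provide structured output, yet they differ in reliability.</p></li><li><p>Over-encapsulation<br />The source code is littered with # type: ignore comments that silence type-mismatch warnings.<br />Passing the messages of a MessageState into the messages of an AgentState triggers a type-mismatch warning, which is a little funny.</p></li></ul>]]></content>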
    
    
      
      
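    <summary type="html">&lt;h2 id=&quot;结构化输出的意义&quot;&gt;&lt;a class=&quot;markdownIt-Anchor&quot; href=&quot;#结构化输出的意义&quot;&gt;&lt;/a&gt; Why Structured Output Matters&lt;/h2&gt;
&lt;p&gt;An LLM is a probabilistic model whose output cannot be fully controlled; structured output makes that output more controllable.&lt;br /&gt;
With this controllability, an LLM gains</summary>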
      
    
    
    
    <category term="note" scheme="https://www.kafm.eu.org/categories/note/"/>
    
    
    <category term="LLM杂七杂八" scheme="https://www.kafm.eu.org/tags/LLM%E6%9D%82%E4%B8%83%E6%9D%82%E5%85%AB/"/>
    
  </entry>
  
  <entry>
    <title>The Geese and Ducks of Longtan Park Live a Carefree Life</title>
    <link href="https://www.kafm.eu.org/fragment/Fragments/%E9%BE%99%E6%BD%AD%E5%85%AC%E5%9B%AD%E9%87%8C%EF%BC%8C%E9%B9%85%E5%92%8C%E9%B8%AD%E4%BB%AC%E6%97%A5%E5%AD%90%E8%BF%87%E5%BE%97%E6%97%A0%E5%BF%A7%E6%97%A0%E8%99%91/"/>
    <id>https://www.kafm.eu.org/fragment/Fragments/%E9%BE%99%E6%BD%AD%E5%85%AC%E5%9B%AD%E9%87%8C%EF%BC%8C%E9%B9%85%E5%92%8C%E9%B8%AD%E4%BB%AC%E6%97%A5%E5%AD%90%E8%BF%87%E5%BE%97%E6%97%A0%E5%BF%A7%E6%97%A0%E8%99%91/</id>
    <published>2025-12-21T12:43:54.000Z</published>
    <updated>2026-03-05T08:20:24.000Z</updated>
    
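    <content type="html"><![CDATA[<p>The geese and ducks of Longtan Park do live a carefree life.</p><p>Most of the pond is frozen over. Walk a full circuit around it, and at every stretch of shore where the geese and ducks gather, someone is feeding them: steamed buns, feed I cannot name, and even a young man in glasses who has brought a big bunch of grapes.<br />Crumbs of steamed bun float on the water; a duck drifts over, stretches its neck, takes one in its bill, and down it goes. The grapes all sink.<br />Once a duck is standing on the ice, grapes come rolling down around it. Surely it can eat them now!<br />But grapes are perfectly round, and this duck with its dark green head, working with a narrow, flat bill, drops them as many times as it picks them up. Only a very few grapes fulfil their mission.<br />The man in glasses does not mind at all. On the ice or in the water, eaten or not, reachable or not, none of it matters; feeding is a process in itself.</p><p>Further along there is another big flock of ducks, and an auntie carrying a large bag of steamed buns. These ducks, too, only snatch a bite or two between laps, but by swimming to the shore in a crowd they still do the auntie full honor.<br />A bystander remarks that someone already fed them first thing this morning!<br />No wonder the ducks eat so half-heartedly. They are only sampling the seasoning, to see whether it suits their taste.</p><p>In Longtan Park, the geese and ducks do live a carefree life.</p><p>December 21, 2025</p>]]></content>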
    
    
      
      
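    <summary type="html">&lt;p&gt;The geese and ducks of Longtan Park do live a carefree life.&lt;/p&gt;
&lt;p&gt;Most of the pond is frozen over. Walk a full circuit around it, and at every stretch of shore where the geese and ducks gather, someone is feeding them: steamed buns, feed I cannot name, and even a young man in glasses who has brought a big bunch of grapes.&lt;br /&gt;
Crumbs of steamed bun float on the water; a duck drifts over, stretches its neck, takes one in its bill, and down it goes. The grapes all sink.&lt;br /&gt;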
</summary>
      
    
    
    
    <category term="fragment" scheme="https://www.kafm.eu.org/categories/fragment/"/>
    
    
    <category term="随笔" scheme="https://www.kafm.eu.org/tags/%E9%9A%8F%E7%AC%94/"/>
    
  </entry>
  
  <entry>
    <title>Notes from the 书生 (Intern) Large-Model Paper Classification Fine-Tuning Competition</title>
    <link href="https://www.kafm.eu.org/record/LLM/%E4%B9%A6%E7%94%9F%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%AE%BA%E6%96%87%E5%88%86%E7%B1%BB%E5%BE%AE%E8%B0%83%E8%B5%9B%E5%8F%82%E8%B5%9B%E8%AE%B0%E5%BD%95/"/>
    <id>https://www.kafm.eu.org/record/LLM/%E4%B9%A6%E7%94%9F%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%AE%BA%E6%96%87%E5%88%86%E7%B1%BB%E5%BE%AE%E8%B0%83%E8%B5%9B%E5%8F%82%E8%B5%9B%E8%AE%B0%E5%BD%95/</id>
    <published>2025-08-06T06:43:54.000Z</published>
    <updated>2026-04-06T06:33:50.000Z</updated>
    
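    <content type="html"><![CDATA[<h2 id="写在前面"><a class="markdownIt-Anchor" href="#写在前面"></a> Foreword</h2><p><a class="link"   href="https://aicarrier.feishu.cn/wiki/Gr7Iw6vhTiniMUkBIPvcfBiAnkg" >https://aicarrier.feishu.cn/wiki/Gr7Iw6vhTiniMUkBIPvcfBiAnkg<i class="fas fa-external-link-alt"></i></a></p><p>A bit of a pity: toward the end I failed to grab a GPU. I felt I had found the right way to pile on data but never got to scale it up, and I finally slid from around 10th place to 28th.</p><p>I read the report and code of one top-ranked competitor. The AI content was very heavy: both the code and the report were written by AI.</p><p>We fans of hand-crafted, old-school programming have lost to AI after all 🤡</p><p>No, I have not lost yet. I just failed to grab a GPU and never scaled up.</p><p>I entered the competition with the following creed:</p><blockquote><p>Random is a strong baseline.</p></blockquote><blockquote><p>DL is data-driven ML.</p></blockquote>
<h2 id="微调策略"><a class="markdownIt-Anchor" href="#微调策略"></a> Fine-Tuning Strategy</h2><p>After a cursory survey of fine-tuning methods I came back to LoRA. Since the LLM was quantized, I suppose it counts as QLoRA.</p><p>I adjusted the batch size and gradient accumulation but left the learning rate and the LoRA parameters alone. Trying different multiples between LoRA's rank and alpha would be worth a try.</p><p>Still, I believe that for this competition data is what wins.</p><p>I went straight to SFT with no pretraining (even though the tutorial teaches everyone to pretrain …)</p><p><strong>Data processing</strong></p><ul><li>One category name, math.PH, needs to be corrected, or else math-ph has to be mapped back to it.</li></ul><p><strong>Richer templates</strong></p><ul><li>I generated some templates with DeepSeek and some with ChatGPT, and ended up using 136 conversation templates and 22 system instructions.</li><li>I tried swapping the order of the options and the question stem, and adding Chinese names to the options. Both lowered the score, so I went back to templates in the original format.</li><li>If I had set up a test split, I would certainly have tried many fancier templates, such as shuffling the option order<br />(whether training would still converge might need checking, but for LoRA it should be fine).</li></ul><p><strong>Scaling up</strong></p><ul><li>The dataset on Kaggle is quite recent and differs little from arXiv itself, so the gain from crawling arXiv seemed negligible. For a tail category such as cs.OS, crawling appeared to add only a few dozen records from the most recent two or three months.</li><li>Once I had found the right direction (a moderate number of simple templates plus correct paper extraction and sample generation), I stepped through 1k, 2k, 3k and 6k samples per category.</li><li>I did not oversample the tail categories. With a small model I was wary of overfitting, and an imbalance of less than 10x is not a severe long tail. Besides, all the tail-category data was already included, so the model was already memorizing with the answers in hand.</li></ul><p><strong>Data filtering</strong></p><ul><li>Single category =&gt; primary category =&gt; filter out cross-listed samples</li><li>At first, for the sake of data quality, I kept only single-category samples, which left too few. I then widened the selection to samples whose primary category is the target category.</li><li>Inspecting the data later, I found that some categories overlap heavily. cs.MM, for instance, is by nature a field that crosses heavily with CV/AI and the like, and it has few samples. I tried removing the cross-listed samples and saw no clear difference in training.</li><li>In the end I noticed that the task document had been updated...<br /><img                           lazyload                       alt="image"                       data-src="https://cdn.jsdelivr.net/gh/kafmws/pictures/notes/20260406132427.png"                         alt="The updated rule in the task document" style="width:50%"                 ></li></ul><p>With that settled, I filtered out all cross-listed samples.</p>
<h2 id="不足之处"><a class="markdownIt-Anchor" href="#不足之处"></a> Shortcomings</h2><ul><li>I neither considered nor tried full-parameter training (whether full fine-tuning would really beat LoRA on this task is an open question).</li><li>I did not adjust which modules LoRA targets, for example by adding o_proj.</li><li>I did not hold on to compute firmly enough, so I never scaled up.</li></ul><h2 id="数据处理代码"><a class="markdownIt-Anchor" href="#数据处理代码"></a> Data Processing Code</h2><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span 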
class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span 
class="line">115</span><br><span class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br><span class="line">151</span><br><span class="line">152</span><br><span class="line">153</span><br><span class="line">154</span><br><span class="line">155</span><br><span class="line">156</span><br><span class="line">157</span><br><span class="line">158</span><br><span class="line">159</span><br><span class="line">160</span><br><span class="line">161</span><br><span class="line">162</span><br><span class="line">163</span><br><span class="line">164</span><br><span class="line">165</span><br><span class="line">166</span><br><span class="line">167</span><br><span class="line">168</span><br><span class="line">169</span><br><span class="line">170</span><br><span class="line">171</span><br><span class="line">172</span><br><span class="line">173</span><br><span class="line">174</span><br><span 
class="line">175</span><br><span class="line">176</span><br><span class="line">177</span><br><span class="line">178</span><br><span class="line">179</span><br><span class="line">180</span><br><span class="line">181</span><br><span class="line">182</span><br><span class="line">183</span><br><span class="line">184</span><br><span class="line">185</span><br><span class="line">186</span><br><span class="line">187</span><br><span class="line">188</span><br><span class="line">189</span><br><span class="line">190</span><br><span class="line">191</span><br><span class="line">192</span><br><span class="line">193</span><br><span class="line">194</span><br><span class="line">195</span><br><span class="line">196</span><br><span class="line">197</span><br><span class="line">198</span><br><span class="line">199</span><br><span class="line">200</span><br><span class="line">201</span><br><span class="line">202</span><br><span class="line">203</span><br><span class="line">204</span><br><span class="line">205</span><br><span class="line">206</span><br><span class="line">207</span><br><span class="line">208</span><br><span class="line">209</span><br><span class="line">210</span><br><span class="line">211</span><br><span class="line">212</span><br><span class="line">213</span><br><span class="line">214</span><br><span class="line">215</span><br><span class="line">216</span><br><span class="line">217</span><br><span class="line">218</span><br><span class="line">219</span><br><span class="line">220</span><br><span class="line">221</span><br><span class="line">222</span><br><span class="line">223</span><br><span class="line">224</span><br><span class="line">225</span><br><span class="line">226</span><br><span class="line">227</span><br><span class="line">228</span><br><span class="line">229</span><br><span class="line">230</span><br><span class="line">231</span><br><span class="line">232</span><br><span class="line">233</span><br><span class="line">234</span><br><span 
class="line">235</span><br><span class="line">236</span><br><span class="line">237</span><br><span class="line">238</span><br><span class="line">239</span><br><span class="line">240</span><br><span class="line">241</span><br><span class="line">242</span><br><span class="line">243</span><br><span class="line">244</span><br><span class="line">245</span><br><span class="line">246</span><br><span class="line">247</span><br><span class="line">248</span><br><span class="line">249</span><br><span class="line">250</span><br><span class="line">251</span><br><span class="line">252</span><br><span class="line">253</span><br><span class="line">254</span><br><span class="line">255</span><br><span class="line">256</span><br><span class="line">257</span><br><span class="line">258</span><br><span class="line">259</span><br><span class="line">260</span><br><span class="line">261</span><br><span class="line">262</span><br><span class="line">263</span><br><span class="line">264</span><br><span class="line">265</span><br><span class="line">266</span><br><span class="line">267</span><br><span class="line">268</span><br><span class="line">269</span><br><span class="line">270</span><br><span class="line">271</span><br><span class="line">272</span><br><span class="line">273</span><br><span class="line">274</span><br><span class="line">275</span><br><span class="line">276</span><br><span class="line">277</span><br><span class="line">278</span><br><span class="line">279</span><br><span class="line">280</span><br><span class="line">281</span><br><span class="line">282</span><br><span class="line">283</span><br><span class="line">284</span><br><span class="line">285</span><br><span class="line">286</span><br><span class="line">287</span><br><span class="line">288</span><br><span class="line">289</span><br><span class="line">290</span><br><span class="line">291</span><br><span class="line">292</span><br><span class="line">293</span><br><span class="line">294</span><br><span 
class="line">295</span><br><span class="line">296</span><br><span class="line">297</span><br><span class="line">298</span><br><span class="line">299</span><br><span class="line">300</span><br><span class="line">301</span><br><span class="line">302</span><br><span class="line">303</span><br><span class="line">304</span><br><span class="line">305</span><br><span class="line">306</span><br><span class="line">307</span><br><span class="line">308</span><br><span class="line">309</span><br><span class="line">310</span><br><span class="line">311</span><br><span class="line">312</span><br><span class="line">313</span><br><span class="line">314</span><br><span class="line">315</span><br><span class="line">316</span><br><span class="line">317</span><br><span class="line">318</span><br><span class="line">319</span><br><span class="line">320</span><br><span class="line">321</span><br><span class="line">322</span><br><span class="line">323</span><br><span class="line">324</span><br><span class="line">325</span><br><span class="line">326</span><br><span class="line">327</span><br><span class="line">328</span><br><span class="line">329</span><br><span class="line">330</span><br><span class="line">331</span><br><span class="line">332</span><br><span class="line">333</span><br><span class="line">334</span><br><span class="line">335</span><br><span class="line">336</span><br><span class="line">337</span><br><span class="line">338</span><br><span class="line">339</span><br><span class="line">340</span><br><span class="line">341</span><br><span class="line">342</span><br><span class="line">343</span><br><span class="line">344</span><br><span class="line">345</span><br><span class="line">346</span><br><span class="line">347</span><br><span class="line">348</span><br><span class="line">349</span><br><span class="line">350</span><br><span class="line">351</span><br><span class="line">352</span><br><span class="line">353</span><br><span class="line">354</span><br><span 
class="line">355</span><br><span class="line">356</span><br><span class="line">357</span><br><span class="line">358</span><br><span class="line">359</span><br><span class="line">360</span><br><span class="line">361</span><br><span class="line">362</span><br><span class="line">363</span><br><span class="line">364</span><br><span class="line">365</span><br><span class="line">366</span><br><span class="line">367</span><br><span class="line">368</span><br><span class="line">369</span><br><span class="line">370</span><br><span class="line">371</span><br><span class="line">372</span><br><span class="line">373</span><br><span class="line">374</span><br><span class="line">375</span><br><span class="line">376</span><br><span class="line">377</span><br><span class="line">378</span><br><span class="line">379</span><br><span class="line">380</span><br><span class="line">381</span><br><span class="line">382</span><br><span class="line">383</span><br><span class="line">384</span><br><span class="line">385</span><br><span class="line">386</span><br><span class="line">387</span><br><span class="line">388</span><br><span class="line">389</span><br><span class="line">390</span><br><span class="line">391</span><br><span class="line">392</span><br><span class="line">393</span><br><span class="line">394</span><br><span class="line">395</span><br><span class="line">396</span><br><span class="line">397</span><br><span class="line">398</span><br><span class="line">399</span><br><span class="line">400</span><br><span class="line">401</span><br><span class="line">402</span><br><span class="line">403</span><br><span class="line">404</span><br><span class="line">405</span><br><span class="line">406</span><br><span class="line">407</span><br><span class="line">408</span><br><span class="line">409</span><br><span class="line">410</span><br><span class="line">411</span><br><span class="line">412</span><br><span class="line">413</span><br><span class="line">414</span><br><span 
class="line">415</span><br><span class="line">416</span><br><span class="line">417</span><br><span class="line">418</span><br><span class="line">419</span><br><span class="line">420</span><br><span class="line">421</span><br><span class="line">422</span><br><span class="line">423</span><br><span class="line">424</span><br><span class="line">425</span><br><span class="line">426</span><br><span class="line">427</span><br><span class="line">428</span><br><span class="line">429</span><br><span class="line">430</span><br><span class="line">431</span><br><span class="line">432</span><br><span class="line">433</span><br><span class="line">434</span><br><span class="line">435</span><br><span class="line">436</span><br><span class="line">437</span><br><span class="line">438</span><br><span class="line">439</span><br><span class="line">440</span><br><span class="line">441</span><br><span class="line">442</span><br><span class="line">443</span><br><span class="line">444</span><br><span class="line">445</span><br><span class="line">446</span><br><span class="line">447</span><br><span class="line">448</span><br><span class="line">449</span><br><span class="line">450</span><br><span class="line">451</span><br><span class="line">452</span><br><span class="line">453</span><br><span class="line">454</span><br><span class="line">455</span><br><span class="line">456</span><br><span class="line">457</span><br><span class="line">458</span><br><span class="line">459</span><br><span class="line">460</span><br><span class="line">461</span><br><span class="line">462</span><br><span class="line">463</span><br><span class="line">464</span><br><span class="line">465</span><br><span class="line">466</span><br><span class="line">467</span><br><span class="line">468</span><br><span class="line">469</span><br><span class="line">470</span><br><span class="line">471</span><br><span class="line">472</span><br><span class="line">473</span><br><span class="line">474</span><br><span 
class="line">475</span><br><span class="line">476</span><br><span class="line">477</span><br><span class="line">478</span><br><span class="line">479</span><br><span class="line">480</span><br><span class="line">481</span><br><span class="line">482</span><br><span class="line">483</span><br><span class="line">484</span><br><span class="line">485</span><br><span class="line">486</span><br><span class="line">487</span><br><span class="line">488</span><br><span class="line">489</span><br><span class="line">490</span><br><span class="line">491</span><br><span class="line">492</span><br><span class="line">493</span><br><span class="line">494</span><br><span class="line">495</span><br><span class="line">496</span><br></pre></td><td class="code"><pre><span class="line"><span class="keyword">import</span> os</span><br><span class="line"><span class="keyword">import</span> json</span><br><span class="line"><span class="keyword">import</span> random</span><br><span class="line"></span><br><span class="line">random.seed(<span class="number">42</span>)</span><br><span class="line"></span><br><span class="line">output_base_dir = <span class="string">&#x27;internlm/dataset&#x27;</span></span><br><span class="line"><span class="keyword">if</span> <span class="keyword">not</span> os.path.exists(output_base_dir):</span><br><span class="line">    os.makedirs(output_base_dir)</span><br><span class="line"></span><br><span class="line">input_templates = [</span><br><span class="line">    <span class="string">&quot;Based on the title &#x27;&#123;title&#125;&#x27;, authors &#x27;&#123;authors&#125;&#x27;, and abstract &#x27;&#123;abstract&#125;&#x27;, please determine the scientific category of this paper.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Classification Request: Given the title &#x27;&#123;title&#125;&#x27;, authored by &#x27;&#123;authors&#125;&#x27;, and abstract &#x27;&#123;abstract&#125;&#x27;, identify the 
research field of this paper.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Field Determination: Analyze the title &#x27;&#123;title&#125;&#x27;, authors &#x27;&#123;authors&#125;&#x27;, and abstract &#x27;&#123;abstract&#125;&#x27; to assign a discipline category.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Academic Categorization: Based on &#x27;&#123;title&#125;&#x27; (authors: &#x27;&#123;authors&#125;&#x27;) and abstract content &#x27;&#123;abstract&#125;&#x27;, classify this paper into a scientific domain.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Domain Assignment: Using the title &#x27;&#123;title&#125;&#x27;, author list &#123;authors&#125;, and abstract text &#x27;&#123;abstract&#125;&#x27;, determine the most relevant academic field.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Research Area Identification: From the paper titled &#x27;&#123;title&#125;&#x27; (by &#123;authors&#125;) and abstract &#x27;&#123;abstract&#125;&#x27;, infer its primary research area.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Paper Taxonomy: Categorize the paper with title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27; into a specific scientific discipline.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Subject Labeling: With the metadata: Title &#x27;&#123;title&#125;&#x27;, Authors &#123;authors&#125;, Abstract &#x27;&#123;abstract&#125;&#x27;, generate a subject classification.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Knowledge Domain Inference: Based on &#x27;&#123;title&#125;&#x27; (by 
&#123;authors&#125;) and abstract &#x27;&#123;abstract&#125;&#x27;, predict the broad field of study.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Scientific Field Prediction: Analyze the title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27; to output a single discipline label.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Multi-Metadata Classification: Integrate the paper\&#x27;s title &#x27;&#123;title&#125;&#x27;, author affiliations &#x27;&#123;authors&#125;&#x27;, and abstract &#x27;&#123;abstract&#125;&#x27; to assign a research category.&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;分类请求：根据标题“&#123;title&#125;”、作者“&#123;authors&#125;”和摘要“&#123;abstract&#125;”，请确定该论文的研究领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;领域判定：结合标题“&#123;title&#125;”、作者&#123;authors&#125;及摘要内容“&#123;abstract&#125;”，判断此论文所属学科类别。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;学术分类：基于论文标题“&#123;title&#125;”（作者：&#123;authors&#125;）和摘要“&#123;abstract&#125;”，将其划分到具体的科学领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;学科标注：根据标题“&#123;title&#125;”、作者列表&#123;authors&#125;和摘要文本“&#123;abstract&#125;”，确定最相关的学术领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;研究方向识别：从标题为“&#123;title&#125;”（作者&#123;authors&#125;）及摘要“&#123;abstract&#125;”中推断其主要研究方向。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;文献归类：将标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”的论文归类至特定学科门类。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span 
class="string">&quot;主题分类：根据元数据：标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”，生成一个学科分类标签。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;知识领域推断：基于标题“&#123;title&#125;”（作者&#123;authors&#125;）及摘要“&#123;abstract&#125;”，预测其所属广泛研究领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;科学领域预测：分析标题“&#123;title&#125;”、作者&#123;authors&#125;和摘要“&#123;abstract&#125;”，输出单一学科标签。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;多维度分类：综合论文标题“&#123;title&#125;”、作者信息&#123;authors&#125;和摘要“&#123;abstract&#125;”，划分研究类别。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Label the research domain of this paper by analyzing:\nTitle: &#123;title&#125;\nAuthors: &#123;authors&#125;\nKey findings: &#123;abstract&#125;&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;Q: Which academic field does the paper &#x27;&#123;title&#125;&#x27; by &#123;authors&#125; belong to, given its abstract: &#x27;&#123;abstract&#125;&#x27;?\nA: The field is:&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;This paper [&#123;title&#125;] authored by &#123;authors&#125; primarily focuses on ______ (fill in the field), as evidenced by the abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;Reviewer Task: Based on the title &#x27;&#123;title&#125;&#x27;, author affiliations &#123;authors&#125;, and abstract summary &#x27;&#123;abstract&#125;&#x27;, assign a discipline category from the taxonomy codes.&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;If the paper &#x27;&#123;title&#125;&#x27; by 
&#123;authors&#125; were a book in a library, which section would it shelve in? Abstract clues: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Step 1: Extract keywords from &#x27;&#123;title&#125;&#x27; and abstract: &#x27;&#123;abstract&#125;&#x27;.\nStep 2: Cross-reference with author &#x27;&#123;authors&#125;&#x27; expertise.\nStep 3: Output the dominant field.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Compare these metadata to classify the paper:\nTitle focus: &#123;title&#125;\nAuthor expertise: &#123;authors&#125;\nAbstract emphasis: &#123;abstract&#125;\nConclusion: The paper belongs to _____ field.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Can you accurately categorize &#123;title&#125; by &#123;authors&#125; just from this abstract? Prove it: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Inputs:\nMetadata: Title=&#123;title&#125;, Authors=&#123;authors&#125;\nContent: Abstract=&#123;abstract&#125;\nProcessing: Apply field codes.\nOutput: Field=?&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;The DNA of this paper (&#123;title&#125; by &#123;authors&#125;) reveals its academic species. Abstract strand: &#x27;&#123;abstract&#125;&#x27;. 
Species identification:&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Research Area Identification: From the paper titled &#x27;&#123;title&#125;&#x27; (by &#123;authors&#125;) and abstract &#x27;&#123;abstract&#125;&#x27;, infer its primary research area.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Paper Taxonomy: Categorize the paper with title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27; into a specific scientific discipline.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Subject Labeling: With the metadata: Title &#x27;&#123;title&#125;&#x27;, Authors &#123;authors&#125;, Abstract &#x27;&#123;abstract&#125;&#x27;, generate a subject classification.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Knowledge Domain Inference: Based on &#x27;&#123;title&#125;&#x27; (by &#123;authors&#125;) and abstract &#x27;&#123;abstract&#125;&#x27;, predict the broad field of study.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Discipline Prediction: Analyze the abstract &#x27;&#123;abstract&#125;&#x27; of the paper &#x27;&#123;title&#125;&#x27; authored by &#123;authors&#125; and suggest the academic domain.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Field Classification Task: Use the title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27; to assign a research category.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Scientific Area Determination: Given the information — Title: &#x27;&#123;title&#125;&#x27;, Authors: &#123;authors&#125;, Abstract: 
&#x27;&#123;abstract&#125;&#x27; — identify the scientific domain.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Area Tagging: From the context of the paper &#x27;&#123;title&#125;&#x27; and its abstract &#x27;&#123;abstract&#125;&#x27;, assign a field label.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Disciplinary Mapping: With the title &#x27;&#123;title&#125;&#x27;, the author(s) &#123;authors&#125;, and the abstract &#x27;&#123;abstract&#125;&#x27;, map this paper to a discipline.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Research Field Suggestion: Based on the content in the title &#x27;&#123;title&#125;&#x27; and abstract &#x27;&#123;abstract&#125;&#x27;, recommend the research field.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Topic Classification: Classify the following paper by title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Academic Field Categorization: Given the title &#x27;&#123;title&#125;&#x27; and abstract &#x27;&#123;abstract&#125;&#x27;, determine which academic field this paper falls into.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Scientific Discipline Inference: Determine the scientific discipline of the paper titled &#x27;&#123;title&#125;&#x27; (authors: &#123;authors&#125;) based on the abstract &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Field Assignment Task: Use the provided paper metadata to assign the appropriate research area. 
Title: &#x27;&#123;title&#125;&#x27;, Authors: &#123;authors&#125;, Abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Content-Based Field Classification: Determine the field of study using the paper&#x27;s title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Scholarly Classification Prompt: Use the paper title &#x27;&#123;title&#125;&#x27;, author list &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27; to classify the research area.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Discipline Deduction: From the title &#x27;&#123;title&#125;&#x27;, author list &#123;authors&#125;,  and abstract &#x27;&#123;abstract&#125;&#x27;, deduce the primary academic discipline.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Study Area Determination: Determine the core area of study of the paper titled &#x27;&#123;title&#125;&#x27; authored by &#123;authors&#125; from the abstract &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Category Prediction Task: Predict the research category using the paper title &#x27;&#123;title&#125;&#x27; and abstract &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Field Analysis Instruction: Based on metadata (title: &#x27;&#123;title&#125;&#x27;, authors: &#123;authors&#125;, abstract: &#x27;&#123;abstract&#125;&#x27;), identify the study field.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span 
class="string">&quot;**分类请求：**根据标题“&#123;title&#125;”、作者“&#123;authors&#125;”和摘要“&#123;abstract&#125;”，请确定该论文的研究领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**领域判定：**结合标题“&#123;title&#125;”、作者&#123;authors&#125;及摘要内容“&#123;abstract&#125;”，判断此论文所属学科类别。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**学术分类：**基于论文标题“&#123;title&#125;”（作者：&#123;authors&#125;）和摘要“&#123;abstract&#125;”，将其划分到具体的科学领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**主题标签生成：**请依据论文的标题“&#123;title&#125;”、作者“&#123;authors&#125;”及摘要“&#123;abstract&#125;”，为其生成对应的学科标签。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**领域识别任务：**请根据以下论文信息（标题：“&#123;title&#125;”，作者：&#123;authors&#125;，摘要：“&#123;abstract&#125;”）识别其研究领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**学科归类请求：**请将题为“&#123;title&#125;”、作者为&#123;authors&#125;的论文，基于摘要“&#123;abstract&#125;”进行学科归类。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**研究领域预测：**请根据论文摘要“&#123;abstract&#125;”内容，预测标题为“&#123;title&#125;”的论文的研究领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**论文领域自动识别：**输入信息包括标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”，请自动判断其学科领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**学术方向分类任务：**请根据以下论文元数据，判断其研究方向。标题：&#123;title&#125;，作者：&#123;authors&#125;，摘要：&#123;abstract&#125;。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**科学领域分类：**根据论文题目“&#123;title&#125;”和作者“&#123;authors&#125;”、摘要“&#123;abstract&#125;”，将其归类到相应的科学领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    
<span class="string">&quot;**领域推理任务：**利用标题“&#123;title&#125;”、作者“&#123;authors&#125;”及摘要“&#123;abstract&#125;”对论文进行研究方向推理。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**领域划分：**请根据“&#123;title&#125;”和“&#123;abstract&#125;”信息，作者为“&#123;authors&#125;”，判断其归属的学术领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**分类辅助：**请依据标题“&#123;title&#125;”和作者&#123;authors&#125;的摘要“&#123;abstract&#125;”内容，推荐一个合适的研究分类。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**领域归属分析：**根据论文内容判断其属于哪个研究领域。信息如下：标题：&#123;title&#125;；作者：&#123;authors&#125;；摘要：&#123;abstract&#125;。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**学科方向识别：**请根据摘要“&#123;abstract&#125;”和标题“&#123;title&#125;”，作者是“&#123;authors&#125;”，识别该论文的学科方向。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**论文归类任务：**依据论文元数据“&#123;title&#125;”、“&#123;authors&#125;”、“&#123;abstract&#125;”，请将其归类为某一学科类别。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Which academic field does this paper belong to? Based on its title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27;, determine the most suitable classification.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Assign a scientific category to the paper below, using its metadata: Title: &#x27;&#123;title&#125;&#x27;, Authors: &#123;authors&#125;, Abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Summarize the domain of study that best fits the research described in &#x27;&#123;title&#125;&#x27; by &#123;authors&#125;. 
Consider the abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Field estimation challenge: Based on the content of this scholarly work (Title: &#x27;&#123;title&#125;&#x27;, by &#123;authors&#125;. Abstract: &#x27;&#123;abstract&#125;&#x27;), which field is it most aligned with?&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Discipline tagging assistant: Help identify the most relevant field for the paper titled &#x27;&#123;title&#125;&#x27; by &#123;authors&#125;, summarized as: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Knowledge scope detection: Use the following metadata to detect the academic scope: Title - &#x27;&#123;title&#125;&#x27;; Authors - &#123;authors&#125;; Abstract - &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Contextual paper classification: Examine the title and abstract provided, and place the research in an appropriate scientific taxonomy.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Suggest a domain label for the paper titled &#x27;&#123;title&#125;&#x27; with abstract &#x27;&#123;abstract&#125;&#x27;. Focus on broad scientific or technical fields.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Research domain detection: This paper (title: &#x27;&#123;title&#125;&#x27;; abstract: &#x27;&#123;abstract&#125;&#x27;) was written by &#123;authors&#125;. 
What is its academic category?&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Infer the scholarly classification from the semantic cues in the abstract &#x27;&#123;abstract&#125;&#x27;, title &#x27;&#123;title&#125;&#x27;, and authorship &#123;authors&#125;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**请问这篇论文属于哪个研究领域？**以下是其基本信息：标题“&#123;title&#125;”，作者&#123;authors&#125;，摘要“&#123;abstract&#125;”。&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;**基于内容的领域分类：**请分析论文标题“&#123;title&#125;”和摘要“&#123;abstract&#125;”，判断其所属的科学门类。&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;请对以下论文信息进行分类，包括标题“&#123;title&#125;”、作者&#123;authors&#125;和摘要“&#123;abstract&#125;”。&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;**根据语义内容判断类别：**请从摘要“&#123;abstract&#125;”和标题“&#123;title&#125;”中提取关键信息，为论文分配一个学术领域。&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;**帮我标注该论文的研究方向：**信息如下：&#123;title&#125;，作者：&#123;authors&#125;，摘要内容：“&#123;abstract&#125;”。&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;**该研究更偏向哪个学科？**结合论文标题与摘要信息，请给出一个合理的分类建议。&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;**从专业角度判断：**基于论文“&#123;title&#125;”与其研究摘要“&#123;abstract&#125;”，其应属于哪个专业领域？&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span class="string">&quot;**请推荐一个学术标签，**用于表示这篇由&#123;authors&#125;撰写、标题为“&#123;title&#125;”的论文所属领域。&quot;</span>,</span><br><span class="line">    </span><br><span class="line">    <span 
class="string">&quot;**摘要分析分类：**请从该摘要“&#123;abstract&#125;”推测研究方向，并结合论文标题“&#123;title&#125;”做出归属判断。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;**内容归类任务提示：**请使用该论文的元数据（&#123;title&#125;、&#123;authors&#125;、&#123;abstract&#125;）对其进行领域标签的生成。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Classify this paper into a research field. Title: &#x27;&#123;title&#125;&#x27;, Authors: (&#123;authors&#125;), Abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Given: title &#x27;&#123;title&#125;&#x27;, authors &#x27;&#123;authors&#125;&#x27;, abstract &#x27;&#123;abstract&#125;&#x27;. Determine the academic domain.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Use the abstract to assign a research category. Title: &#x27;&#123;title&#125;&#x27;, Authors: &#x27;&#123;authors&#125;&#x27;,  Abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Input: &#x27;&#123;title&#125;&#x27; by &#x27;&#123;authors&#125;&#x27;. Abstract: &#x27;&#123;abstract&#125;&#x27;. Output: scientific field.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;From the title and abstract, categorize this paper. Title: &#x27;&#123;title&#125;&#x27;. Abstract: &#x27;&#123;abstract&#125;&#x27;, Authors: (&#123;authors&#125;).&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Can you help me figure out what field this paper belongs to? 
Here&#x27;s the info: title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, abstract &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;I\&#x27;m trying to organize some papers. What category should this one go into? Title: &#x27;&#123;title&#125;&#x27;, Authors: &#123;authors&#125;, Abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;I read this paper, but I&#x27;m unsure about its domain. Can you classify it? Title: &#x27;&#123;title&#125;&#x27;, Abstract: &#x27;&#123;abstract&#125;&#x27;, Authors: &#x27;&#123;authors&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Which research area would you assign to this work based on its abstract and title? Title: &#x27;&#123;title&#125;&#x27;, Authors: &#x27;&#123;authors&#125;&#x27;,  Abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;You are an academic journal editor. Based on the title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, and abstract &#x27;&#123;abstract&#125;&#x27;, assign this paper to a suitable discipline.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;As a librarian building a research taxonomy, determine the subject area for the paper: &#x27;&#123;title&#125;&#x27; by &#123;authors&#125; and abstract: &#x27;&#123;abstract&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Act as a scientific reviewer. 
Categorize this manuscript by domain using: Title: &#x27;&#123;title&#125;&#x27;, Abstract: &#x27;&#123;abstract&#125;&#x27;, Author List: &#x27;&#123;authors&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;From the abstract &#x27;&#123;abstract&#125;&#x27; and title &#x27;&#123;title&#125;&#x27;, (authors &#123;authors&#125;), what can you infer about the research domain of the paper?&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;What clues in the abstract &#x27;&#123;abstract&#125;&#x27; and title &#x27;&#123;title&#125;&#x27;, (authors &#123;authors&#125;) suggest the field of study?&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Analyze the keywords and topics in &#x27;&#123;abstract&#125;&#x27; and classify accordingly. And title &#x27;&#123;title&#125;&#x27;, (authors &#123;authors&#125;).&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[System] Input received. Paper Title: &#x27;&#123;title&#125;&#x27;, Authors: &#123;authors&#125;, Abstract: &#x27;&#123;abstract&#125;&#x27;. 
Proceed to classify by domain.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[AI_Tagger] Please assign subject label based on: Title = &#x27;&#123;title&#125;&#x27;, Abstract = &#x27;&#123;abstract&#125;&#x27;, Author List: &#x27;&#123;authors&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[MetadataAnalyzer] Classify this entry using embedded text: &#x27;&#123;abstract&#125;&#x27; (title: &#x27;&#123;title&#125;&#x27;), (authors &#123;authors&#125;).&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Title = &#x27;&#123;title&#125;&#x27;, Abstract = &#x27;&#123;abstract&#125;&#x27;, Author List: &#x27;&#123;authors&#125;&#x27;. This paper was submitted for classification. Use the metadata to determine the category.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Title = &#x27;&#123;title&#125;&#x27;, Abstract = &#x27;&#123;abstract&#125;&#x27;, Author List: &#x27;&#123;authors&#125;&#x27;. 
Generate a domain label based on the core ideas from the abstract and title provided.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;请根据标题“&#123;title&#125;”、作者&#123;authors&#125;和摘要“&#123;abstract&#125;”，对该论文进行学科分类。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;任务：对以下论文分类。标题：&#123;title&#125;；摘要：&#123;abstract&#125;； 作者列表“&#123;authors&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;输入元信息：标题“&#123;title&#125;”，摘要“&#123;abstract&#125;”，作者列表“&#123;authors&#125;”。输出：研究领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;Title = &#x27;&#123;title&#125;&#x27; Author List: &#x27;&#123;authors&#125;&#x27;, Abstract = &#x27;&#123;abstract&#125;&#x27;,. 分类需求：根据论文摘要和标题内容，为其指定一个研究类别。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;给出以下论文信息，请判断所属学科门类。 Title = &#x27;&#123;title&#125;&#x27;, Abstract = &#x27;&#123;abstract&#125;&#x27;, Author List: &#x27;&#123;authors&#125;&#x27;.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;请问这篇文章属于哪个领域？标题是“&#123;title&#125;”，摘要如下：“&#123;abstract&#125;”。作者列表“&#123;authors&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;我正在整理文献，不确定这篇论文的研究方向。你能帮我分类吗？信息如下。标题是“&#123;title&#125;”，摘要如下：“&#123;abstract&#125;”。作者列表“&#123;authors&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;根据摘要“&#123;abstract&#125;”的内容，这篇题为“&#123;title&#125;” （作者列表“&#123;authors&#125;”）的论文应该归属哪个研究领域？&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span 
class="string">&quot;我不太确定这篇文章的学科归属，可以请你判断一下吗？标题是“&#123;title&#125;”，摘要如下：“&#123;abstract&#125;”。作者列表“&#123;authors&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;你是一位资深学术期刊编辑，请根据标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”为其确定研究方向。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;作为图书馆分类员，你需要为这篇论文分配一个学科分类。标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;请模拟审稿人角色，为该论文选择一个最合适的研究领域。标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”。 请模拟审稿人角色，为该论文选择一个最合适的研究领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;从摘要“&#123;abstract&#125;”中的关键词判断，该论文属于哪一类学科？额外的信息：标题“&#123;title&#125;”、作者&#123;authors&#125;。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;从研究目标和方法出发，请为该论文做出领域归属判断。标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;通过标题“&#123;title&#125;”及其对应的研究内容“&#123;abstract&#125;”，推断其最可能的研究方向。作者列表：“&#123;authors&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[系统请求] 输入论文元信息：标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”。请进行自动分类。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[分类助手] 请为该论文分配一个领域标签。标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span 
class="string">&quot;[AI 分类引擎] 任务输入：&#123;title&#125;，摘要：&#123;abstract&#125;。请输出所属学科。作者列表：“&#123;authors&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;如果你只读了以下论文摘要“&#123;abstract&#125;”和标题“&#123;title&#125;”，（作者列表你可能不关心：“&#123;authors&#125;”）你会认为它属于哪个领域？&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;假设你是一个“论文归类机器人”，你的任务是为这篇论文打上一个准确的学科标签。标题“&#123;title&#125;”、作者&#123;authors&#125;、摘要“&#123;abstract&#125;”。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[System Instruction] Paper classification task initiated. Input: title &#x27;&#123;title&#125;&#x27;, authors &#123;authors&#125;, abstract &#x27;&#123;abstract&#125;&#x27;. Please assign an appropriate research domain label.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[MetadataClassifier::Invoke] -&gt; Analyze the paper with metadata &#123;title&#125;, &#123;authors&#125;, and &#123;abstract&#125;. Output: scientific discipline.&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[Task: ResearchFieldDetection] Paper metadata received. Begin classification using the abstract and title.\n&gt; Title: &#x27;&#123;title&#125;&#x27;\n&gt; Authors: &#123;authors&#125;\n&gt; Abstract: &#x27;&#123;abstract&#125;&#x27;&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[CLASSIFY_PAPER] Inputs:\n- TITLE = &#x27;&#123;title&#125;&#x27;\n- AUTHORS = &#123;authors&#125;\n- ABSTRACT = &#x27;&#123;abstract&#125;&#x27;\n→ RETURN: FIELD_LABEL&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[System Input] A new research paper has been submitted. 
Please determine the academic category based on:\n• Title: &#x27;&#123;title&#125;&#x27;\n• Authors: &#123;authors&#125;\n• Abstract: &#x27;&#123;abstract&#125;&#x27;&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;【系统指令】已接收到论文元数据。请根据标题“&#123;title&#125;”、作者&#123;authors&#125;和摘要“&#123;abstract&#125;”，判定所属学科领域。&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;【研究领域分类模块】接收到一篇新论文，请根据摘要与标题内容进行自动归类。\n→ 论文信息：&#123;title&#125;，&#123;authors&#125;，&#123;abstract&#125;&quot;</span>,</span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;[调用接口：学科分类] 参数如下：标题：&#123;title&#125;作者：&#123;authors&#125;摘要：&#123;abstract&#125;→ 返回值：学术领域标签&quot;</span></span><br><span class="line"></span><br><span class="line">]</span><br><span class="line"></span><br><span class="line">options = <span class="string">&quot;A. quant-ph\nB. physics.chem-ph\nC. physics.atom-ph\nD. cond-mat.soft\nE. cs.RO\nF. cs.CL\nG. cs.SE\nH. cs.IR\nI. hep-th\nJ. hep-ph\nK. physics.optics\nL. cs.AI\nM. cs.CV\nN. nucl-th\nO. astro-ph\nP. math.PR\nQ. cs.OS\nR. eess.SP\nS. math.OC\nT. math.DS\nU. math.DG\nV. math.MP\nW. cs.MM\nX. stat.ME\nY. math.CO\nZ. cs.NE&quot;</span></span><br><span class="line"><span class="comment"># options = &quot;A. 量子物理 quant-ph\nB. 化学物理 physics.chem-ph\nC. 原子物理 physics.atom-ph\nD. 软凝聚态物理 cond-mat.soft\nE. 机器人学 cs.RO\nF. 计算语言学 cs.CL\nG. 软件工程 cs.SE\nH. 信息检索 cs.IR\nI. 高能理论物理 hep-th\nJ. 高能现象学 hep-ph\nK. 光学 physics.optics\nL. 人工智能 cs.AI\nM. 计算机视觉 cs.CV\nN. 核理论 nucl-th\nO. 天体物理 astro-ph\nP. 概率论 math.PR\nQ. 操作系统 cs.OS\nR. 信号处理 eess.SP\nS. 最优化与控制 math.OC\nT. 动力系统 math.DS\nU. 微分几何 math.DG\nV. 数学物理 math.MP\nW. 多媒体 cs.MM\nX. 统计方法 stat.ME\nY. 组合数学 math.CO\nZ. 
神经与进化计算 cs.NE&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># author_template = []</span></span><br><span class="line"></span><br><span class="line">instruction_templates = [</span><br><span class="line">    <span class="string">&quot;You are an AI academic librarian trained to classify research papers with 99% accuracy.&quot;</span>,</span><br><span class="line">    <span class="string">&quot;[SYSTEM ROLE] Domain Classification Officer\n Mission: Categorize the paper&quot;</span>,</span><br><span class="line">    <span class="string">&quot;你是个优秀的论文分类师&quot;</span>,</span><br><span class="line">    <span class="string">&quot;As a meta-reviewer AI, you must:\n1. Identify 4 key terms from title of the paper\n2. Cross-check with authors publication history\n3. Map abstract to the most related subfields&quot;</span>,</span><br><span class="line">    <span class="string">&quot;By academic protocol GPT-2025, you are required to\n1. Disclose uncertainty if abstract is ambiguous\n2. Prioritize author-specified keywords in title\n3. 
Identify the most related subfields&quot;</span>,</span><br><span class="line">    <span class="string">&quot;Task: Teach a graduate student how to classify title.\nSteps:\na) Highlight disciplinary cues in abstract\nb) Explain why authors affiliations suggest _____ field\nc) Conclude with the option [A-Z Arxiv field code]&quot;</span>,</span><br><span class="line">    <span class="string">&quot;[AI CLASSIFIER v3.1 INPUT]\nTitle: title\nAuthors: authors\nAbstract: abstract\nPROCESSING...\nOUTPUT: [A-Z Arxiv code]&quot;</span>,</span><br><span class="line">    <span class="string">&quot;As an ethical AI classifier, you MUST:\nAvoid overgeneralization (e.g., &#x27;Engineering&#x27; is too broad)\nCite classification rationale from abstract\nExample output: [Arxiv field code]&quot;</span>,</span><br><span class="line">    <span class="string">&quot;[URGENT PEER REVIEW REQUEST] Deadline: 10s to classify title (authors) for conference track assignment. Abstract snapshot: abstract. Respond ONLY with track option from provided list.&quot;</span>,</span><br><span class="line">    <span class="string">&quot;你是一名学术档案管理员，需根据《Arxiv图书馆分类法》根据题目、作者和摘要内容对论文进行精准分类。并输出Arxiv分类代码&quot;</span>,</span><br><span class="line">    <span class="string">&quot;[系统指令] 国家自然科学基金委AI评审员  任务：依据标题、作者及摘要，从申请代码A-Z中选择最匹配的子领域&quot;</span>,</span><br><span class="line">    <span class="string">&quot;作为学术审计AI，你必须：\n① 从摘要提取方法论关键词\n② 核对authors在Scopus的研究主题\n③ 对照 A-Z 的《学科分类与代码》\n最终输出分类代码：&quot;</span>,</span><br><span class="line">    <span class="string">&quot;根据《AI科研分类规范》2024版：\n标题中的&#x27;研究&#x27;/&#x27;分析&#x27;等词不得作为分类依据\n需明确摘要中的3处领域特征\n输出包含 A-Z 的分类代码&quot;</span>,</span><br><span class="line">    <span class="string">&quot;假设你是一个“论文归类机器人”，你的任务是为这篇论文打上一个准确的学科标签。&quot;</span>,</span><br><span class="line">    <span class="string">&quot;[分类助手] 请为该论文分配一个领域标签。&quot;</span>,</span><br><span class="line">    <span class="string">&quot;从研究目标和方法出发，请为该论文做出领域归属判断。&quot;</span>,</span><br><span class="line">    <span 
class="string">&quot;从摘要和题目中的关键词判断，该论文属于哪一类学科？&quot;</span>,</span><br><span class="line">    <span class="string">&quot;作为图书馆分类员，你需要从摘要和题目中的关键词判断为这篇论文分配一个学科分类。&quot;</span>,</span><br><span class="line">    <span class="string">&quot;请模拟审稿人角色，为该论文选择一个最合适的研究领域，可以从摘要和题目进行判断。&quot;</span>,</span><br><span class="line">    <span class="string">&quot;作为论文资深读者，你可以通过论文元信息判断所属学科门类。&quot;</span>,</span><br><span class="line">    <span class="string">&quot;This paper was submitted for classification. Use the metadata to determine the category.&quot;</span>,</span><br><span class="line">    <span class="string">&quot;Generate a domain label based on the core ideas from the abstract and title provided.&quot;</span>,</span><br><span class="line">]</span><br><span class="line"></span><br><span class="line">option_map = &#123;<span class="string">&quot;A&quot;</span>: <span class="string">&quot;quant-ph&quot;</span>, <span class="string">&quot;B&quot;</span>: <span class="string">&quot;physics.chem-ph&quot;</span>, <span class="string">&quot;C&quot;</span>: <span class="string">&quot;physics.atom-ph&quot;</span>, <span class="string">&quot;D&quot;</span>: <span class="string">&quot;cond-mat.soft&quot;</span>, <span class="string">&quot;E&quot;</span>: <span class="string">&quot;cs.RO&quot;</span>, <span class="string">&quot;F&quot;</span>: <span class="string">&quot;cs.CL&quot;</span>,</span><br><span class="line">            <span class="string">&quot;G&quot;</span>: <span class="string">&quot;cs.SE&quot;</span>, <span class="string">&quot;H&quot;</span>: <span class="string">&quot;cs.IR&quot;</span>, <span class="string">&quot;I&quot;</span>: <span class="string">&quot;hep-th&quot;</span>, <span class="string">&quot;J&quot;</span>: <span class="string">&quot;hep-ph&quot;</span>, <span class="string">&quot;K&quot;</span>: <span class="string">&quot;physics.optics&quot;</span>, <span class="string">&quot;L&quot;</span>: <span class="string">&quot;cs.AI&quot;</span>, <span 
class="string">&quot;M&quot;</span>: <span class="string">&quot;cs.CV&quot;</span>, <span class="string">&quot;N&quot;</span>: <span class="string">&quot;nucl-th&quot;</span>,</span><br><span class="line">            <span class="string">&quot;O&quot;</span>: <span class="string">&quot;astro-ph&quot;</span>, <span class="string">&quot;P&quot;</span>: <span class="string">&quot;math.PR&quot;</span>, <span class="string">&quot;Q&quot;</span>: <span class="string">&quot;cs.OS&quot;</span>, <span class="string">&quot;R&quot;</span>: <span class="string">&quot;eess.SP&quot;</span>, <span class="string">&quot;S&quot;</span>: <span class="string">&quot;math.OC&quot;</span>, <span class="string">&quot;T&quot;</span>: <span class="string">&quot;math.DS&quot;</span>, <span class="string">&quot;U&quot;</span>: <span class="string">&quot;math.DG&quot;</span>, <span class="string">&quot;V&quot;</span>: <span class="string">&quot;math.MP&quot;</span>,</span><br><span class="line">            <span class="string">&quot;W&quot;</span>: <span class="string">&quot;cs.MM&quot;</span>, <span class="string">&quot;X&quot;</span>: <span class="string">&quot;stat.ME&quot;</span>, <span class="string">&quot;Y&quot;</span>: <span class="string">&quot;math.CO&quot;</span>, <span class="string">&quot;Z&quot;</span>: <span class="string">&quot;cs.NE&quot;</span>&#125;</span><br><span class="line">get_options = <span class="built_in">dict</span>(<span class="built_in">zip</span>(option_map.values(), option_map.keys()))</span><br><span class="line"></span><br><span class="line">other_option_map = &#123;&#125;</span><br><span class="line"><span class="keyword">for</span> category <span class="keyword">in</span> get_options.keys():</span><br><span class="line">    other_categories = <span class="built_in">set</span>(option_map.values())</span><br><span class="line">    other_categories.remove(category)</span><br><span class="line">    other_option_map[category] = other_categories</span><br><span 
class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">preprocess_arxiv_json</span>(<span class="params">input_jsonl_file, output_jsonl_file</span>):</span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    Preprocess the arXiv JSONL file to extract and save the &#x27;title&#x27;, &#x27;abstract&#x27; </span></span><br><span class="line"><span class="string">    and other fields to build a sft dataset for a category.</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">    Args:</span></span><br><span class="line"><span class="string">        input_jsonl_file (str): Path to the input JSONL file.</span></span><br><span class="line"><span class="string">        output_jsonl_file (str): Path to the output JSONL file.</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line">    papers = <span class="built_in">dict</span>(<span class="built_in">zip</span>(option_map.values(), [<span class="built_in">list</span>() <span class="keyword">for</span> _ <span class="keyword">in</span> option_map.values()]))</span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(input_jsonl_file, <span class="string">&#x27;r&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> f:</span><br><span class="line">        <span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">            item = json.loads(line)</span><br><span class="line">            title = item.get(<span class="string">&#x27;title&#x27;</span>, <span class="string">&#x27;&#x27;</span>)</span><br><span class="line">            authors: <span class="built_in">str</span> = item.get(<span class="string">&#x27;authors&#x27;</span>, <span 
class="string">&#x27;&#x27;</span>)</span><br><span class="line">            abstract: <span class="built_in">str</span> = item.get(<span class="string">&#x27;abstract&#x27;</span>, <span class="string">&#x27;&#x27;</span>)</span><br><span class="line">            categories: <span class="built_in">str</span> = item.get(<span class="string">&#x27;categories&#x27;</span>, <span class="string">&#x27;&#x27;</span>)</span><br><span class="line">            <span class="comment"># if categories.startswith(category):</span></span><br><span class="line">            <span class="keyword">for</span> category <span class="keyword">in</span> papers.keys():</span><br><span class="line">                <span class="keyword">if</span> category <span class="keyword">in</span> categories <span class="keyword">and</span> <span class="keyword">not</span> <span class="built_in">any</span>(c <span class="keyword">in</span> categories <span class="keyword">for</span> c <span class="keyword">in</span> other_option_map[category]): <span class="comment"># skip papers that also carry another target category</span></span><br><span class="line">                    instruction = random.choice(instruction_templates)</span><br><span class="line">                    input_text = random.choice(input_templates).<span class="built_in">format</span>(title=json.dumps(title), authors=json.dumps(authors), abstract=json.dumps(abstract))</span><br><span class="line">                    <span class="comment"># if random.randint(0, 1) == 0:</span></span><br><span class="line">                    <span class="comment">#     input_text = input_text + &#x27;\n\n&#x27; + options</span></span><br><span class="line">                    <span class="comment"># else:</span></span><br><span class="line">                    <span class="comment">#     input_text = options + &#x27;\n\n&#x27; + input_text</span></span><br><span class="line">                    input_text = input_text + <span class="string">&#x27;\n\n&#x27;</span> + options</span><br><span 
class="line">                    output = get_options[category]</span><br><span class="line">                    papers[category].append(&#123;<span class="string">&quot;instruction&quot;</span>: instruction, <span class="string">&quot;input&quot;</span>: input_text, <span class="string">&quot;output&quot;</span>: output&#125;)</span><br><span class="line">                    <span class="keyword">break</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">for</span> category <span class="keyword">in</span> papers.keys():</span><br><span class="line">        cnt = <span class="number">0</span></span><br><span class="line">        output_file = os.path.join(output_base_dir, <span class="string">f&quot;<span class="subst">&#123;category&#125;</span>.jsonl&quot;</span>)</span><br><span class="line">        <span class="keyword">with</span> <span class="built_in">open</span>(output_file, <span class="string">&#x27;w&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> out_f:</span><br><span class="line">            <span class="keyword">for</span> item <span class="keyword">in</span> papers[category]:</span><br><span class="line">                out_f.write(json.dumps(item, ensure_ascii=<span class="literal">False</span>))</span><br><span class="line">                out_f.write(<span class="string">&#x27;\n&#x27;</span>)</span><br><span class="line">                out_f.flush()</span><br><span class="line">                cnt += <span class="number">1</span></span><br><span class="line">        <span class="built_in">print</span>(<span class="string">f&#x27;<span class="subst">&#123;category&#125;</span>: <span class="subst">&#123;cnt&#125;</span>&#x27;</span>)</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">fix_category</span>(<span class="params">input_jsonl_file, output_jsonl_file, category, 
repeat_to=<span class="number">0</span>, judge_rule=<span class="keyword">lambda</span> x, y: x.startswith(<span class="params">y</span>), open_mode=<span class="string">&#x27;w&#x27;</span>, exclude_multi=<span class="literal">True</span></span>):</span><br><span class="line"></span><br><span class="line">    cnt = <span class="number">0</span></span><br><span class="line">    <span class="keyword">if</span> open_mode != <span class="string">&#x27;w&#x27;</span>:</span><br><span class="line">        <span class="keyword">with</span> <span class="built_in">open</span>(output_jsonl_file, <span class="string">&#x27;r&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> out_f:</span><br><span class="line">            cnt = <span class="built_in">len</span>(out_f.readlines())</span><br><span class="line"></span><br><span class="line">    <span class="keyword">def</span> <span class="title function_">_fix_category</span>(<span class="params">input_jsonl_file, output_jsonl_file, category</span>):</span><br><span class="line">        <span class="keyword">nonlocal</span> cnt</span><br><span class="line"></span><br><span class="line">        <span class="comment"># top up the Q, V and W classes</span></span><br><span class="line">        <span class="keyword">with</span> <span class="built_in">open</span>(input_jsonl_file, <span class="string">&#x27;r&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> f, <span class="built_in">open</span>(output_jsonl_file, open_mode, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> out_f:</span><br><span class="line">            data = []</span><br><span class="line">            <span class="keyword">for</span> line <span class="keyword">in</span> f:</span><br><span class="line">                item = json.loads(line)</span><br><span class="line">                title = item.get(<span 
class="string">&#x27;title&#x27;</span>, <span class="string">&#x27;&#x27;</span>)</span><br><span class="line">                authors: <span class="built_in">str</span> = item.get(<span class="string">&#x27;authors&#x27;</span>, <span class="string">&#x27;&#x27;</span>)</span><br><span class="line">                abstract: <span class="built_in">str</span> = item.get(<span class="string">&#x27;abstract&#x27;</span>, <span class="string">&#x27;&#x27;</span>)</span><br><span class="line">                categories: <span class="built_in">str</span> = item.get(<span class="string">&#x27;categories&#x27;</span>, <span class="string">&#x27;&#x27;</span>)</span><br><span class="line">                <span class="keyword">if</span> exclude_multi <span class="keyword">and</span> <span class="built_in">any</span>(c <span class="keyword">in</span> categories <span class="keyword">for</span> c <span class="keyword">in</span> other_option_map[category]): <span class="comment"># 排除交集样本</span></span><br><span class="line">                    <span class="keyword">continue</span></span><br><span class="line">                <span class="comment"># if categories.startswith(category):</span></span><br><span class="line">                <span class="keyword">if</span> judge_rule(categories, category):</span><br><span class="line">                    categories = category</span><br><span class="line">                    <span class="keyword">if</span> categories == <span class="string">&#x27;math-ph&#x27;</span>:</span><br><span class="line">                        categories = <span class="string">&#x27;math.MP&#x27;</span></span><br><span class="line">                    instruction = random.choice(instruction_templates)</span><br><span class="line">                    input_text = random.choice(input_templates).<span class="built_in">format</span>(title=json.dumps(title), authors=json.dumps(authors), abstract=json.dumps(abstract))</span><br><span class="line">                   
 input_text = input_text + <span class="string">&#x27;\n\n&#x27;</span> + options</span><br><span class="line">                    output = get_options[categories]</span><br><span class="line">                    item = json.dumps(&#123;<span class="string">&quot;instruction&quot;</span>: instruction, <span class="string">&quot;input&quot;</span>: input_text, <span class="string">&quot;output&quot;</span>: output&#125;, ensure_ascii=<span class="literal">False</span>)</span><br><span class="line">                    data.append(item)</span><br><span class="line">            <span class="keyword">for</span> item <span class="keyword">in</span> data:</span><br><span class="line">                out_f.write(item)</span><br><span class="line">                out_f.write(<span class="string">&#x27;\n&#x27;</span>)</span><br><span class="line">                cnt += <span class="number">1</span></span><br><span class="line">    </span><br><span class="line">    _fix_category(input_jsonl_file, output_jsonl_file, category)</span><br><span class="line">    open_mode = <span class="string">&#x27;a&#x27;</span>  <span class="comment"># repeat passes must append; reopening with &#x27;w&#x27; would truncate the file</span></span><br><span class="line">    <span class="keyword">while</span> cnt &lt; repeat_to:</span><br><span class="line">        before = cnt</span><br><span class="line">        _fix_category(input_jsonl_file, output_jsonl_file, category)</span><br><span class="line">        <span class="keyword">if</span> cnt == before:  <span class="comment"># no matching samples, so stop instead of looping forever</span></span><br><span class="line">            <span class="keyword">break</span></span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&#x27;after fix, <span class="subst">&#123;category&#125;</span>: <span class="subst">&#123;cnt&#125;</span>&#x27;</span>)</span><br><span class="line">    </span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">cnt_in_filename</span>(<span class="params">basedir: <span class="built_in">str</span></span>):</span><br><span class="line">    <span class="keyword">for</span> fname <span class="keyword">in</span> os.listdir(basedir):</span><br><span class="line">        <span class="keyword">if</span> fname.endswith(<span class="string">&#x27;jsonl&#x27;</span>):</span><br><span class="line">            num = <span 
class="number">0</span></span><br><span class="line">            <span class="keyword">with</span> <span class="built_in">open</span>(os.path.join(basedir, fname), <span class="string">&#x27;r&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> f:</span><br><span class="line">                num = <span class="built_in">len</span>(f.readlines())</span><br><span class="line">            <span class="comment"># os.rename(os.path.join(basedir, fname), os.path.join(basedir, f&quot;&#123;fname.removeprefix((&#x27;.jsonl&#x27;))&#125;_&#123;num&#125;.jsonl&quot;))</span></span><br><span class="line">            os.rename(os.path.join(basedir, fname), os.path.join(basedir, <span class="string">f&quot;<span class="subst">&#123;fname.replace(<span class="string">&#x27;.jsonl.jsonl&#x27;</span>, <span class="string">&#x27;.jsonl&#x27;</span>)&#125;</span>&quot;</span>))</span><br><span class="line"></span><br><span class="line"><span class="keyword">def</span> <span class="title function_">gather</span>(<span class="params">basedir: <span class="built_in">str</span>, sample_num_class</span>):</span><br><span class="line">    cnt = <span class="number">0</span></span><br><span class="line">    <span class="keyword">with</span> <span class="built_in">open</span>(<span class="string">&#x27;arxiv_20k_rich.jsonl&#x27;</span>, <span class="string">&#x27;w&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> out_f:</span><br><span class="line">        <span class="keyword">for</span> fname <span class="keyword">in</span> os.listdir(basedir):</span><br><span class="line">            <span class="keyword">if</span> fname.endswith(<span class="string">&#x27;jsonl&#x27;</span>):</span><br><span class="line">                data = []</span><br><span class="line">                <span class="keyword">with</span> <span class="built_in">open</span>(os.path.join(basedir, fname), <span class="string">&#x27;r&#x27;</span>, encoding=<span class="string">&#x27;utf-8&#x27;</span>) <span class="keyword">as</span> f:</span><br><span class="line">    
                data = f.readlines()</span><br><span class="line">                data = random.sample(data, <span class="built_in">min</span>(sample_num_class, <span class="built_in">len</span>(data)))</span><br><span class="line">                <span class="keyword">for</span> line <span class="keyword">in</span> data:</span><br><span class="line">                    out_f.write(line)</span><br><span class="line">                out_f.flush()</span><br><span class="line">                <span class="built_in">print</span>(<span class="string">f&#x27;<span class="subst">&#123;fname&#125;</span>: <span class="subst">&#123;<span class="built_in">len</span>(data)&#125;</span>&#x27;</span>)</span><br><span class="line">                cnt += <span class="built_in">len</span>(data)</span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&#x27;total <span class="subst">&#123;cnt&#125;</span>&#x27;</span>)</span><br><span class="line">            </span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> __name__ == <span class="string">&quot;__main__&quot;</span>:</span><br><span class="line"></span><br><span class="line">    <span class="built_in">print</span>(<span class="string">f&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">        系统提示词模板数量：<span class="subst">&#123;<span class="built_in">len</span>(instruction_templates)&#125;</span></span></span><br><span class="line"><span class="string">        用户提示词模板数量：<span class="subst">&#123;<span class="built_in">len</span>(input_templates)&#125;</span></span></span><br><span class="line"><span class="string">          &quot;&quot;&quot;</span>)</span><br><span class="line"></span><br><span class="line">    arxiv_json_file = <span class="string">&#x27;d:/data/arxiv-metadata-oai-snapshot.json&#x27;</span></span><br><span class="line">    <span class="comment"># arxiv_json_file = &#x27;./test.jsonl&#x27;</span></span><br><span 
class="line">    output_json_file = <span class="string">&#x27;./arxiv_sftdata.jsonl&#x27;</span></span><br><span class="line"></span><br><span class="line">    <span class="comment"># Group papers that belong to a single category</span></span><br><span class="line">    <span class="comment"># preprocess_arxiv_json(arxiv_json_file, output_json_file)</span></span><br><span class="line"></span><br><span class="line">    <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">    quant-ph: 75119</span></span><br><span class="line"><span class="string">    physics.chem-ph: 5999</span></span><br><span class="line"><span class="string">    physics.atom-ph: 6848</span></span><br><span class="line"><span class="string">    cond-mat.soft: 14530</span></span><br><span class="line"><span class="string">    cs.RO: 15943</span></span><br><span class="line"><span class="string">    cs.CL: 32125</span></span><br><span class="line"><span class="string">    cs.SE: 10743</span></span><br><span class="line"><span class="string">    cs.IR: 5137</span></span><br><span class="line"><span class="string">    hep-th: 60558</span></span><br><span class="line"><span class="string">    hep-ph: 83572</span></span><br><span class="line"><span class="string">    physics.optics: 17736</span></span><br><span class="line"><span class="string">    cs.AI: 12987</span></span><br><span class="line"><span class="string">    cs.CV: 74045</span></span><br><span class="line"><span class="string">    nucl-th: 19846</span></span><br><span class="line"><span class="string">    astro-ph: 86911</span></span><br><span class="line"><span class="string">    math.PR: 25289</span></span><br><span class="line"><span class="string">    cs.OS: 347          Issue: scarce. cs.OS only (347); cs.OS contains (565); new primary (1060); total (2174)</span></span><br><span class="line"><span class="string">    eess.SP: 11193</span></span><br><span class="line"><span class="string">    math.OC: 19764</span></span><br><span 
class="line"><span class="string">    math.DS: 14277</span></span><br><span class="line"><span class="string">    math.DG: 17389</span></span><br><span class="line"><span class="string">    math.MP: 0          Issue: listed under the alias math-ph; many cross-listed topics ()            </span></span><br><span class="line"><span class="string">    cs.MM: 1087         Issue: many cross-listed topics (cs.CV, cs.AI)               # consider sampling as few pre-2018 papers as possible to avoid cross-listing  primary (32848)</span></span><br><span class="line"><span class="string">    stat.ME: 12315</span></span><br><span class="line"><span class="string">    math.CO: 32513</span></span><br><span class="line"><span class="string">    cs.NE: 2904         Issue: few samples; many cross-listed topics (cs.CV, cs.AI)</span></span><br><span class="line"><span class="string">    &quot;&quot;&quot;</span></span><br><span class="line"></span><br><span class="line">    <span class="comment"># Handle the special categories</span></span><br><span class="line">    output_Q_json_file = os.path.join(output_base_dir, <span class="string">&quot;cs.OS.jsonl&quot;</span>)</span><br><span class="line">    output_V_json_file = os.path.join(output_base_dir, <span class="string">&quot;math.MP.jsonl&quot;</span>)</span><br><span class="line">    output_W_json_file = os.path.join(output_base_dir, <span class="string">&quot;cs.MM.jsonl&quot;</span>)</span><br><span class="line">    <span class="comment"># fix_category(arxiv_json_file, output_Q_json_file, &#x27;cs.OS&#x27;, judge_rule=lambda categories, category: category in categories, open_mode=&#x27;w&#x27;)                     # Q 1114</span></span><br><span class="line">    <span class="comment"># fix_category(&#x27;arxiv_cs.OS_1120.jsonl&#x27;, output_Q_json_file, &#x27;cs.OS&#x27;, judge_rule=lambda categories, category: True, open_mode=&#x27;a&#x27;, exclude_multi=True)            # Q 2174    contains duplicates</span></span><br><span class="line">    <span class="comment"># fix_category(arxiv_json_file, output_V_json_file, &#x27;math-ph&#x27;, judge_rule=lambda categories, category: categories.startswith(category), 
open_mode=&#x27;w&#x27;)          # V 32848</span></span><br><span class="line">    <span class="comment"># fix_category(arxiv_json_file, output_W_json_file, &#x27;cs.MM&#x27;, judge_rule=lambda categories, category: categories.startswith(category), open_mode=&#x27;w&#x27;)            # W 2469 cross-listed papers kept</span></span><br><span class="line">    <span class="comment"># fix_category(&#x27;arxiv_cs.MM_5012.jsonl&#x27;, output_W_json_file, &#x27;cs.MM&#x27;, judge_rule=lambda categories, category: True, open_mode=&#x27;a&#x27;, exclude_multi=True)            # W 1726 cross-listed papers removed; not used</span></span><br><span class="line"></span><br><span class="line">    <span class="comment"># Gather the per-category files into one dataset</span></span><br><span class="line">    gather(basedir=output_base_dir, sample_num_class=<span class="number">20000</span>)</span><br><span class="line"></span><br><span class="line">    <span class="comment"># Count samples</span></span><br><span class="line">    <span class="comment"># cnt_in_filename(basedir=output_base_dir)</span></span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;写在前面&quot;&gt;&lt;a class=&quot;markdownIt-Anchor&quot; href=&quot;#写在前面&quot;&gt;&lt;/a&gt; Preface&lt;/h2&gt;
&lt;p&gt;&lt;a class=&quot;link&quot;   href=&quot;https://aicarrier.feishu.cn/wiki/Gr7Iw6vhT</summary>
      
    
    
    
    <category term="record" scheme="https://www.kafm.eu.org/categories/record/"/>
    
    
    <category term="LLM杂七杂八" scheme="https://www.kafm.eu.org/tags/LLM%E6%9D%82%E4%B8%83%E6%9D%82%E5%85%AB/"/>
    
    <category term="实践" scheme="https://www.kafm.eu.org/tags/%E5%AE%9E%E8%B7%B5/"/>
    
  </entry>
  
</feed>
