# Multi-Format Document Support Plan Based on Docling

## Context

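The current system only supports PDF upload (text is extracted with PyMuPDF). It needs to be extended to support **pdf, doc, docx, xls, xlsx, ppt, pptx, wps, et, ofd, images, txt, markdown**, and similar formats. Requirements:
- Chunk by semantics and by the document's natural structure; each chunk keeps structural information such as page numbers and heading hierarchy
- No backward compatibility is required; old document data is cleared
- PDFs may be scans, so RapidOCR must be enabled
- Add a Java conversion service (Spring Boot 2.7 + Java 8 + Spire.Office 10.6)
- On upload, check the MD5 first; if it already exists, skip the content processing pipeline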

## Architecture Overview

```
Upload file → detect format → [convert?] ─YES→ call Java conversion service → docx/xlsx/pptx/pdf
                                  │                                                   ↓
                                  └─NO──────────────────────────────────→ Docling DocumentConverter
                                                                                      ↓
                                                                          HybridChunker → DocChunk[]
                                                                                      ↓
                                                                          map to Chunk → embed → OpenSearch
```

### Format Handling Matrix

| Format | Handling |
|------|---------|
| PDF | Parsed directly by Docling (RapidOCR enabled for scans) |
| DOCX | Parsed directly by Docling |
| PPTX | Parsed directly by Docling |
| XLSX | Parsed directly by Docling |
| Images (PNG/JPG/TIFF) | Docling built-in OCR |
| Markdown | Parsed directly by Docling |
| TXT | Read directly + existing DocumentChunker (Docling not needed) |
| DOC | Java conversion service → DOCX → Docling |
| PPT | Java conversion service → PPTX → Docling |
| XLS | Java conversion service → XLSX → Docling |
| WPS | Java conversion service → DOCX → Docling |
| ET | Java conversion service → XLSX → Docling |
| OFD | Java conversion service → PDF → Docling |
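The matrix can be read as a small routing table keyed by file extension. The sketch below is one way to express it; `plan_route` and the route names are hypothetical, and treating `jpeg` and `bmp` like the other image formats is an assumption taken from the config whitelist rather than from the matrix itself.

```python
from __future__ import annotations

from pathlib import Path

# Formats the matrix sends straight to Docling.
# "jpeg" and "bmp" are assumed to follow the image route (they appear in the config whitelist).
DOCLING_NATIVE = {
    "pdf", "docx", "pptx", "xlsx", "md", "markdown",
    "png", "jpg", "jpeg", "tiff", "bmp",
}

# Formats that go through the Java conversion service first, with their target format.
CONVERSION_TARGETS = {
    "doc": "docx", "wps": "docx",
    "ppt": "pptx",
    "xls": "xlsx", "et": "xlsx",
    "ofd": "pdf",
}


def plan_route(filename: str) -> tuple[str, str | None]:
    """Return (route, target_format) for a file name, based on its extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext == "txt":
        return ("plain_text", None)          # read directly, existing DocumentChunker
    if ext in CONVERSION_TARGETS:
        return ("convert_then_docling", CONVERSION_TARGETS[ext])
    if ext in DOCLING_NATIVE:
        return ("docling", None)
    raise ValueError(f"unsupported file type: {ext!r}")
```

This routes on the extension only; the plan also calls for checking the real format, which would replace the `suffix` lookup.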

---

## Part A: Java Document Conversion Service

### Tech Stack
- Spring Boot 2.7.x / Java 8
- Spire.Office for Java 10.6 (from the e-iceblue Maven repository)
- Standalone service, called by the Python backend over an HTTP API

### Project Structure
```
converter/
├── pom.xml
├── src/main/java/com/zmrag/converter/
│   ├── ConverterApplication.java          # Spring Boot entry point
│   ├── controller/
│   │   └── ConvertController.java         # REST API: POST /api/convert, GET /api/convert/logs
│   ├── service/
│   │   └── DocumentConvertService.java    # Conversion logic (Spire.Office) + log writes to DB
│   ├── entity/
│   │   └── ConvertLog.java               # JPA entity for the convert_log table
│   ├── repository/
│   │   └── ConvertLogRepository.java     # Spring Data JPA repository
│   ├── model/
│   │   ├── ConvertResult.java            # Conversion response model
│   │   └── PageResult.java              # Paged query response
│   └── util/
│       └── FileFormatDetector.java        # Real-format check (magic bytes, not just the extension)
├── src/main/resources/
│   ├── application.yml                    # Port, MySQL, file paths, and other settings
│   └── schema.sql                         # convert_log DDL (spring.jpa.hibernate.ddl-auto=update would also create the table)
└── Dockerfile
```

### API Design
```
POST /api/convert
Content-Type: multipart/form-data
- file: the original file
- targetFormat: target format (docx/xlsx/pptx/pdf)

Response 200:
{
  "success": true,
  "originalFormat": "doc",
  "targetFormat": "docx",
  "originalSize": 102400,
  "outputSize": 98304,
  "durationMs": 1523,
  "outputPath": "/data/files/converted/abc123.docx"
}
```
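On the Python side, the call to `POST /api/convert` could look roughly like the following. This is a minimal sketch using only the standard library; `build_multipart` and `convert_file` are hypothetical helper names, and error handling beyond the `success` flag is left out. It assumes file names contain no double quotes.

```python
from __future__ import annotations

import json
import uuid
from pathlib import Path
from urllib import request


def build_multipart(file_path: Path, target_format: str) -> tuple[bytes, str]:
    """Encode the `file` and `targetFormat` fields as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = [
        f"--{boundary}".encode(),
        b'Content-Disposition: form-data; name="targetFormat"',
        b"",
        target_format.encode(),
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{file_path.name}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        file_path.read_bytes(),
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_file(base_url: str, file_path: Path, target_format: str,
                 timeout: float = 120.0) -> dict:
    """POST the file to the converter and return the parsed JSON response."""
    body, content_type = build_multipart(file_path, target_format)
    req = request.Request(
        f"{base_url.rstrip('/')}/api/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    if not result.get("success"):
        raise RuntimeError(f"conversion failed: {result}")
    return result  # includes "outputPath", per the response shape above
```

Because the service returns `outputPath` rather than the file bytes, this only works if both services see the same `/data/files` volume.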

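### Logging (Log + MySQL Persistence)

Each conversion is written to the log and **persisted to MySQL**:
- Original document path/name and real format (verified by magic bytes)
- Output document path
- Conversion duration (ms)
- Input file size and output file size
- Conversion status (success / failed) and failure reason
- Creation time

#### MySQL Table Schema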
```sql
CREATE TABLE convert_log (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    original_filename VARCHAR(500) NOT NULL COMMENT 'original file name',
    original_format VARCHAR(20) NOT NULL COMMENT 'real format (detected from magic bytes)',
    target_format VARCHAR(20) NOT NULL COMMENT 'target format',
    original_size BIGINT NOT NULL COMMENT 'original file size (bytes)',
    output_size BIGINT DEFAULT NULL COMMENT 'output file size (bytes)',
    output_path VARCHAR(1000) DEFAULT NULL COMMENT 'output file path',
    duration_ms BIGINT NOT NULL COMMENT 'conversion duration (ms)',
    status VARCHAR(20) NOT NULL DEFAULT 'success' COMMENT 'success/failed',
    error_message TEXT DEFAULT NULL COMMENT 'failure reason',
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_created_at (created_at),
    INDEX idx_status (status)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='document conversion records';
```

#### Query Endpoint
```
GET /api/convert/logs?page=1&size=20&status=&startDate=&endDate=
Response 200:
{
  "total": 156,
  "page": 1,
  "size": 20,
  "records": [
    {
      "id": 1,
      "originalFilename": "通知.doc",
      "originalFormat": "doc",
      "targetFormat": "docx",
      "originalSize": 102400,
      "outputSize": 98304,
      "durationMs": 1523,
      "status": "success",
      "createdAt": "2026-03-11T10:30:00"
    }
  ]
}
```
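The empty `status=`, `startDate=` and `endDate=` parameters in the example suggest optional filters. A caller might build the query URL as sketched below; `build_logs_url` is a hypothetical helper, and passing dates as plain `YYYY-MM-DD` strings is an assumption, since the plan does not fix the date format.

```python
from __future__ import annotations

from urllib.parse import urlencode


def build_logs_url(base_url: str, page: int = 1, size: int = 20,
                   status: str | None = None,
                   start_date: str | None = None,
                   end_date: str | None = None) -> str:
    """Build the GET /api/convert/logs URL, leaving out filters that are not set."""
    params = {
        "page": page,
        "size": size,
        "status": status,
        "startDate": start_date,
        "endDate": end_date,
    }
    # Drop unset filters instead of sending empty values.
    query = urlencode({k: v for k, v in params.items() if v not in (None, "")})
    return f"{base_url.rstrip('/')}/api/convert/logs?{query}"
```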

#### MySQL Configuration
```
jdbc:mysql://localhost:3306/zm_tag?zeroDateTimeBehavior=convertToNull
User: root / root
```

### Docker Configuration
```yaml
# Added to docker-compose.yml
doc-converter:
  build:
    context: ./converter
    dockerfile: Dockerfile
  container_name: zm-rag-doc-converter
  ports:
    - "8901:8080"
  volumes:
    - D:/data/files:/data/files
  networks:
    - zm-rag-net
  restart: unless-stopped
  healthcheck:
    test: ["CMD-SHELL", "curl -sf http://localhost:8080/actuator/health || exit 1"]
    interval: 30s
    timeout: 10s
    retries: 5
```

### pom.xml Core Dependencies
```xml
<repositories>
  <repository>
    <id>com.e-iceblue</id>
    <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
  </repository>
</repositories>

<dependencies>
  <!-- Web -->
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
  </dependency>
  <!-- JPA + MySQL -->
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
  </dependency>
  <dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <scope>runtime</scope>
  </dependency>
  <!-- Spire.Office document conversion -->
  <dependency>
    <groupId>e-iceblue</groupId>
    <artifactId>spire.office</artifactId>
    <version>10.6.0</version>
  </dependency>
  <!-- Apache Tika for magic-bytes format detection -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.28.5</version>
  </dependency>
  <!-- Actuator health check -->
  <dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
  </dependency>
</dependencies>
```

---

## Part B: Python Backend Changes

### New Files

```
app/core/docling_processor.py       # Docling parsing + HybridChunker
app/utils/file_type.py              # File type detection + conversion-need check
scripts/reset_and_migrate.py        # Clear old data + create new mappings
```

### Existing Files to Modify

#### 1. `app/config.py`
- `pdf_storage_path` → `file_storage_path` (keep the old name as an alias)
- Add:
  ```python
  # Supported file formats
  supported_file_types: list[str] = [
      "pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx",
      "wps", "et", "ofd", "png", "jpg", "jpeg", "tiff", "bmp",
      "txt", "md", "markdown",
  ]
  max_file_size_mb: int = 100

  # Docling
  docling_ocr_enabled: bool = True        # Enable RapidOCR for scanned documents
  docling_max_tokens: int = 512           # HybridChunker token limit

  # Java conversion service
  converter_base_url: str = "http://localhost:8901"
  converter_timeout: int = 120            # seconds

  # Formats that need conversion (sent to the Java service)
  formats_need_conversion: list[str] = ["doc", "xls", "ppt", "wps", "et", "ofd"]
  ```

#### 2. `app/core/docling_processor.py` (new file)

```python
from __future__ import annotations

from pathlib import Path


class DoclingProcessor:
    """Parse multi-format documents with Docling and chunk them by structure."""

    def __init__(self, *, max_tokens=512, ocr_enabled=True):
        # Initialize DocumentConverter and configure the RapidOCR pipeline
        ...

    def process(self, file_path: Path) -> ProcessedDocument:
        """
        1. doc = DocumentConverter.convert(file_path).document
        2. HybridChunker.chunk(doc) → list of DocChunk
        3. Extract from DocChunk.meta:
           - page_number: doc_items[].prov[].page_no
           - heading_hierarchy: meta.headings
           - element_type: doc_items[].label
        4. full_text = doc.export_to_markdown()
        Returns ProcessedDocument(full_text, chunks_with_meta, page_count, file_type)
        """
```

TXT files bypass Docling: the text is read directly and chunked with the existing `DocumentChunker.chunk_document()`.
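Step 3 of `process()` reads three things from each chunk's metadata. The sketch below shows that extraction on any object exposing the attribute paths listed in the docstring (`meta.doc_items[].prov[].page_no`, `meta.headings`, `meta.doc_items[].label`). The function name `extract_chunk_meta` is hypothetical, and picking the lowest page and the first item's label as the single-valued fields is an assumption, not something the plan specifies.

```python
from __future__ import annotations

from typing import Any


def extract_chunk_meta(chunk: Any) -> dict[str, Any]:
    """Collect page numbers, heading path, and element type from a DocChunk-like object."""
    pages: set[int] = set()
    labels: list[str] = []
    for item in chunk.meta.doc_items:
        for prov in item.prov:
            pages.add(prov.page_no)
        # Labels may be enum members; fall back to the raw value if they are plain strings.
        labels.append(str(getattr(item.label, "value", item.label)))
    page_numbers = sorted(pages)
    return {
        # Assumption: the chunk's primary page is the lowest page it touches.
        "page_number": page_numbers[0] if page_numbers else None,
        "page_numbers": page_numbers,
        "heading_hierarchy": list(chunk.meta.headings or []),
        # Assumption: the first item's label stands for the whole chunk.
        "element_type": labels[0] if labels else "",
    }
```

Formats without pagination produce no provenance pages, in which case `page_number` stays `None` and `page_numbers` is empty.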

#### 3. `app/core/chunker.py`
- New fields on the `Chunk` dataclass:
    - `page_number: int | None = None`
    - `page_numbers: list[int] = field(default_factory=list)`
    - `heading_hierarchy: list[str] = field(default_factory=list)`
    - `element_type: str = ""`
- `to_dict()` emits the new fields
- Add a `chunks_from_docling()` factory method: DoclingProcessor output → list of Chunk
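A possible shape for the extended dataclass and the factory is sketched below. The fields `text` and `chunk_index` are placeholders for whatever the existing `Chunk` already carries, and the factory's input (a list of dicts with a `text` key plus the new metadata keys) is an assumed intermediate format, not the actual DoclingProcessor output type.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class Chunk:
    # Placeholder fields standing in for the existing Chunk definition (assumed).
    text: str
    chunk_index: int
    # New structural fields from the plan.
    page_number: int | None = None
    page_numbers: list[int] = field(default_factory=list)
    heading_hierarchy: list[str] = field(default_factory=list)
    element_type: str = ""

    def to_dict(self) -> dict[str, Any]:
        """Serialize every field, including the new ones."""
        return asdict(self)


def chunks_from_docling(items: list[dict[str, Any]]) -> list[Chunk]:
    """Map per-chunk dicts (assumed DoclingProcessor output) to Chunk objects."""
    return [
        Chunk(
            text=item["text"],
            chunk_index=i,
            page_number=item.get("page_number"),
            page_numbers=list(item.get("page_numbers", [])),
            heading_hierarchy=list(item.get("heading_hierarchy", [])),
            element_type=item.get("element_type", ""),
        )
        for i, item in enumerate(items)
    ]
```

Giving every new field a default keeps the TXT path working, since chunks produced by `DocumentChunker` have no page or heading information.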

#### 4. `app/core/ingest_pipeline.py`
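- Parameter `pdf_path` → `file_path` (old callers do not need to be supported)
- `_handle_new_content()`:
    1. Detect the file type
    2. If conversion is needed → call the Java conversion service HTTP API → get the converted file path
    3. Non-TXT → `DoclingProcessor.process()` → `chunks_from_docling()`
    4. TXT → read the text → `DocumentChunker.chunk_document()`
- File storage: `{content_hash}.{ext}`
- `_write_doc_meta()` gains a `file_type` field
- **MD5 deduplication**: after computing the MD5, `ingest_document()` calls `find_by_content_hash()`; if the hash already exists it goes straight to the `_handle_duplicate()` fast path (the existing logic already does this and stays unchanged)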
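The deduplication check runs before any conversion or parsing, so a repeated upload costs one hash and one lookup. A minimal sketch follows; the three callables are passed in as parameters here only to keep the example self-contained, and their signatures are assumptions about the existing pipeline methods of the same names.

```python
from __future__ import annotations

import hashlib
from pathlib import Path
from typing import Any, Callable


def file_md5(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file in blocks so large uploads are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def stored_name(content_hash: str, ext: str) -> str:
    """File storage name from the plan: {content_hash}.{ext}."""
    return f"{content_hash}.{ext}"


def ingest_document(
    path: Path,
    find_by_content_hash: Callable[[str], Any],
    handle_duplicate: Callable[[Any], Any],
    handle_new_content: Callable[[Path, str], Any],
) -> Any:
    """Take the duplicate fast path when the hash is known, else run the full pipeline."""
    content_hash = file_md5(path)
    existing = find_by_content_hash(content_hash)
    if existing is not None:
        return handle_duplicate(existing)       # skips conversion, Docling, embedding
    return handle_new_content(path, content_hash)
```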

#### 5. `app/api/v1/ingest.py`
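- Webhook upload:
    - Detect the file type; validate it against the whitelist and check the file size
    - Temporary file name: `_tmp_{doc_id}.{ext}` (based on the real format)
- `trigger_ingest` takes `file_path` instead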
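The upload checks above could be combined into a single validation step. This sketch hand-rolls a few magic-byte prefixes purely for illustration; the plan lists the `filetype` package for real detection, and `validate_upload`, `sniff_container`, and `temp_name` are hypothetical names. Extensions without an entry in `EXPECTED_CONTAINER` are only checked against the whitelist and size limit.

```python
from __future__ import annotations

# Values copied from the config section of this plan.
SUPPORTED_FILE_TYPES = {
    "pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx",
    "wps", "et", "ofd", "png", "jpg", "jpeg", "tiff", "bmp",
    "txt", "md", "markdown",
}
MAX_FILE_SIZE_MB = 100

# Leading bytes of a few well-known containers (illustrative, not exhaustive).
MAGIC_PREFIXES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),                          # docx/xlsx/pptx are ZIP packages
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),    # legacy doc/xls/ppt
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
]
EXPECTED_CONTAINER = {
    "pdf": "pdf", "docx": "zip", "xlsx": "zip", "pptx": "zip",
    "doc": "ole2", "xls": "ole2", "ppt": "ole2",
    "png": "png", "jpg": "jpg", "jpeg": "jpg",
}


def sniff_container(head: bytes) -> str | None:
    """Return the container kind implied by the leading bytes, if recognized."""
    for prefix, kind in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return kind
    return None


def validate_upload(filename: str, data: bytes) -> str:
    """Check whitelist, size limit, and content/extension agreement; return the extension."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in SUPPORTED_FILE_TYPES:
        raise ValueError(f"file type not allowed: {ext!r}")
    if len(data) > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise ValueError(f"file exceeds {MAX_FILE_SIZE_MB} MB")
    expected = EXPECTED_CONTAINER.get(ext)
    if expected is not None and sniff_container(data[:16]) != expected:
        raise ValueError(f"content does not look like .{ext}")
    return ext


def temp_name(doc_id: str, ext: str) -> str:
    """Temporary file name from the plan: _tmp_{doc_id}.{ext}."""
    return f"_tmp_{doc_id}.{ext}"
```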

#### 6. `app/tasks/ingest_task.py`
- Parameter `pdf_path` → `file_path`

#### 7. `app/infrastructure/es_client.py`
- New fields in `GOV_DOC_CHUNKS_MAPPING`:
    - `page_number` (integer)
    - `page_numbers` (integer[])
    - `heading_hierarchy` (keyword[])
    - `element_type` (keyword)
    - `file_type` (keyword)
- New fields in `GOV_DOC_META_MAPPING`:
    - `file_type` (keyword)
    - `file_path` (keyword, index: false)
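Expressed as mapping properties, the new fields might look like the following. Array-valued fields use the element type, since the index mapping has no separate array type. The `{"mappings": {"properties": ...}}` layout of the existing mapping constants and the helper `extend_mapping` are assumptions made for this sketch.

```python
from __future__ import annotations

import copy
from typing import Any

NEW_CHUNK_FIELDS: dict[str, dict[str, Any]] = {
    "page_number": {"type": "integer"},
    "page_numbers": {"type": "integer"},        # holds a list of integers
    "heading_hierarchy": {"type": "keyword"},   # holds a list of keywords
    "element_type": {"type": "keyword"},
    "file_type": {"type": "keyword"},
}

NEW_META_FIELDS: dict[str, dict[str, Any]] = {
    "file_type": {"type": "keyword"},
    "file_path": {"type": "keyword", "index": False},  # stored but not searchable
}


def extend_mapping(mapping: dict[str, Any], new_fields: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of an index mapping with the new properties merged in."""
    merged = copy.deepcopy(mapping)
    merged.setdefault("mappings", {}).setdefault("properties", {}).update(new_fields)
    return merged
```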

#### 8. `app/api/v1/document.py`
- The download endpoint picks the MIME type and file extension from `file_type`
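One way to do that lookup is a static table with a generic fallback. In this sketch `download_headers` is a hypothetical helper, the download file name `{doc_id}.{file_type}` is an assumption, and `wps`, `et`, and `ofd` deliberately fall through to `application/octet-stream`.

```python
from __future__ import annotations

MIME_BY_FILE_TYPE = {
    "pdf": "application/pdf",
    "doc": "application/msword",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xls": "application/vnd.ms-excel",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt": "application/vnd.ms-powerpoint",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "tiff": "image/tiff",
    "bmp": "image/bmp",
    "txt": "text/plain",
    "md": "text/markdown",
    "markdown": "text/markdown",
}


def download_headers(doc_id: str, file_type: str) -> dict[str, str]:
    """Build response headers for a stored document of the given type."""
    ext = file_type.lower()
    mime = MIME_BY_FILE_TYPE.get(ext, "application/octet-stream")
    return {
        "Content-Type": mime,
        "Content-Disposition": f'attachment; filename="{doc_id}.{ext}"',
    }
```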

#### 9. `requirements.txt`
```
# Added
docling>=2.78.0
filetype>=1.2.0
charset-normalizer>=3.3.0
# Removed
# PyMuPDF==1.26.5  (Docling ships its own PDF parsing)
```

### Data Reset Script `scripts/reset_and_migrate.py`

```python
"""Clear old index data and rebuild the mappings."""
# 1. Delete the old indices gov_doc_chunks and gov_doc_meta
# 2. Recreate the indices with the new mappings
# 3. Clear the Neo4j graph data (MATCH (n) DETACH DELETE n)
```
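The three steps could be fleshed out roughly as follows. The clients are passed in rather than constructed, so this sketch makes no claim about how the project connects to OpenSearch or Neo4j; it assumes a search client exposing `indices.exists/delete/create` and a Neo4j driver exposing `session().run()`, and `reset_indices` and `clear_graph` are hypothetical names.

```python
from __future__ import annotations

from typing import Any


def reset_indices(search_client: Any, mappings: dict[str, dict]) -> None:
    """Drop each index if it exists, then recreate it with the new mapping."""
    for name, body in mappings.items():
        if search_client.indices.exists(index=name):
            search_client.indices.delete(index=name)
        search_client.indices.create(index=name, body=body)


def clear_graph(neo4j_driver: Any) -> None:
    """Delete every node and relationship in the graph."""
    with neo4j_driver.session() as session:
        # consume() waits for the statement to finish before the session closes.
        session.run("MATCH (n) DETACH DELETE n").consume()
```

Usage would be along the lines of `reset_indices(client, {"gov_doc_chunks": GOV_DOC_CHUNKS_MAPPING, "gov_doc_meta": GOV_DOC_META_MAPPING})` followed by `clear_graph(driver)`. Both operations are destructive and irreversible, which matches the plan's decision not to keep old data.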

---

## Implementation Order

### Phase 1: Java Conversion Service
1. Create the `converter/` project skeleton (pom.xml, Application, application.yml)
2. `FileFormatDetector.java`: magic-bytes format detection with Apache Tika
3. `DocumentConvertService.java`: Spire.Office conversion logic + logging
4. `ConvertController.java`: REST API
5. `Dockerfile` + add the doc-converter service to docker-compose.yml

### Phase 2: Python Infrastructure
6. Update `requirements.txt`
7. Add the new settings to `app/config.py`
8. Update the mappings in `app/infrastructure/es_client.py`
9. `scripts/reset_and_migrate.py`: clear old data + rebuild
10. `app/utils/file_type.py`: file type detection

### Phase 3: Docling Processor
11. `app/core/docling_processor.py`: DoclingProcessor (RapidOCR enabled)
12. `app/core/chunker.py`: new Chunk fields + `chunks_from_docling()`

### Phase 4: Pipeline Integration
13. `app/core/ingest_pipeline.py`: integrate DoclingProcessor + conversion service calls
14. `app/tasks/ingest_task.py`: parameter update
15. `app/api/v1/ingest.py`: multi-format upload support
16. `app/api/v1/document.py`: multi-format download
17. `docker-compose.yml`: make celery-worker depend on doc-converter

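### Phase 5: Cleanup & Testing
18. Run `reset_and_migrate.py` to clear the old data
19. Test: upload documents in each format → verify that chunks contain page_number and heading_hierarchy
20. Test: upload formats that need conversion (doc/wps/et/ofd) → verify the conversion service logs and successful ingestion
21. Test: upload the same file twice → verify that MD5 deduplication skips processing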

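## Verification

1. Start the Java conversion service and verify the `/api/convert` endpoint
2. `pip install -r requirements.txt`
3. `python scripts/reset_and_migrate.py` to clear old data and rebuild the indices
4. Restart the backend and the celery worker
5. Upload documents in each directly supported format (pdf/docx/xlsx/pptx/images/txt/md) and verify that ingestion succeeds
6. Upload formats that need conversion (doc/xls/ppt/wps/et/ofd) and check that:
    - The Java service log contains the original format, output format, duration, and sizes
    - The OpenSearch chunks contain page_number and heading_hierarchy
7. Upload the same file again and confirm that MD5 deduplication takes effect
8. Search for the new documents via `/v1/search` and confirm that results are returned as expected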