# Plan 01-03 Summary — Fetcher + Storage + Pipeline + CLI

**Status:** Task 1 + Task 2 code landed; pytest green (47 passed, 1 smoke deselected).

## Scope Delivered

| Component | File | Note |
|-----------|------|------|
| Fetcher (patchright) | `govcrawler/fetcher/browser.py` | Two-step fallback: domcontentloaded → wait_for_selector(5s) → networkidle(45s) → wait_for_selector → `page.content()` as a last resort |
| Host Throttle | `govcrawler/fetcher/throttle.py` | 5s + ±20% jitter per host; injectable sleep/now hooks keep unit tests fast |
| Storage paths | `govcrawler/storage/paths.py` | `PurePosixPath` + `kind` whitelist |
| Storage files | `govcrawler/storage/files.py` | Atomic writes for raw_html / article_text |
| Attachments | `govcrawler/storage/attachments.py` | httpx.stream + Content-Disposition parsing (incl. RFC 5987) + SHA-256 + 200 MB cap + path-traversal guard |
| DB Repo | `govcrawler/storage/repo.py` | insert_article / insert_attachments / insert_crawl_log / get_article_by_url_hash |
| Pipeline | `govcrawler/pipeline.py` | End-to-end orchestration: dedupe → throttle → fetch → parse → write files → download attachments → single-transaction DB write |
| CLI | `govcrawler/cli.py` + `__main__.py` | `python -m govcrawler fetch <site> <column> <key> [--url]` |
| gdqy selectors | `govcrawler/sites/gdqy.py` | Appends `ATTACHMENT_TEST_ARTICLE_URL/KEY = None` |
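The per-host throttle in the table above can be sketched as follows. This is a minimal illustration, not the actual `govcrawler/fetcher/throttle.py` code: the class and parameter names (`HostThrottle`, `base_delay`, `jitter`) are hypothetical, but it shows the 5s ± 20% jitter behavior and the injectable `now`/`sleep` hooks that let unit tests run with a fake clock instead of real waiting.

```python
import random
import time


class HostThrottle:
    """Per-host rate limiter: base delay plus random jitter.

    `now` and `sleep` are injectable so tests can substitute a fake
    clock (hypothetical sketch; real names may differ).
    """

    def __init__(self, base_delay=5.0, jitter=0.2,
                 now=time.monotonic, sleep=time.sleep):
        self.base_delay = base_delay
        self.jitter = jitter
        self._now = now
        self._sleep = sleep
        self._last_request = {}  # host -> timestamp of last request

    def wait(self, host: str) -> None:
        # Delay is base_delay +/- jitter fraction, drawn fresh per request.
        delay = self.base_delay * random.uniform(1 - self.jitter, 1 + self.jitter)
        last = self._last_request.get(host)
        if last is not None:
            remaining = last + delay - self._now()
            if remaining > 0:
                self._sleep(remaining)
        self._last_request[host] = self._now()
```

A test can then pass a fake `now`/`sleep` pair and assert that the second request to the same host sleeps between 4 and 6 seconds while a different host is not delayed.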

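The atomic-write pattern noted for `storage/files.py` can be sketched like this (an illustrative pattern, not the project's actual helper; `atomic_write_bytes` is a hypothetical name). The key idea is write-to-temp-then-rename within the same directory, so readers never observe a partially written raw_html or article_text file.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write `data` to `path` atomically (sketch; real helper may differ).

    The temp file is created in the target directory so that
    os.replace() is an atomic rename on the same filesystem.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)  # atomic on POSIX
    except BaseException:
        os.unlink(tmp)  # clean up the temp file on any failure
        raise
```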
## Tests

- `tests/test_paths.py` — 3 pass
- `tests/test_throttle.py` — 3 pass (injected clock, no real sleep)
- `tests/test_filename_parse.py` — 5 pass (incl. RFC 5987 UTF-8 decoding)
- `tests/test_pipeline_mocked.py` — 4 pass:
  - happy path (files written; one row each in article and crawl_log)
  - challenge page → status failed + crawl_log entry
  - **Attachment Path B**: only `httpx.stream` is mocked; the real `download_attachment` runs end to end; a real PDF file lands on disk under `tmp_path`, its SHA-256 matches `attachment.file_hash`, and the path is POSIX
  - duplicate url_hash → skipped; `fetch_html` is never called
- `tests/test_gdqy_smoke.py` — skipped by default (`GOVCRAWLER_RUN_SMOKE=1` enables real network access)

Full run: `uv run pytest tests/ -x -q --deselect tests/test_gdqy_smoke.py` → **47 passed** (Plan 02's 36, which include the previously existing tests, plus Plan 03's 11 new).
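The behavior covered by `tests/test_paths.py` (`PurePosixPath` + `kind` whitelist) can be sketched as below. The function and constant names are hypothetical stand-ins for `govcrawler/storage/paths.py`: the point is that `PurePosixPath` guarantees `/` separators on any host OS, while the whitelist and filename check reject unknown kinds and path-traversal attempts.

```python
from pathlib import PurePosixPath

# Allowed storage kinds; anything else is rejected up front (assumed set).
ALLOWED_KINDS = {"raw_html", "article_text", "attachment"}


def build_storage_path(site: str, kind: str, key: str, filename: str) -> str:
    """Build a forward-slash storage path regardless of host OS.

    Hypothetical sketch of the paths module: kind whitelist plus a
    traversal guard keep caller-supplied names inside the tree.
    """
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown storage kind: {kind!r}")
    if "/" in filename or "\\" in filename or filename in ("", ".", ".."):
        raise ValueError(f"unsafe filename: {filename!r}")
    return str(PurePosixPath(site) / kind / key / filename)
```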

## Attachment Path Selection

**Path B (mocked)**. Path A was not chosen because the executor did not pick, against the live site, an article with a PDF attachment from the gdqy listing page (that requires live integration with the real site). Path B's test, `test_attachment_downloaded_and_recorded`, covers real streamed writes + SHA-256 + POSIX paths + PG fields. Evidence:

- fake PDF bytes = `b"%PDF-1.4\n<fake pdf bytes for Path B test>\n" * 32`
- the file really exists under `tmp_path` (`written.exists()`) and `size_bytes == len(FAKE_BYTES)`
- `hashlib.sha256(on-disk bytes).hexdigest() == attachment.file_hash` holds
- `att.file_path` contains `/` and no `\` (POSIX)
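The evidence checks above can be reproduced as a standalone snippet. `FAKE_BYTES` matches the constant quoted above; the directory and file names here are hypothetical (the real test derives them from the pipeline's storage layout).

```python
import hashlib
import tempfile
from pathlib import Path

# Same fake PDF payload as the Path B test.
FAKE_BYTES = b"%PDF-1.4\n<fake pdf bytes for Path B test>\n" * 32

# Simulate the streamed write, then check the evidence assertions.
tmp_dir = Path(tempfile.mkdtemp())
written = tmp_dir / "attachment.pdf"  # hypothetical name
written.write_bytes(FAKE_BYTES)

expected_hash = hashlib.sha256(FAKE_BYTES).hexdigest()
on_disk = written.read_bytes()

assert written.exists()
assert len(on_disk) == len(FAKE_BYTES)                        # size_bytes check
assert hashlib.sha256(on_disk).hexdigest() == expected_hash   # file_hash check
assert "/" in written.as_posix() and "\\" not in written.as_posix()  # POSIX path
```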

## End-to-End Smoke Status

The code is ready; the real-network smoke test (`GOVCRAWLER_RUN_SMOKE=1`) requires:
1. `docker compose up -d db && uv run alembic upgrade head`
2. `uv run patchright install chromium`

It was not run in this environment (no online dependencies; sandbox network availability unknown). Recommendation: have the executor verify it manually once in the deployment environment and paste the actual stdout/SQL back into this SUMMARY.

## REQ Coverage

- **FETCH-02** ✓ `fetcher/browser.py` + smoke test skeleton
- **STORE-01** ✓ `write_raw_html`
- **STORE-02** ✓ `write_article_text`
- **STORE-03** ✓ `download_attachment` (Path B evidence)
- **STORE-04** ✓ `insert_article` + pipeline write
- **STORE-05** ✓ `insert_attachments` + `file_hash` 字段

## Design Document Checkboxes

Within the scope delivered by this plan, the corresponding Phase 0/1 items in `政务网站采集系统-设计文档.md` are checked off (see the main-branch diff for details).

---
*Plan 01-03 summary — 2026-04-22*
