### Pitfall 7: Playwright fails to download Chromium on first run

**What goes wrong:** CI or a colleague's machine never ran `patchright install chromium`, so fetching fails with `Executable doesn't exist at /.../chromium`.

**How to avoid:** Spell out the three setup steps in `README.md`: `uv pip install -r requirements.txt` → `patchright install chromium` → `docker compose up -d db`. Make `patchright install --with-deps chromium` the first CI step.

## Code Examples

### patchright Fetcher startup (stealth by default)

```python
# govcrawler/fetcher/playwright_fetcher.py
# Source: patchright README https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python
import time
from dataclasses import dataclass

from patchright.async_api import async_playwright


@dataclass
class FetchResult:
    url: str
    final_url: str
    status: int
    html: str
    fetched_at: float


USER_AGENT = "GovCrawlerBot/1.0 (contact: xxx@example.com)"


async def fetch(url: str, *, timeout_ms: int = 30000) -> FetchResult:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            # patchright's Chromium is already patched, so this flag is not strictly
            # required; it is kept because it helps slightly against the ctct shield
            # and does not interfere with patchright's patches.
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            user_agent=USER_AGENT,
            locale="zh-CN",
            timezone_id="Asia/Shanghai",
            viewport={"width": 1440, "height": 900},
        )
        page = await context.new_page()
        try:
            response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            # Wait up to 5s for the article body to appear; a timeout is not an error
            try:
                await page.wait_for_selector("div.article-content, div.content, body", timeout=5000)
            except Exception:
                pass
            html = await page.content()
            final = page.url
            status = response.status if response else 0
            return FetchResult(url=url, final_url=final, status=status,
                               html=html, fetched_at=time.time())
        finally:
            await context.close()
            await browser.close()
```

### Challenge page detection

```python
# govcrawler/fetcher/playwright_fetcher.py
CHALLENGE_MARKERS = ("ctct-slider-canvas", "请稍候", "请稍候…")


def is_challenge_page(fr: FetchResult) -> bool:
    if fr.status == 412:
        return True
    return any(m in fr.html for m in CHALLENGE_MARKERS)
```

### parsel list extraction

```python
# govcrawler/parser/list_parser.py
# Source: parsel docs https://parsel.readthedocs.io/
from dataclasses import dataclass
from urllib.parse import urljoin

from parsel import Selector


@dataclass
class ListItem:
    url: str
    title: str
    publish_time: str  # keep raw string; parse later


def parse_list(html: str, base_url: str) -> list[ListItem]:
    sel = Selector(text=html)
    items = []
    # List selectors for the gdqy site (hard-coded in this phase; Phase 2 → YAML)
    for row in sel.css("ul.list_news li"):
        href = row.css("a::attr(href)").get()
        title = row.css("a::text").get(default="").strip()
        date = row.css("span.date::text").get(default="").strip()
        if not href:
            continue
        items.append(ListItem(
            url=urljoin(base_url, href),
            title=title,
            publish_time=date,
        ))
    return items
```

### parsel detail + GNE fallback

```python
# govcrawler/parser/detail_parser.py
from dataclasses import dataclass
from urllib.parse import urljoin

from gne import GeneralNewsExtractor
from parsel import Selector

from govcrawler.parser.html_to_text import html_to_text

GNE = GeneralNewsExtractor()


@dataclass
class DetailFields:
    title: str
    publish_time: str
    source: str
    content_html: str
    content_text: str
    attachment_urls: list[str]
    used_fallback: bool


ATTACH_CSS = ", ".join(
    f"a[href$='.{ext}'], a[href*='.{ext}?']"
    for ext in ("pdf", "doc", "docx", "xls", "xlsx", "zip")
)


def parse_detail(html: str, base_url: str) -> DetailFields:
    sel = Selector(text=html)
    title = sel.css("h1.article-title::text").get(default="").strip()
    pub = sel.css("span.time::text").get(default="").strip()
    source = sel.css("span.source::text").get(default="").strip()
    content_el = sel.css("div.article-content")
    content_html = content_el.get() or ""
    # Run the comma-separated selector list against the content node itself.
    # Prefixing "div.article-content " and appending "::attr(href)" to the joined
    # string would only apply to the first and last alternatives.
    attachments = content_el.css(ATTACH_CSS).xpath("@href").getall()

    used_fallback = False
    if len(content_html) < 200 or not title:
        # GNE fallback: it takes the full HTML plus the page URL
        result = GNE.extract(html, host=base_url)
        content_html = result.get("content", content_html)
        if not title:
            title = result.get("title", "")
        if not pub:
            pub = result.get("publish_time", "")
        used_fallback = True

    content_text = html_to_text(content_html)
    # Make attachment URLs absolute
    attachments = [urljoin(base_url, a) for a in attachments if a]
    return DetailFields(title, pub, source, content_html, content_text,
                        attachments, used_fallback)
```

### HTML → plain text (paragraphs preserved)

```python
# govcrawler/parser/html_to_text.py
from lxml import html as lxml_html

BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


def html_to_text(html_str: str) -> str:
    if not html_str:
        return ""
    doc = lxml_html.fragment_fromstring(html_str, create_parent="div")
    # Insert a newline after every block-level tag
    for el in doc.iter():
        if el.tag in BLOCK_TAGS:
            if el.tail:
                el.tail = "\n" + el.tail
            else:
                el.tail = "\n"
    text = doc.text_content()
    # Normalize line breaks and drop empty lines
    lines = [ln.strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)
```

### SimHash computation (Chinese 2-grams)

```python
# govcrawler/dedup.py
# Source: simhash package https://github.com/1e0ng/simhash
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

from simhash import Simhash


def content_simhash(text: str) -> str:
    """Return 16 hex characters (64-bit)."""
    if not text:
        return "0" * 16
    # Character 2-grams: more stable than word segmentation and no jieba dependency
    features = [text[i:i+2] for i in range(len(text) - 1)]
    sh = Simhash(features, f=64)
    return f"{sh.value:016x}"


def normalize_url(u: str) -> str:
    p = urlparse(u.strip())
    scheme = "https"
    host = p.netloc.lower()
    path = p.path.rstrip("/") or "/"
    q = urlencode(sorted(parse_qsl(p.query, keep_blank_values=True)))
    return urlunparse((scheme, host, path, "", q, ""))


def url_hash(url: str) -> str:
    return hashlib.sha256(normalize_url(url).encode()).hexdigest()
```

### Streaming attachment download + filename parsing

```python
# govcrawler/storage/files.py
import hashlib
from email.message import Message
from pathlib import Path

import httpx
from slugify import slugify  # python-slugify; keeps Chinese via `allow_unicode=True`

UA = "GovCrawlerBot/1.0 (contact: xxx@example.com)"


def filename_from_disposition(header: str | None, fallback: str) -> str:
    if header:
        m = Message()
        m["content-disposition"] = header
        # get_param() reads Content-Type unless header= is passed; get_filename()
        # reads the Content-Disposition filename and decodes RFC 2231/5987 values
        # such as filename*=UTF-8''%E5%85%AC%E5%91%8A.pdf
        name = m.get_filename()
        if name:
            return name
    return fallback


def safe_filename(name: str) -> str:
    # Keep Chinese characters, replace characters that are illegal in file names
    return slugify(name, allow_unicode=True, separator="_",
                   regex_pattern=r'[\\/:*?"<>|\x00-\x1f]')


def download_attachment(url: str, dest_dir: Path, article_key: str) -> tuple[Path, int, str, str]:
    """Return (local_path, size_bytes, file_hash, file_name)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    hasher = hashlib.sha256()
    size = 0
    tmp = dest_dir / f".{article_key}.part"
    with httpx.stream("GET", url, headers={"User-Agent": UA},
                      follow_redirects=True, timeout=60) as r:
        r.raise_for_status()
        disp = r.headers.get("content-disposition")
        url_base = url.rsplit("/", 1)[-1].split("?", 1)[0] or "attachment.bin"
        raw_name = filename_from_disposition(disp, url_base)
        final_name = f"{article_key}_{safe_filename(raw_name)}"
        final_path = dest_dir / final_name
        with open(tmp, "wb") as f:
            for chunk in r.iter_bytes(chunk_size=64 * 1024):
                f.write(chunk)
                hasher.update(chunk)
                size += len(chunk)
    # replace() also overwrites an existing target on Windows
    tmp.replace(final_path)
    return final_path, size, hasher.hexdigest(), raw_name
```

### Storage path construction (POSIX-stable)

```python
# govcrawler/storage/paths.py
from datetime import datetime
from pathlib import Path, PurePosixPath


def build_relpath(site: str, column: str, when: datetime, article_key: str, kind: str) -> PurePosixPath:
    """kind in {'raw_html', 'articles_text', 'attachments'}; article suffix added by caller."""
    return PurePosixPath(kind, site, column, f"{when:%Y}", f"{when:%m}", article_key)


def to_os_path(root: Path, rel: PurePosixPath) -> Path:
    return root.joinpath(*rel.parts)
```

### Alembic initial migration (column excerpt)

```python
# alembic/versions/0001_initial_schema.py
# Source: alembic docs https://alembic.sqlalchemy.org/
from alembic import op
import sqlalchemy as sa

revision = "0001_initial"
down_revision = None


def upgrade():
    op.create_table(
        "sites",
        sa.Column("id", sa.BigInteger, primary_key=True),
        sa.Column("site_id", sa.String(64), unique=True, nullable=False),
        sa.Column("name", sa.Text),
        sa.Column("base_url", sa.Text),
        sa.Column("config_json", sa.JSON),
        sa.Column("enabled", sa.Boolean, server_default=sa.text("true")),
        sa.Column("created_at", sa.DateTime, server_default=sa.func.now()),
    )
    op.create_table(
        "articles",
        sa.Column("id", sa.BigInteger, primary_key=True),
        sa.Column("site_id", sa.String(64), nullable=False),
        sa.Column("column_id", sa.String(64), nullable=False),
        sa.Column("category", sa.String(64)),
        sa.Column("url", sa.Text, nullable=False),
        sa.Column("url_hash", sa.CHAR(64), nullable=False, unique=True),
        sa.Column("content_simhash", sa.CHAR(16)),
        sa.Column("title", sa.Text),
        sa.Column("publish_time", sa.DateTime),
        sa.Column("source", sa.String(128)),
        sa.Column("content_text", sa.Text),
        sa.Column("raw_html_path", sa.Text),
        sa.Column("text_path", sa.Text),
        sa.Column("has_attachment", sa.Boolean, server_default=sa.text("false")),
        sa.Column("status", sa.String(16), server_default="raw"),  # raw / ready / failed
        sa.Column("fetch_strategy", sa.String(32)),
        sa.Column("fetched_at", sa.DateTime, server_default=sa.func.now()),
        sa.Column("exported_to_rag_at", sa.DateTime),
    )
    op.create_index("ix_articles_site_col_pub", "articles",
                    ["site_id", "column_id", sa.text("publish_time DESC")])
    op.create_index("ix_articles_exported", "articles", ["exported_to_rag_at"])
    # attachments / columns / crawl_logs similar...


def downgrade():
    op.drop_table("articles")
    op.drop_table("sites")
```

### pytest: selector failure → GNE fallback drill

```python
# tests/test_parser_fallback.py
from pathlib import Path

from govcrawler.parser.detail_parser import parse_detail


def test_gne_fallback_when_xpath_fails():
    html = Path("tests/fixtures/gdqy_post_2136593.html").read_text(encoding="utf-8")
    # parse_detail hard-codes its selectors in this phase, so a selector failure is
    # simulated by renaming the content class in the fixture (assumes the fixture
    # uses div.article-content). Cleaner option once parse_detail accepts a cfg:
    # pass a cfg whose selectors point at non-existent nodes.
    broken_html = html.replace("article-content", "nonexistent")
    result = parse_detail(broken_html, base_url="https://www.gdqy.gov.cn/")
    assert result.used_fallback is True
    assert len(result.content_text) > 50
    assert "2026年第四届全国轻型飞机锦标赛" in result.content_text
```

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `playwright-stealth` 1.x (original author AtuboDad has stopped updating it) | `playwright-stealth` 2.x (new maintainer Mattwmaster58) or `patchright` | 2025 | The API changed from v1; v2 is actively updated as of 2026-04, and patchright is a stronger alternative |
| `selenium` + undetected-chromedriver | `patchright` / `playwright-stealth` | 2023+ | Playwright wins across the board: API, performance and maintenance |
| Scrapy as the crawler framework | Pure Playwright + a small in-house framework | 2024+ | Scrapy's Playwright integration is clunky; at small scale (<50 sites) a hand-written layer is thinner |
| psycopg2 | psycopg3 | 2023+ | v2 is in maintenance mode; v3 is the future |
| `readability-lxml` / `html2text` | `trafilatura` / `GNE` | 2021+ | Newer generation based on content density plus machine-learning features; markedly better quality |

**Deprecated/outdated:**
- `playwright-stealth` 1.0.6 (old version; the original repo AtuboDad/playwright_stealth is archived) → move to v2 (the Mattwmaster58 fork).
- Redis 7.4+ → Valkey (already locked in PROJECT.md).

## Assumptions Log

| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | patchright is more reliable than playwright-stealth against the ctct shield | Alternatives | If patchright turns out to be detected, fall back to stealth v2. Low impact: the two APIs are compatible |
| A2 | GNE extracts Chinese government sites better than trafilatura | Alternatives | If GNE quality is poor, switch to trafilatura (updated 2024-12, Apache-2.0); the code change is small |
| A3 | GNE 0.1.3 (released 2019) is still compatible with lxml 5.x | Common Pitfalls #6 | Must be verified in the smoke test; if incompatible, pin `lxml<5` or switch to trafilatura |
| A4 | psycopg3's LGPL-3 is acceptable | Standard Stack ⚠️ | **Needs a user decision**: if the MIT/BSD/Apache whitelist is enforced strictly, switch to `asyncpg` + async SQLAlchemy |
| A5 | A single fetch does not need a browser pool | Pattern 1 | This phase fetches 1 article, so this cannot be wrong |
| A6 | `wait_until="domcontentloaded"` plus an explicit selector wait is more reliable than `networkidle` | Common Pitfalls #2 | If the JS challenge on some pages only finishes after DOMContentLoaded, switch to networkidle or sleep(2). Adjust as needed |

## Open Questions (RESOLVED)

> All three open questions were resolved during `/gsd-discuss-phase` (see CONTEXT.md `additional_locked_decisions`) and Plan 03 design. Kept here for audit trail.

1. **Is psycopg3's LGPL-3 acceptable?**
   **RESOLVED:** During the plan-phase interaction the user confirmed that LGPL-3 dynamic linking is acceptable (a commercial Python `import` does not trigger the copyleft clause; the "not admitted" wording in PROJECT.md §5.2 principle 3 targets static linking and modified-source scenarios). This phase keeps using `psycopg[binary]>=3.2`; no switch to asyncpg is needed.
   - What we knew: PROJECT.md §5.2 principle 3 reads "LGPL (dynamic linking only) is not admitted", yet psycopg3 is the de facto standard in the Python ecosystem.
   - What was unclear: the user's tolerance for LGPL.
   - Resolution evidence: CONTEXT.md `additional_locked_decisions`, entry `DEP-LGPL`.

2. **Consistency of patchright across macOS / Linux / Docker**
   **RESOLVED:** The README states the first-time install step `patchright install chromium`; the bootstrap task in Plan 01 already includes the command, and the smoke-test acceptance_criteria of Plan 03 Task 2 also remind the executor to run it first. CI can add `patchright install --with-deps chromium`.
   - What we knew: patchright downloads a patched-chromium binary, with builds for all three platforms.
   - What was unclear: first-install duration and stability on macOS arm64.
   - Resolution evidence: Plan 01 bootstrap task + step 2 of the Plan 03 Task 2 smoke-test acceptance_criteria.

3. **Where to get the second article (the one with attachments)**
   **RESOLVED:** Plan 03 Task 2 offers two paths, Path A and Path B:
   - **Path A (real gdqy article):** on execution day the executor picks an article with a PDF attachment from the list page `https://www.gdqy.gov.cn/gdqy/newxxgk/fgwj/szfwj/`, sets `ATTACHMENT_TEST_ARTICLE_URL / KEY`, and runs the real end-to-end flow.
   - **Path B (mocked-only):** if no suitable article is found, coverage comes only from `tests/test_pipeline_mocked.py::test_attachment_downloaded_and_recorded`, but that test must really write a file under `tmp_path/attachments/` and assert `os.path.exists` plus a correct `file_hash` (IO must not be fully mocked away).
   - **SUMMARY requirement:** the executor must record in `01-03-SUMMARY.md` whether A or B was chosen and paste the evidence log (stdout of the real fetch, or the list of temporary files produced by the mocked test).
   - Resolution evidence: last paragraph of the Task 2 acceptance_criteria in `01-03-PLAN.md`, "Attachment pipeline verification (choose one)".

## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Python 3.11+ | All | Check on the dev machine | - | Install one with uv |
| Docker (for local PG) | PG dev environment | Needs checking | - | An existing PG instance also works |
| Chromium (via patchright) | Fetcher | First run requires `patchright install chromium` | - | None |
| PostgreSQL 16 | Storage | Started via docker-compose | 16-alpine | - |
| Network access to gdqy.gov.cn | Smoke test | Assumed available | - | Use a fixture with cached HTML |

**Missing dependencies with no fallback:** None (every dependency can be obtained via `uv pip install` + `docker compose up`).

**Missing dependencies with fallback:** None.

**Action item for plan:** Define an explicit "setup / bootstrap" node in the first wave of tasks: Python version check + `uv sync` + `patchright install chromium` + `docker compose up -d db` + `alembic upgrade head`.

## Validation Architecture

> `workflow.nyquist_validation = false` (see `.planning/config.json`). **This section is skipped.**

## Security Domain

This project is an **outbound crawler** (we are the client), not an exposed server. Only minimal controls are applied to our own code:

| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | no | Fetching without login (stated explicitly in COMP-03) |
| V3 Session Management | no | No server-side sessions |
| V4 Access Control | no | No external interface (revisit when Phase 3 introduces REST) |
| V5 Input Validation | yes (limited) | Attachment filename sanitization (python-slugify, against path traversal), URL normalization, SQL through SQLAlchemy parameterization |
| V6 Cryptography | no | Only SHA-256 hashes are computed; no encryption |

### Phase-specific threat patterns

| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| Path traversal via `../etc/passwd` in an attachment's Content-Disposition | Tampering | `python-slugify` + a `final_path.resolve().is_relative_to(dest_dir)` check |
| Unbounded attachment response → disk exhaustion | DoS on self | `httpx.stream` + a size cap (config `MAX_ATTACHMENT_MB=100`) |
| Target URL contains `file://` / `javascript:` | Tampering | `normalize_url` enforces a scheme whitelist of `{http, https}` |
| SQL injection via URL / title | Tampering | SQLAlchemy Core / ORM parameterization (never build SQL with f-strings) |
| Cookie / authorization printed in logs | Info Disclosure | No cookie pool in this phase; `crawl_logs` records only URL/status/duration |

## Sources

### Primary (HIGH confidence)
- PyPI JSON API for each package: versions, upload dates, licenses verified 2026-04-22 [VERIFIED: pypi.org/pypi/{package}/json]
- Playwright Python docs: https://playwright.dev/python/
- Alembic docs: https://alembic.sqlalchemy.org/
- SQLAlchemy 2.0 docs: https://docs.sqlalchemy.org/en/20/
- parsel docs: https://parsel.readthedocs.io/
- psycopg v3 docs: https://psycopg.org/psycopg3/docs/
- simhash repo: https://github.com/1e0ng/simhash

### Secondary (MEDIUM confidence)
- patchright README: https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python (claims effectiveness against Cloudflare/DataDome/Imperva; the ctct shield is not explicitly tested)
- playwright-stealth v2 repo: https://github.com/Mattwmaster58/playwright_stealth (still active in 2026 after the new maintainer took over)
- GNE repo: https://github.com/GeneralNewsExtractor/GeneralNewsExtractor (no new PyPI release, but occasional commits on GitHub)
- trafilatura docs: https://trafilatura.readthedocs.io/ (the alternative if GNE causes problems)

### Tertiary (LOW confidence / needs validation)
- "patchright > playwright-stealth for the ctct shield": [ASSUMED] inferred from community feedback, not compared hands-on in this session. The planner may try patchright first in the smoke test and fall back to stealth on failure.
- "GNE beats trafilatura on Chinese government sites": [ASSUMED] based on long-standing community opinion; no comparative evaluation was done here.

## Project Constraints (from CLAUDE.md)

**No CLAUDE.md was found** in the project root. The applicable main-project constraints have been distilled into this research from `PROJECT.md` and `政务网站采集系统-设计文档.md`:

- **License whitelist:** MIT / BSD / Apache-2.0 / PostgreSQL License / PSF (psycopg3's LGPL-3 was pending user confirmation at research time and has since been accepted; see Open Questions #1)
- **Forbidden:** AGPL / SSPL / BSL / Elastic License
- **Deployment:** single-machine Docker Compose
- **Crawl etiquette:** UA carries identity, interval ≥5s, concurrency 1
- **Storage:** local filesystem + PG; no MinIO / Redis 7.4+ / ES

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH. All versions and licenses were verified live against the PyPI JSON API (2026-04-22)
- Architecture: HIGH. Single-package, single-process PoC; a common industry pattern
- Pitfalls: MEDIUM. Pitfalls 1-4 are based on known ctct shield behavior and common government-site problems; 5-7 are based on common traps in the Python ecosystem
- ctct shield bypass success rate: MEDIUM. There is hands-on evidence that "a real Chromium gets through", but the specific success-rate difference between patchright and stealth was not verified

**Research date:** 2026-04-22
**Valid until:** 2026-05-22 (patchright/playwright release quickly; stealth v2 is iterating fast)

---
*End of RESEARCH.md*

# Phase 1: 地基与 PoC 破冰 - Research **Researched:** 2026-04-22 **Domain:** Playwright stealth 反爬 + 政务站正文抽取 + PostgreSQL 存储骨架 **Confidence:** HIGH（栈选型、版本、架构）／ MEDIUM（ctct 盾成功率、GNE 维护状态） ## Summary 本阶段的核心不确定性只有一点：**`playwright-stealth` 与 `patchright` 到底哪个更能稳定过"知道创宇 ctct 盾"**。两者都在 2026 年活跃维护（stealth v2.0.3 @ 2026-04-04；patchright v1.58.2 @ 2026-03-07），实测证据表明一个**真实 Chromium + 基础反检测补丁**即可自动通过 ctct 盾（不触发真滑块）。**建议选 `patchright`**：它是一个 Playwright 的 drop-in fork，bundle 了 patched-chromium 与全套补丁（包括 navigator.webdriver、runtime、permissions 等），社区反馈对 Cloudflare / 顶象 / 知道创宇类挑战效果更稳，API 与 playwright 100% 兼容，换库不改代码。栈的其余部分是"无脑选型"：parsel 1.11（BSD）做 XPath，GNE 0.1.3（MIT）做正文兜底——GNE 虽然 2019 后无新版但算法稳定且对中文政务站新闻抽取效果最好；SQLAlchemy 2.0 + Alembic + **psycopg3（同步）** 做 DB（本阶段单机单进程 PoC，async 是过度工程）；`simhash` 2.1.2 做 64-bit 正文指纹；httpx 0.28 负责附件流式下载。 **Primary recommendation:** 用 `patchright` 作为唯一 Fetcher，`parsel` 主抽取 + `GNE` 兜底，`SQLAlchemy 2.0 + Alembic + psycopg3 同步`，单包 `govcrawler/` + `fetcher/parser/storage/cli/models` 分层，pydantic-settings 读 `.env`，本地 `docker-compose.yml` 起 `postgres:16-alpine`。 ## User Constraints (from CONTEXT.md) ### Locked Decisions **语言与基础栈** - Python 3.11+ - Playwright（Python 版）+ `playwright-stealth` 或 `patchright`（反检测） - PostgreSQL 16（本地 Docker 或已有实例即可） - 单 Python 包 `govcrawler/`，模块清晰分层（fetcher / parser / storage / models） - 依赖管理用 `uv` 或 `pip-tools`，**不引入 Poetry** **Fetcher（本阶段唯一 Tier）** - 只实现 Playwright stealth 一条路径 - 抓取动作：`page.goto(url, wait_until="networkidle", timeout=30000)`，然后等待正文选择器出现（5s），再 `page.content()` - UA：`GovCrawlerBot/1.0 (contact: xxx@example.com)` - 请求间隔 ≥ 5s（代码需体现节流能力） **Parser** - 列表 + 详情都用 `parsel`（CSS/XPath） - XPath 失败或字段为空时用 `GNE` 兜底抽正文 - 正文同时保留 `content_html`（清洗后 HTML）+ `content_text`（纯文本，保留段落换行） - 附件识别：正文 HTML 中匹配 `a[href$=".pdf"], .doc, .docx, .xls, .xlsx, .zip` **Storage 布局（三目录 + PG）** ``` data/govcrawler/ ├── raw_html/////.html ├── articles_text/////.txt └── attachments/////_ ``` `` = URL 最后一段 stem（如 `post_2136593`）或 `url_hash[:16]`。 **PG 
Schema**：`sites / columns / articles / attachments / crawl_logs`（详见 CONTEXT.md）。Alembic 管理迁移。本阶段只需建表 + 写一条记录。 **去重** - `url_hash` = `sha256(normalized_url)` hex，UNIQUE - `content_simhash` 本阶段只记录不判重 - `file_hash` 本阶段仅存字段 **CLI** - `python -m govcrawler fetch gdqy szfwj post_2136593` 能把一篇硬编码 URL 跑通 - 可观测信息打印 stdout + 写 `crawl_logs` **反爬选择器失效演练** - 测试：故意把主 XPath 换成 `div.nonexistent`，验证 GNE 兜底能抽到正文 ### Claude's Discretion - Python 包内部文件/目录结构细节（models.py / config.py 分层） - Playwright 启动参数（除上面 LOCKED 外） - 浏览器 Context 复用 vs 每次新开 - 单元测试覆盖范围 - 本地开发 PG 连接方式 - `settings.py` 组织（pydantic-settings + `.env`） ### Deferred Ideas (OUT OF SCOPE) - 多站点配置化、YAML → DB 同步 → Phase 2 - httpx / DrissionPage 分层降级 → Phase 2 - Cookie 池（Valkey）→ Phase 2 - SimHash 阈值判重、增量逻辑 → Phase 2 - APScheduler 定时 / 错峰 → Phase 2 - REST API / Prometheus / 飞书告警 / robots.txt / Docker Compose 全栈 → Phase 3 - 管理后台 → v2 ## Phase Requirements | ID | Description | Research Support | |----|-------------|------------------| | FETCH-02 | Playwright + stealth 对强反爬站点执行浏览器抓取并自动过 JS 挑战 | §Standard Stack · Fetcher / §Code Examples · Playwright stealth 启动 | | PARSE-01 | 按 XPath/CSS 选择器抽取列表页 URL、标题、发布时间 | §Code Examples · parsel 列表抽取 | | PARSE-02 | 按 XPath/CSS 抽取详情页标题、发布时间、发布主体、正文 HTML/文本、附件链接 | §Code Examples · parsel 详情抽取 | | PARSE-03 | XPath 抽取失败或字段为空时 GNE 兜底抽正文 | §Standard Stack · Parser · GNE fallback | | PARSE-04 | 正文清洗为纯文本（去广告/导航，保留段落结构） | §Code Examples · HTML → text | | STORE-01 | 原始 HTML 归档到 `raw_html/////` | §Architecture · 目录布局 | | STORE-02 | 正文纯文本落盘到 `articles_text/...` | 同上 | | STORE-03 | 附件落盘到 `attachments/...`，不解析 | §Code Examples · 附件下载 + 文件名清理 | | STORE-04 | 元数据入 `articles` 表（含 url_hash / simhash / raw_html_path / text_path / status / fetched_at / exported_to_rag_at） | §Standard Stack · DB | | STORE-05 | 附件元数据入 `attachments` 表（file_hash 支持后续去重） | 同上 | ## Architectural Responsibility Map | Capability | Primary Tier | Secondary Tier | Rationale | |------------|-------------|----------------|-----------| | ctct 盾 
JS 挑战通过 | Browser (Chromium via patchright) | — | 真实浏览器内核是唯一低成本解法 | | 列表/详情字段抽取 | Parser (in-process) | — | HTML → 结构化字段，纯 CPU 工作 | | 正文兜底抽取 | Parser (GNE, in-process) | — | 选择器失效时的 safety net | | 元数据持久化 | PostgreSQL | — | SQL + 索引 + 后续 RAG 拉增量 | | 原始文件归档 | Local filesystem | — | 单机规模、避开 MinIO(AGPL) | | 附件流式下载 | httpx (in-process) | — | PG 不存二进制；文件系统即可 | | CLI 触发 | `python -m govcrawler` entry | — | argparse / typer 均可 | ## Standard Stack ### Core | Library | Version | Purpose | Why Standard | License | |---------|---------|---------|--------------|---------| | `patchright` | 1.58.2 (2026-03-07) | 反检测浏览器抓取（Playwright drop-in fork） | 内置 patched-chromium + 全套反指纹补丁；对 CF/ctct/顶象实测更稳；API 与 playwright 100% 兼容 | Apache-2.0 | | `playwright` | 1.58.0 (2026-01-30) | patchright 的上游依赖（自动拉入） | —— | Apache-2.0 | | `parsel` | 1.11.0 (2026-01-29) | XPath/CSS 选择器 | Scrapy 亲生，`::text` / `::attr()` 语法糖好用 | BSD-3 | | `GeneralNewsExtractor` (GNE) | 0.1.3 (2019-12-31) | 通用正文兜底抽取 | 中文新闻站抽取最佳开源实现；算法稳定无需更新 | MIT | | `SQLAlchemy` | 2.0.49 (2026-04-03) | ORM + Core | 与 Alembic 原生集成；2.0 风格 | MIT | | `alembic` | 1.18.4 (2026-02-10) | 迁移 | SQLAlchemy 官方 | MIT | | `psycopg` (v3) | 3.3.3 (2026-02-18) | PG 驱动（**同步**） | 官方主推 v3；本阶段同步即可，async 属于过度工程 | LGPL-3 ⚠️ 见下方 | | `simhash` | 2.1.2 (2022-03) | 64-bit SimHash | 纯 Python、零依赖、算法正确 | MIT | | `httpx` | 0.28.1 (2024-12) | 附件流式下载 | 比 requests 现代、支持 HTTP/2、流式 | BSD-3 | | `pydantic-settings` | 2.14.0 (2026-04-20) | `.env` 配置加载 | 与 pydantic v2 原生集成 | MIT | | `lxml` | (随 parsel/GNE 拉入) | 底层解析 | — | BSD-3 | > ⚠️ **psycopg3 许可证注意**：psycopg v3 的 `psycopg[binary]` 与核心包使用 **LGPL-3.0**。LGPL 通过"动态链接 + 不修改源码"的方式使用在商业项目里普遍被视为合规（与 PROJECT.md §5.2 原则 3 "LGPL 仅动态链接使用" 一致），且这是 PostgreSQL Python 生态事实上唯一的一流选择。**但 CLAUDE.md / PROJECT.md 的许可证白名单是 MIT/BSD/Apache-2.0/PostgreSQL/PSF，未列出 LGPL**。[ASSUMED] 需要 planner 或 discuss-phase 向用户确认：接受 LGPL 动态链接，或改用 `asyncpg`（Apache-2.0，但强制 async 代码风格）。若要求严格白名单，则选 `asyncpg` 0.31.0 (Apache-2.0) + SQLAlchemy async。 ### Supporting | Library 
| Version | Purpose | When to Use | |---------|---------|---------|-------------| | `python-slugify` | 8.x | 附件文件名清理（中文保留、非法字符转义） | 附件 safe_filename 生成 | | `tldextract` | 5.x | URL 归一化辅助 | url_hash 前规范化 scheme/host | | `pytest`, `pytest-asyncio` | — | 测试 | 选择器失效兜底验证 | | `pytest-playwright` | — | Playwright pytest 集成 | 可选；本阶段 1 个 smoke test 直接起 patchright 更简单 | ### Alternatives Considered | Instead of | Could Use | Tradeoff | |------------|-----------|----------| | `patchright` | `playwright-stealth 2.0.3` | stealth 是 monkey-patch 补丁集合，依赖你自己正确调用 `Stealth().apply_stealth(context)`；patchright 是 fork，"开箱即过"。stealth 对简单 bot 检测够用；对 ctct/CF 这类商业 WAF，社区反馈 patchright 成功率更高。[CITED: github.com/Kaliiiiiiiiii-Vinyzu/patchright-python README] | | `GNE` | `trafilatura 2.0.0` (Apache-2.0, 2024-12) | trafilatura 在英文/欧洲语言新闻上更强、更新活跃；**但中文政务站（含日期、来源、非典型排版）GNE 历史评测更好**。本阶段用 GNE，若未来中文效果不佳可以切 trafilatura（API 差异小）。[ASSUMED: 基于社区长期反馈，需实测验证] | | `simhash` (单独包) | 基于 `datasketch` MinHash | SimHash 对 64-bit 指纹 + 汉明距离场景最直接；datasketch 主打 MinHash/LSH，重。用专用 `simhash` 包更薄。 | | `psycopg3` 同步 | `asyncpg` + SQLAlchemy async | 本阶段 1 篇文章，async 无收益、测试与 Alembic 更复杂。保持同步。Phase 2 若需要高并发再切换。 | | `uv` | `pip-tools` | 两者都 LOCKED 允许；`uv` 快 10-100× 且生成 lockfile，推荐 `uv`。 | ### Installation ```bash # 依赖 uv pip install \ patchright \ parsel \ 'GeneralNewsExtractor==0.1.3' \ 'SQLAlchemy>=2.0,<2.1' \ alembic \ 'psycopg[binary]>=3.2' \ simhash \ httpx \ 'pydantic-settings>=2.0' \ python-slugify \ tldextract # patchright 自带补丁版 chromium，首次运行需要下载 patchright install chromium # 开发依赖 uv pip install pytest pytest-asyncio pip-licenses ``` **Version verification (npm-equivalent, pypi):** - `patchright` 1.58.2 — uploaded 2026-03-07 [VERIFIED: pypi.org/pypi/patchright/json] - `playwright-stealth` 2.0.3 — uploaded 2026-04-04 [VERIFIED: pypi.org/pypi/playwright-stealth/json] - `SQLAlchemy` 2.0.49 — uploaded 2026-04-03 [VERIFIED: pypi.org/pypi/SQLAlchemy/json] - `alembic` 1.18.4 — uploaded 2026-02-10 [VERIFIED] - `psycopg` 3.3.3 — 
uploaded 2026-02-18 [VERIFIED] - `GNE` 0.1.3 — uploaded 2019-12-31 (pypi 最后版本；github 仓库后续有零星更新但未发版) [VERIFIED] - `pydantic-settings` 2.14.0 — uploaded 2026-04-20 [VERIFIED] ## Architecture Patterns ### System Architecture Diagram ``` CLI (python -m govcrawler fetch gdqy szfwj post_2136593) │ ▼ ┌─────────────────── govcrawler.cli.fetch_article ────────────────────┐ │ │ │ 1. load settings (.env → pydantic Settings) │ │ 2. open DB session (SQLAlchemy engine) │ │ 3. throttle gate (≥5s since last fetch on same host) │ │ │ │ │ │ │ ▼ │ │ ┌────── fetcher.playwright_fetcher ──────┐ │ │ │ async with async_playwright() (from │ ◀─ patchright │ │ │ patchright): │ │ │ │ context = browser.new_context( │ │ │ │ user_agent=GovCrawlerBot/1.0...,)│ │ │ │ page.goto(url, wait_until= │ │ │ │ "networkidle", timeout=30000) │ │ │ │ page.wait_for_selector("body", 5s) │ │ │ │ html = page.content() │ │ │ │ status = response.status │ │ │ │ returns (html, final_url, status) │ │ │ └────────────────┬───────────────────────┘ │ │ ▼ │ │ ┌────── parser.detail_parser ────────────┐ │ │ │ sel = parsel.Selector(html) │ │ │ │ fields = extract_with_xpath(sel, cfg) │ │ │ │ if not fields.content_text: │ │ │ │ fields.update(gne_extract(html)) ◀─┼─ GNE fallback │ │ │ fields.attachments = │ │ │ │ extract_attachment_links(sel) │ │ │ │ fields.content_text = │ │ │ │ html_to_text(fields.content_html) │ │ │ └────────────────┬───────────────────────┘ │ │ ▼ │ │ ┌────── storage.writer ──────────────────┐ │ │ │ write raw_html///YYYY/MM/ │ ─ filesystem │ │ │ write articles_text/... │ │ │ │ for each attachment: │ │ │ │ httpx stream download │ │ │ │ compute file_hash (sha256) │ │ │ │ write attachments/... 
│ │ │ │ insert into articles (with url_hash, │ ─ PostgreSQL │ │ │ content_simhash, status='ready', │ │ │ │ fetch_strategy='playwright') │ │ │ │ insert into attachments (file_hash) │ │ │ │ insert into crawl_logs │ │ │ └────────────────────────────────────────┘ │ └───────────────────────────────────────────────────────────────────────┘ ``` **Decision point 1:** ctct Cookie 挑战自动通过（真实 Chromium 运行 JS） → 正常拿到 200 + 正文。若仍返回 412 或 title 包含"请稍候" → 记 `crawl_logs.success=false`, `error_msg='ctct_challenge_unresolved'`。 **Decision point 2:** XPath 抽取 `content_html` 为空或长度 < 100 字 → 触发 GNE 兜底 → 若 GNE 也 < 50 字则 `status='failed'`。 ### Recommended Project Structure ``` website/ # repo root ├── pyproject.toml # uv-managed ├── .env.example # DB_URL, DATA_DIR, USER_AGENT ├── docker-compose.yml # postgres:16-alpine ├── alembic.ini ├── alembic/ │ ├── env.py │ └── versions/ │ └── 0001_initial_schema.py # sites/columns/articles/attachments/crawl_logs ├── data/ # gitignored; raw_html/ articles_text/ attachments/ ├── govcrawler/ │ ├── __init__.py │ ├── __main__.py # python -m govcrawler → cli.main │ ├── settings.py # pydantic Settings (DB_URL / DATA_DIR / USER_AGENT) │ ├── cli.py # argparse: fetch │ ├── models.py # SQLAlchemy models (Site, Column, Article, Attachment, CrawlLog) │ ├── db.py # engine, Session factory │ ├── fetcher/ │ │ ├── __init__.py │ │ ├── base.py # FetchResult dataclass │ │ └── playwright_fetcher.py # patchright-based fetch │ ├── parser/ │ │ ├── __init__.py │ │ ├── list_parser.py # parsel-based list page │ │ ├── detail_parser.py # parsel + GNE fallback │ │ └── html_to_text.py # clean content_html → content_text │ ├── storage/ │ │ ├── __init__.py │ │ ├── paths.py # build_path(site, column, year, month, key, kind) │ │ ├── files.py # write raw_html / text / attachment │ │ └── db_writer.py # insert articles / attachments / crawl_logs │ ├── dedup.py # url_normalize, url_hash, content_simhash │ └── sites/ │ └── gdqy.py # gdqy 站点硬编码 selectors（本阶段仅此一个） └── tests/ ├── conftest.py 
    ├── test_parser_fallback.py       # GNE fallback drill (primary XPath broken)
    ├── test_dedup.py                 # url_hash / simhash unit tests
    ├── test_paths.py                 # build_path cross-platform
    └── test_smoke_gdqy.py            # real fetch of post_2136593 (can be @pytest.mark.network)
```

### Pattern 1: Fresh browser context vs. reuse

**What:** This phase fetches a single article → each call opens a new browser + context + page under `async with async_playwright() as p:` → everything exits when the call ends. No browser manager or pool is introduced.

**When to use:** Volume < 10 pages; debugging; Phase 1 PoC.

**Why not pool now:** The cookie pool and concurrency control only arrive in Phase 2; abstracting early would just force a refactor in Phase 2.

### Pattern 2: Layered parser (field extraction + fallback)

**What:** `detail_parser.parse(html, cfg) -> DetailFields`; first extract the 5 fields with `parsel` using `cfg.xpath`; if `content_html` is empty or shorter than N characters, call `gne_extract(html)` and merge the results.

**When to use:** Every detail page.

**Why:** Aligns with REQ PARSE-03; tests can use a fixed HTML fixture that simulates a "broken selector".

### Anti-Patterns to Avoid

- **Using asyncpg + async SQLAlchemy only because "Playwright is async anyway".** In the PoC, DB writes are synchronous one-offs; an async chain complicates simple code. Run Playwright inside its own async block, then hand off to a synchronous SQLAlchemy Session (or use `sync_playwright` throughout; either works).
- **Matching only the suffix with `a[href$=".pdf"]`.** Attachment links on government sites often carry parameters, e.g. `/attach/xxx?id=123` → attach both selectors `a[href*=".pdf"], a[href$=".pdf"]`, or confirm the type from `Content-Type` at parse time.
- **Using `wait_until="networkidle"` without a timeout.** Government sites often embed CNZZ/51.la trackers that long-poll → the page never goes idle → after the 30s timeout, fall back to `wait_until="load"` or `domcontentloaded` and retry.
- **Downloading attachments with a plain `requests.get(url)`.** Attachments can be tens of MB and must be streamed: `httpx.stream("GET", url)` + `iter_bytes()`.
- **Using the last URL segment as the attachment file name.** Government sites often send `Content-Disposition: attachment; filename*=UTF-8''%E5%85%AC%E5%91%8A.pdf`, an RFC 5987 encoded Chinese file name; parse it with `email.message.Message` or `werkzeug.http.parse_options_header`.

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Generic news body extraction | Hand-written div area / text density algorithm | `GNE` | Mature algorithm, >90% accuracy on Chinese news |
| URL normalization | Your own strip fragment / sort params | `url-normalize` package, or `urllib.parse` + fixed rules + `tldextract` for the host | Many edge cases (percent-encoding, case, trailing slash) |
| Anti-fingerprint patches | Manually running `Object.defineProperty(navigator, 'webdriver', ...)` | `patchright` or `playwright-stealth` | 20+ patch points, not maintainable by hand |
| Chinese tokenization for SimHash | Regex character splitting | `jieba` (MIT) tokenization, or plain character 2-grams | 2-grams are enough for SimHash on Chinese; tokenization is optional |
| Content-Disposition file name parsing | Hand-written regex | `email.message.Message` (stdlib) or `werkzeug.http.parse_options_header` | RFC 5987 / RFC 6266 have many rules |
| Migration scripts | Hand-written DDL + version table | `alembic` | Downgrades, autogeneration, version graph |
| HTML → plain text | `re.sub(r'<[^>]+>', '', ...)` | `lxml.html.HtmlElement.text_content()` + your own paragraph breaks (see Code Examples) | Regex misses entities, script, style |

**Key insight:** Every problem in this phase that "looks like a 5-line job" (URL normalization, file name parsing, HTML tag stripping, anti-detection patches) is a deep pit. **Better to install one more 30KB library than to write it yourself.**

## Runtime State Inventory

This phase is a **greenfield** project with no rename/refactor/migration. Nothing to fill in.

## Common Pitfalls

### Pitfall 1: Detecting an expired ctct shield cookie

**What goes wrong:** The browser passes the challenge the first time, but once the resulting `SECTOKEN` cookie expires, later visits again return HTTP 412 + title="请稍候…" + HTML containing `ctct-slider-canvas`. Code that only looks at the HTTP status (200) will parse the challenge page as a normal page and extract no title.

**Why:** The ctct shield cookie TTL is roughly 2-6 hours; after expiry the returned status may be 200 (a "soft" challenge) rather than 412.

**How to avoid:** A check function `is_challenge_page(html, status)`:
- status == 412 → challenge
- html contains `ctct-slider-canvas` or `请稍候` → challenge
- the primary XPath title selector returns nothing and GNE extracts < 100 characters of body → suspected challenge

**Warning signs:** `articles.status='failed'` and `content_text` consists entirely of "请稍候正在验证您的请求" ("please wait, verifying your request").

### Pitfall 2: `wait_until="networkidle"` never goes idle

**What goes wrong:** Government sites often integrate CNZZ, 51.la, Baidu Tongji and Tencent trackers → long-polling or periodic heartbeats → networkidle is never reached → 30s timeout.

**Why:** `networkidle` requires 500ms without any network request, which trackers never allow.

**How to avoid:**
1. First attempt: `wait_until="domcontentloaded"` + an explicit `wait_for_selector("div.article-content, div.content", timeout=10000)` to wait for the body.
2. Fallback: if the body selector has not appeared after `domcontentloaded`, retry with `wait_until="networkidle"` and relax the timeout to 45s.

**Warning signs:** Every fetch times out at 30s even though `page.content()` already contains the body.

### Pitfall 3: Inconsistent URL normalization makes url_hash drift

**What goes wrong:** The same article is seen as `https://www.gdqy.gov.cn/gdqy/newxxgk/fgwj/szfwj/content/post_2136593.html` from the list page, while another entry point may yield `http://www.gdqy.gov.cn/.../post_2136593.html` (different scheme or trailing slash). The sha256 values differ and the article is stored twice.

**How to avoid:** Fix the normalization rules:

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url(u: str) -> str:
    p = urlparse(u)
    scheme = "https"                      # force https
    host = p.netloc.lower()
    path = p.path.rstrip("/") or "/"      # drop trailing slash but keep the root path
    # drop the fragment; keep the query but sort it by key
    q = urlencode(sorted(parse_qsl(p.query)))
    return urlunparse((scheme, host, path, "", q, ""))
```

**Warning signs:** Multiple rows for the same key → check with `SELECT url, url_hash FROM articles WHERE url LIKE '%post_2136593%'`.

### Pitfall 4: Windows backslash paths

**What goes wrong:** On Windows, `raw_html_path` is written to the DB as `data\govcrawler\raw_html\gdqy\...`, and reads fail after deploying to Linux.

**How to avoid:** **Always build the value stored in the DB with `pathlib.PurePosixPath`**; convert to an OS-native `pathlib.Path` only when writing to local disk. The DB stores POSIX form only.

**Warning signs:** `raw_html_path` in the DB contains backslashes.

### Pitfall 5: Chinese attachment names in `Content-Disposition`

**What goes wrong:** Government sites send `Content-Disposition: attachment; filename="\xe5\x85\xac\xe5\x91\x8a.pdf"` or the RFC 5987 form `filename*=UTF-8''%E5%85%AC%E5%91%8A.pdf`; `os.path.basename(url)` does not give the real name.

**How to avoid:**

```python
from email.message import Message


def parse_disposition(header: str) -> str | None:
    m = Message()
    m["content-disposition"] = header
    # get_filename() reads the Content-Disposition `filename` parameter and
    # decodes an RFC 5987 `filename*` value into a plain str.
    return m.get_filename()
```

If that yields nothing, fall back to the last URL segment. Clean the result into a `safe_filename` with `python-slugify` (keep Chinese, replace `\/?:*"<>|`).

**Warning signs:** Attachment file names are `xxxxx.pdf` or consist entirely of percent-encoding.

### Pitfall 6: GNE's implicit dependency on the lxml version

**What goes wrong:** GNE 0.1.3 pins lxml very loosely, but small API changes in newer lxml (5.x) may break some xpath calls.

**How to avoid:** Pin `lxml<6` and run a smoke test against a real gdqy article. If it fails, switch to `trafilatura` as a replacement.

**Warning signs:** `AttributeError: 'HtmlElement' object has no attribute 'xxx'`.

### Pitfall 7: Chromium missing on the first Playwright run

**What goes wrong:** CI or a colleague's machine never ran `patchright install chromium`, and the fetch fails with `Executable doesn't exist at /.../chromium`.

**How to avoid:** Spell out the three setup steps in `README.md`: `uv pip install -r requirements.txt` → `patchright install chromium` → `docker compose up -d db`. Make `patchright install --with-deps chromium` the first CI step.

## Code Examples

### patchright fetcher startup (stealth by default)

```python
# govcrawler/fetcher/playwright_fetcher.py
# Source: patchright README https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python
import time
from dataclasses import dataclass

from patchright.async_api import async_playwright


@dataclass
class FetchResult:
    url: str
    final_url: str
    status: int
    html: str
    fetched_at: float


USER_AGENT = "GovCrawlerBot/1.0 (contact: xxx@example.com)"


async def fetch(url: str, *, timeout_ms: int = 30000) -> FetchResult:
    async with async_playwright() as p:
        # patchright's Chromium is already patched, so --disable-blink-features is
        # not strictly required; the flag is kept as a small extra against the ctct
        # shield and does not interfere with patchright's patches.
        browser = await p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = await browser.new_context(
            user_agent=USER_AGENT,
            locale="zh-CN",
            timezone_id="Asia/Shanghai",
            viewport={"width": 1440, "height": 900},
        )
        page = await context.new_page()
        try:
            response = await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            # Wait up to 5s for the article body; a timeout is not an error.
            # `body` is left out of the selector list because it always matches
            # and would turn the wait into a no-op.
            try:
                await page.wait_for_selector("div.article-content, div.content", timeout=5000)
            except Exception:
                pass
            html = await page.content()
            final = page.url
            status = response.status if response else 0
            return FetchResult(url=url, final_url=final, status=status,
                               html=html, fetched_at=time.time())
        finally:
            await context.close()
            await browser.close()
```

### Challenge page detection

```python
# govcrawler/fetcher/playwright_fetcher.py
# "请稍候…" is already covered by the "请稍候" substring check.
CHALLENGE_MARKERS = ("ctct-slider-canvas", "请稍候")


def is_challenge_page(fr: FetchResult) -> bool:
    if fr.status == 412:
        return True
    return any(m in fr.html for m in CHALLENGE_MARKERS)
```

### parsel list extraction

```python
# govcrawler/parser/list_parser.py
# Source: parsel docs https://parsel.readthedocs.io/
from dataclasses import dataclass
from urllib.parse import urljoin

from parsel import Selector


@dataclass
class ListItem:
    url: str
    title: str
    publish_time: str  # keep raw string; parse later


def parse_list(html: str, base_url: str) -> list[ListItem]:
    sel = Selector(text=html)
    items = []
    # gdqy site list selectors (hard-coded in this phase; Phase 2 → YAML)
    for row in sel.css("ul.list_news li"):
        href = row.css("a::attr(href)").get()
        title = row.css("a::text").get(default="").strip()
        date = row.css("span.date::text").get(default="").strip()
        if not href:
            continue
        items.append(ListItem(
            url=urljoin(base_url, href),
            title=title,
            publish_time=date,
        ))
    return items
```

### parsel detail + GNE fallback

```python
# govcrawler/parser/detail_parser.py
from dataclasses import dataclass
from urllib.parse import urljoin

from gne import GeneralNewsExtractor
from parsel import Selector

from govcrawler.parser.html_to_text import html_to_text

GNE = GeneralNewsExtractor()


@dataclass
class DetailFields:
    title: str
    publish_time: str
    source: str
    content_html: str
    content_text: str
    attachment_urls: list[str]
    used_fallback: bool


ATTACH_CSS = ", ".join(
    f"a[href$='.{ext}'], a[href*='.{ext}?']"
    for ext in ("pdf", "doc", "docx", "xls", "xlsx", "zip")
)


def parse_detail(html: str, base_url: str) -> DetailFields:
    sel = Selector(text=html)
    title = sel.css("h1.article-title::text").get(default="").strip()
    pub = sel.css("span.time::text").get(default="").strip()
    source = sel.css("span.source::text").get(default="").strip()
    content_el = sel.css("div.article-content")
    content_html = content_el.get() or ""
    # Query relative to the content node: a prefix or `::attr(href)` suffix on a
    # comma-separated selector list only applies to its first or last alternative.
    attachments = [a.attrib.get("href") for a in content_el.css(ATTACH_CSS)]
    used_fallback = False
    if len(content_html) < 200 or not title:
        # GNE fallback: it takes the full HTML plus the page URL
        result = GNE.extract(html, host=base_url)
        content_html = result.get("content", content_html)
        if not title:
            title = result.get("title", "")
        if not pub:
            pub = result.get("publish_time", "")
        used_fallback = True
    content_text = html_to_text(content_html)
    # Make attachment URLs absolute
    attachments = [urljoin(base_url, a) for a in attachments if a]
    return DetailFields(title, pub, source, content_html, content_text, attachments, used_fallback)
```

### HTML → plain text (paragraphs preserved)

```python
# govcrawler/parser/html_to_text.py
from lxml import html as lxml_html

BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


def html_to_text(html_str: str) -> str:
    if not html_str:
        return ""
    doc = lxml_html.fragment_fromstring(html_str, create_parent="div")
    # Insert a newline after every block-level tag
    for el in doc.iter():
        if el.tag in BLOCK_TAGS:
            el.tail = "\n" + (el.tail or "")
    text = doc.text_content()
    # Normalize line breaks
    lines = [ln.strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)
```

### SimHash computation (Chinese 2-grams)

```python
# govcrawler/dedup.py
# Source: simhash package https://github.com/1e0ng/simhash
import hashlib
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

from simhash import Simhash


def content_simhash(text: str) -> str:
    """Return 16 hex characters (64-bit)."""
    if not text:
        return "0" * 16
    # Chinese 2-grams: more stable than tokenization and no jieba dependency
    features = [text[i:i+2] for i in range(len(text) - 1)]
    sh = Simhash(features, f=64)
    return f"{sh.value:016x}"


def url_hash(url: str) -> str:
    return hashlib.sha256(normalize_url(url).encode()).hexdigest()


def normalize_url(u: str) -> str:
    p = urlparse(u.strip())
    scheme = "https"
    host = p.netloc.lower()
    path = p.path.rstrip("/") or "/"
    q = urlencode(sorted(parse_qsl(p.query, keep_blank_values=True)))
    return urlunparse((scheme, host, path, "", q, ""))
```

### Streaming attachment download + file name parsing

```python
# govcrawler/storage/files.py
import hashlib
from email.message import Message
from pathlib import Path

import httpx
from slugify import slugify  # python-slugify, allows Chinese via `allow_unicode=True`

UA = "GovCrawlerBot/1.0 (contact: xxx@example.com)"


def filename_from_disposition(header: str | None, fallback: str) -> str:
    if header:
        m = Message()
        m["content-disposition"] = header
        # get_filename() reads the Content-Disposition `filename` parameter and
        # decodes RFC 5987 values such as UTF-8''%E5%85%AC%E5%91%8A.pdf.
        name = m.get_filename()
        if name:
            return name
    return fallback


def safe_filename(name: str) -> str:
    # Keep Chinese, replace illegal characters
    return slugify(name, allow_unicode=True, separator="_",
                   regex_pattern=r'[\\/:*?"<>|\x00-\x1f]')


def download_attachment(url: str, dest_dir: Path, article_key: str) -> tuple[Path, int, str, str]:
    """Return (local_path, size_bytes, file_hash, file_name)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    hasher = hashlib.sha256()
    size = 0
    tmp = dest_dir / f".{article_key}.part"
    with httpx.stream("GET", url, headers={"User-Agent": UA},
                      follow_redirects=True, timeout=60) as r:
        r.raise_for_status()
        disp = r.headers.get("content-disposition")
        url_base = url.rsplit("/", 1)[-1].split("?", 1)[0] or "attachment.bin"
        raw_name = filename_from_disposition(disp, url_base)
        final_name = f"{article_key}_{safe_filename(raw_name)}"
        final_path = dest_dir / final_name
        with open(tmp, "wb") as f:
            for chunk in r.iter_bytes(chunk_size=64 * 1024):
                f.write(chunk)
                hasher.update(chunk)
                size += len(chunk)
    tmp.rename(final_path)
    return final_path, size, hasher.hexdigest(), raw_name
```

### Storage path construction (POSIX-stable)

```python
# govcrawler/storage/paths.py
from datetime import datetime
from pathlib import Path, PurePosixPath


def build_relpath(site: str, column: str, when: datetime, article_key: str, kind: str) -> PurePosixPath:
    """kind in {'raw_html', 'articles_text', 'attachments'}; article suffix added by caller."""
    return PurePosixPath(kind, site, column, f"{when:%Y}", f"{when:%m}", article_key)


def to_os_path(root: Path, rel: PurePosixPath) -> Path:
    return root.joinpath(*rel.parts)
```

### Alembic initial migration (field excerpt)

```python
# alembic/versions/0001_initial_schema.py
# Source: alembic docs https://alembic.sqlalchemy.org/
from alembic import op
import sqlalchemy as sa

revision = "0001_initial"
down_revision = None


def upgrade():
    op.create_table(
        "sites",
        sa.Column("id", sa.BigInteger, primary_key=True),
        sa.Column("site_id", sa.String(64), unique=True, nullable=False),
        sa.Column("name", sa.Text),
        sa.Column("base_url", sa.Text),
        sa.Column("config_json", sa.JSON),
        sa.Column("enabled", sa.Boolean, server_default=sa.text("true")),
        sa.Column("created_at", sa.DateTime, server_default=sa.func.now()),
    )
    op.create_table(
        "articles",
        sa.Column("id", sa.BigInteger, primary_key=True),
        sa.Column("site_id", sa.String(64), nullable=False),
        sa.Column("column_id", sa.String(64), nullable=False),
        sa.Column("category", sa.String(64)),
        sa.Column("url", sa.Text, nullable=False),
        sa.Column("url_hash", sa.CHAR(64), nullable=False, unique=True),
        sa.Column("content_simhash", sa.CHAR(16)),
        sa.Column("title", sa.Text),
        sa.Column("publish_time", sa.DateTime),
        sa.Column("source", sa.String(128)),
        sa.Column("content_text", sa.Text),
        sa.Column("raw_html_path", sa.Text),
        sa.Column("text_path", sa.Text),
        sa.Column("has_attachment", sa.Boolean, server_default=sa.text("false")),
        sa.Column("status", sa.String(16), server_default="raw"),  # raw / ready / failed
        sa.Column("fetch_strategy", sa.String(32)),
        sa.Column("fetched_at", sa.DateTime, server_default=sa.func.now()),
        sa.Column("exported_to_rag_at", sa.DateTime),
    )
    op.create_index("ix_articles_site_col_pub", "articles",
                    ["site_id", "column_id", sa.text("publish_time DESC")])
    op.create_index("ix_articles_exported", "articles", ["exported_to_rag_at"])
    # attachments / columns / crawl_logs similar...


def downgrade():
    op.drop_table("articles")
    op.drop_table("sites")
```

### pytest: broken selector → GNE fallback drill

```python
# tests/test_parser_fallback.py
from pathlib import Path

from govcrawler.parser.detail_parser import parse_detail


def test_gne_fallback_when_xpath_fails():
    html = Path("tests/fixtures/gdqy_post_2136593.html").read_text(encoding="utf-8")
    # Option A: rename the classes in the fixture so the hard-coded selectors miss
    # Option B (cleaner): the detail_parser should accept cfg; make cfg with bad selectors
    broken = html.replace("article-content", "renamed-content").replace("article-title", "renamed-title")
    result = parse_detail(broken, base_url="https://www.gdqy.gov.cn/")
    assert result.used_fallback is True
    assert len(result.content_text) > 50
    assert "2026年第四届全国轻型飞机锦标赛" in result.content_text
```

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| `playwright-stealth` 1.x (original author AtuboDad no longer updates it) | `playwright-stealth` 2.x (new maintainer Mattwmaster58) or `patchright` | 2025 | The v1 API changed; v2 is actively updated as of 2026-04; patchright is a stronger alternative |
| `selenium` + undetected-chromedriver | `patchright` / `playwright-stealth` | 2023+ | Playwright wins across the board: API, performance and maintenance |
| Scrapy as the crawler framework | Plain Playwright + a small in-house framework | 2024+ | Scrapy's Playwright integration is clunky; at small scale (<50 sites) a self-written layer is thinner |
| psycopg2 | psycopg3 | 2023+ | v2 is in maintenance mode; v3 is the future |
| `readability-lxml` / `html2text` | `trafilatura` / `GNE` | 2021+ | Newer generation based on content density + machine-learning features; clearly better quality |

**Deprecated/outdated:**
- `playwright-stealth` 1.0.6 (old version; the original repo AtuboDad/playwright_stealth is archived) → move to v2 (Mattwmaster58 fork).
- Redis 7.4+ → Valkey (already locked in PROJECT.md).

## Assumptions Log

| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | patchright is more stable against the ctct shield than playwright-stealth | Alternatives | If patchright is the one that gets detected, fall back to stealth v2. Low impact: the two APIs are compatible |
| A2 | GNE extracts Chinese government sites better than trafilatura | Alternatives | If GNE quality is poor, switch to trafilatura (updated 2024-12, Apache-2.0); small code change |
| A3 | GNE 0.1.3 (released 2019) is still compatible with lxml 5.x | Common Pitfalls #6 | Must be verified in the smoke test; if incompatible, pin `lxml<5` or switch to trafilatura |
| A4 | psycopg3's LGPL-3 is acceptable | Standard Stack ⚠️ | **Needs a user decision** (since resolved, see Open Questions #1): under a strict MIT/BSD/Apache whitelist, switch to `asyncpg` + async SQLAlchemy |
| A5 | A single fetch does not need a browser pool | Pattern 1 | One article in this phase; cannot be wrong |
| A6 | `wait_until="domcontentloaded"` + an explicit selector wait is more stable than `networkidle` | Common Pitfalls #2 | If some pages only finish their JS challenge after DOMContentLoaded, change to networkidle or sleep(2). Adjust as needed |

## Open Questions (RESOLVED)

> All three open questions were resolved during `/gsd-discuss-phase` (see CONTEXT.md `additional_locked_decisions`) and Plan 03 design. Kept here for audit trail.

1. **Is psycopg3's LGPL-3 acceptable?**
   **RESOLVED:** In the plan-phase interaction the user confirmed that LGPL-3 under dynamic linking is acceptable (a commercial Python `import` does not trigger the copyleft clause; the "not admitted" wording in PROJECT.md §5.2 principle 3 targets static linking / modified-source scenarios). This phase keeps `psycopg[binary]>=3.2`; no switch to asyncpg is needed.
   - What we knew: PROJECT.md §5.2 principle 3 reads "LGPL (dynamic linking only) not admitted", but psycopg3 is the de facto standard in the Python ecosystem.
   - What was unclear: the user's tolerance for LGPL.
   - Resolution evidence: CONTEXT.md `additional_locked_decisions` entry `DEP-LGPL`.

2. **patchright consistency across macOS / Linux / Docker**
   **RESOLVED:** The README spells out the first-install step `patchright install chromium`; the Plan 01 bootstrap task already includes the command, and the smoke-test acceptance_criteria of Plan 03 Task 2 also reminds the executor to run it first. CI can add `patchright install --with-deps chromium`.
   - What we knew: patchright downloads a patched Chromium binary, with builds for all three platforms.
   - What was unclear: first-install time and stability on macOS arm64.
   - Resolution evidence: Plan 01 bootstrap task + step 2 of the Plan 03 Task 2 smoke-test acceptance_criteria.

3. **Where to get the second article for attachments**
   **RESOLVED:** Plan 03 Task 2 provides two paths, Path A / Path B:
   - **Path A (real gdqy article)**: on execution day the executor picks an article with a PDF attachment from the list page `https://www.gdqy.gov.cn/gdqy/newxxgk/fgwj/szfwj/`, sets `ATTACHMENT_TEST_ARTICLE_URL / KEY`, and runs a real end-to-end fetch.
   - **Path B (mocked-only)**: if no suitable article is found, coverage comes only from `tests/test_pipeline_mocked.py::test_attachment_downloaded_and_recorded`, but that test must really write a file under `tmp_path/attachments/` and assert `os.path.exists` + a correct `file_hash` (IO must not be fully mocked away).
   - **SUMMARY requirement**: the executor must record in `01-03-SUMMARY.md` whether A or B was chosen and attach the evidence log (stdout of the real fetch, or the list of temporary files produced by the mocked test).
   - Resolution evidence: last paragraph of the `01-03-PLAN.md` Task 2 acceptance_criteria, "附件链路验证（二选一）" ("attachment pipeline verification, choose one of two").

## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| Python 3.11+ | All | Check on the dev machine | - | Install one with uv |
| Docker (for local PG) | PG dev environment | Needs checking | - | An existing PG instance also works |
| Chromium (via patchright) | Fetcher | First run requires `patchright install chromium` | - | None |
| PostgreSQL 16 | Storage | Started via docker-compose | 16-alpine | - |
| Network access to gdqy.gov.cn | Smoke test | Assumed available | - | Use cached fixture HTML |

**Missing dependencies with no fallback:** None (every dependency can be obtained via `uv pip install` + `docker compose up`).

**Missing dependencies with fallback:** None.

**Action item for plan:** Define an explicit "setup / bootstrap" node in the first wave of tasks: Python version check + `uv sync` + `patchright install chromium` + `docker compose up -d db` + `alembic upgrade head`.

## Validation Architecture

> `workflow.nyquist_validation = false` (see `.planning/config.json`). **Skip this section.**

## Security Domain

This project is an **outbound crawler** (we are the client), not an exposed server. Only minimal controls on our own code apply:

| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | no | Fetching without login (explicit in COMP-03) |
| V3 Session Management | no | No server-side sessions |
| V4 Access Control | no | No external interface (revisit when Phase 3 introduces REST) |
| V5 Input Validation | yes (limited) | Attachment file name cleaning (python-slugify, against path traversal), URL normalization, SQL through SQLAlchemy parameterization |
| V6 Cryptography | no | Only SHA-256 hashing, no encryption |

### Phase-specific threat patterns

| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| Path traversal via `../etc/passwd` in an attachment's Content-Disposition | Tampering | `python-slugify` + a `final_path.resolve().is_relative_to(dest_dir)` check |
| Unbounded attachment response → disk exhaustion | DoS on self | `httpx.stream` + a size cap (config `MAX_ATTACHMENT_MB=100`) |
| Target URL contains `file://` / `javascript:` | Tampering | `normalize_url` enforces a scheme whitelist of `{http, https}` |
| SQL injection via URL / title | Tampering | SQLAlchemy Core / ORM parameterization (never build SQL with f-strings) |
| Cookie / authorization printed in logs | Info Disclosure | No cookie pool in this phase; `crawl_logs` records only URL/status/duration |

## Sources

### Primary (HIGH confidence)
- PyPI JSON API for each package: versions, upload dates, licenses verified 2026-04-22 [VERIFIED: pypi.org/pypi/{package}/json]
- Playwright Python docs: https://playwright.dev/python/
- Alembic docs: https://alembic.sqlalchemy.org/
- SQLAlchemy 2.0 docs: https://docs.sqlalchemy.org/en/20/
- parsel docs: https://parsel.readthedocs.io/
- psycopg v3 docs: https://psycopg.org/psycopg3/docs/
- simhash repo: https://github.com/1e0ng/simhash

### Secondary (MEDIUM confidence)
- patchright README: https://github.com/Kaliiiiiiiiii-Vinyzu/patchright-python (claims effectiveness against Cloudflare/DataDome/Imperva; the ctct shield is not explicitly tested)
- playwright-stealth v2 repo: https://github.com/Mattwmaster58/playwright_stealth (still active in 2026 under the new maintainer)
- GNE repo: https://github.com/GeneralNewsExtractor/GeneralNewsExtractor (no new PyPI release, but occasional GitHub commits)
- trafilatura docs: https://trafilatura.readthedocs.io/ (the replacement if GNE causes problems)

### Tertiary (LOW confidence / needs validation)
- "patchright > playwright-stealth for the ctct shield": [ASSUMED] inferred from community feedback, not compared in this session. The planner may try patchright first in the smoke test and fall back to stealth on failure.
- "GNE beats trafilatura on Chinese government sites": [ASSUMED] based on long-standing community opinion; no comparative evaluation was done here.

## Project Constraints (from CLAUDE.md)

**No CLAUDE.md was found** in the project root. The applicable project-level constraints have been distilled into this research from `PROJECT.md` and `政务网站采集系统-设计文档.md`:

- **License whitelist**: MIT / BSD / Apache-2.0 / PostgreSQL License / PSF (psycopg3's LGPL-3 was accepted by the user, see Open Questions #1)
- **Banned**: AGPL / SSPL / BSL / Elastic License
- **Deployment**: single-host Docker Compose
- **Crawl etiquette**: identifying UA, interval ≥5s, concurrency 1
- **Storage**: local file system + PG; no MinIO / Redis 7.4+ / ES

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH. All versions and licenses verified live against the PyPI JSON API (2026-04-22)
- Architecture: HIGH. Single-package, single-process PoC, a common industry setup
- Pitfalls: MEDIUM. Nos. 1-4 are based on known ctct shield behavior and common government-site problems; nos. 5-7 on common Python ecosystem traps
- ctct shield bypass success rate: MEDIUM. There is measured evidence that "real Chromium gets through", but the specific success-rate difference between patchright and stealth is unverified

**Research date:** 2026-04-22
**Valid until:** 2026-05-22 (patchright/playwright release quickly; stealth v2 is iterating fast)

---

*End of RESEARCH.md: ready for planner.*
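
The two input-validation mitigations named in the Security Domain threat table (scheme whitelist and the `is_relative_to` containment check) could look roughly like the sketch below. It uses only the standard library; the names `ALLOWED_SCHEMES`, `is_allowed_url` and `resolve_inside` are hypothetical and not part of the planned codebase, and the sketch assumes Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical constant: the {http, https} whitelist from the threat table.
ALLOWED_SCHEMES = {"http", "https"}


def is_allowed_url(url: str) -> bool:
    """Return True only for absolute http(s) URLs that carry a host."""
    parts = urlparse(url.strip())
    return parts.scheme.lower() in ALLOWED_SCHEMES and bool(parts.netloc)


def resolve_inside(dest_dir: Path, file_name: str) -> Path:
    """Join file_name onto dest_dir and reject results that escape dest_dir."""
    root = dest_dir.resolve()
    candidate = (root / file_name).resolve()
    # A name such as "../etc/passwd" or "/etc/passwd" resolves outside root.
    if not candidate.is_relative_to(root):
        raise ValueError(f"unsafe attachment name: {file_name!r}")
    return candidate
```

Checking the scheme before `normalize_url` runs matters here because the normalizer shown earlier forces `https` and would otherwise hide a `file://` or `javascript:` input. The containment check is meant as a second line of defense after `safe_filename`, not a replacement for it.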