"""
Intelligent text chunker for Chinese government documents.

Splits text into semantically meaningful chunks based on paragraph boundaries,
with a configurable overlap window between adjacent chunks.
"""

from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Any

from app.utils.logger import get_logger

logger = get_logger(__name__)


@dataclass
class Chunk:
    """文本分块，包含位置信息、结构元数据和文档级元数据，可直接序列化为 ES 文档。

    A text chunk with positional and metadata information."""

    chunk_id: str
    content_hash: str
    doc_ids: list[str]
    chunk_index: int
    content: str
    char_start: int
    char_end: int
    token_estimate: int
    metadata: dict[str, Any] = field(default_factory=dict)
    # ── Structure metadata (multi-format support) ──
    page_number: int | None = None
    page_numbers: list[int] = field(default_factory=list)
    heading_hierarchy: list[str] = field(default_factory=list)
    element_type: str = ""

    def to_dict(self) -> dict[str, Any]:
        """Convert to dict suitable for ES indexing.

        metadata 先展开，核心字段后写入，确保 metadata 中的同名键不会覆盖核心字段。
        Metadata is spread first; core fields are written after to prevent override.
        """
        d: dict[str, Any] = {
            **self.metadata,
            "chunk_id": self.chunk_id,
            "content_hash": self.content_hash,
            "doc_ids": self.doc_ids,
            "chunk_index": self.chunk_index,
            "content": self.content,
        }
        if self.page_number is not None:
            d["page_number"] = self.page_number
        if self.page_numbers:
            d["page_numbers"] = self.page_numbers
        if self.heading_hierarchy:
            d["heading_hierarchy"] = self.heading_hierarchy
        if self.element_type:
            d["element_type"] = self.element_type
        return d


# ── Section-heading patterns for government documents, used to split paragraphs at headings ──

_SECTION_PATTERNS = [
    re.compile(r"^第[一二三四五六七八九十百]+[条章节编部分]"),  # 第一条、第一章
    re.compile(r"^[一二三四五六七八九十]+[、.]"),  # 一、 二、
    re.compile(r"^（[一二三四五六七八九十]+）"),  # （一）（二）
    re.compile(r"^\d+[、.\s]"),  # 1、 2.
    re.compile(r"^（\d+）"),  # （1）（2）
    re.compile(r"^附件[：:]?\s*\d*"),  # 附件：
    re.compile(r"^附[：:]"),  # 附：
]


def _is_section_heading(line: str) -> bool:
    """Check if a line looks like a section heading."""
    stripped = line.strip()
    return any(pat.match(stripped) for pat in _SECTION_PATTERNS)
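
# Illustrative checks for _is_section_heading (not exhaustive):
#   "第一条 总则"   -> True   (matches the 第X条 pattern)
#   "（三）工作要求" -> True   (matches the （X） pattern)
#   "各有关单位："   -> False  (no pattern matches)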


def _estimate_tokens(text: str) -> int:
    """Rough token count estimate for mixed Chinese/English text.

    Chinese: ~1.5 tokens per character.
    English/other: ~0.25 tokens per character (roughly 4 characters per token).
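
    Worked example (illustrative): 100 Chinese characters plus 40 non-Chinese
    characters estimate to int(100 * 1.5 + 40 * 0.25) = 160 tokens.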
    """
    chinese_chars = len(re.findall(r"[\u4e00-\u9fff]", text))
    non_chinese = len(text) - chinese_chars
    return int(chinese_chars * 1.5 + non_chinese * 0.25)


class DocumentChunker:
    """政务文档分块器，将全文切分为带重叠的语义块。

    Split government document text into overlapping chunks.

    分块策略:
    1. 按双换行或章节标题拆分段落
    2. 合并连续短段落直到目标 token 数
    3. 对超长段落按句子边界二次拆分
    4. 相邻块之间添加重叠上下文

    Strategy:
    1. Split into paragraphs (by double newline or section headings)
    2. Merge short consecutive paragraphs up to target chunk size
    3. Split oversized paragraphs at sentence boundaries
    4. Add overlap between adjacent chunks
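
    Example usage (a minimal sketch; text and md5_hex are assumed to be
    prepared by the caller):

        chunker = DocumentChunker(target_tokens=512, overlap_tokens=64)
        chunks = chunker.chunk_document(text, content_hash=md5_hex, doc_ids=["doc-1"])
        es_docs = [c.to_dict() for c in chunks]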
    """

    def __init__(
        self,
        *,
        target_tokens: int = 512,
        max_tokens: int = 768,
        overlap_tokens: int = 64,
        min_chunk_tokens: int = 50,
    ):
        self.target_tokens = target_tokens
        self.max_tokens = max_tokens
        self.overlap_tokens = overlap_tokens
        self.min_chunk_tokens = min_chunk_tokens

    def chunk_document(
        self,
        text: str,
        content_hash: str,
        doc_ids: list[str],
        metadata: dict[str, Any] | None = None,
    ) -> list[Chunk]:
        """Split document text into chunks with overlap.

        Args:
            text: Full document text.
            content_hash: MD5 hash of the source file (used for chunk_id).
            doc_ids: List of document IDs referencing this content.
            metadata: Document-level metadata to attach to each chunk.

        Returns:
            List of Chunk objects.
        """
        if not text.strip():
            logger.warning("empty_document", content_hash=content_hash[:12])
            return []

        metadata = metadata or {}

        # Step 1: Split into paragraphs
        paragraphs = self._split_paragraphs(text)

        # Step 2: Merge small paragraphs into chunks
        merged = self._merge_paragraphs(paragraphs)

        # Step 3: Split oversized chunks
        sized = []
        for segment in merged:
            tokens = _estimate_tokens(segment)
            if tokens > self.max_tokens:
                sized.extend(self._split_by_sentences(segment))
            else:
                sized.append(segment)

        # Step 4: Add overlap and create Chunk objects
        chunks = self._create_chunks_with_overlap(
            sized, content_hash, doc_ids, metadata, text,
        )

        # Filter out too-small chunks
        chunks = [c for c in chunks if c.token_estimate >= self.min_chunk_tokens]

        # Re-index
        for i, chunk in enumerate(chunks):
            chunk.chunk_index = i
            chunk.chunk_id = f"{content_hash}_chunk_{i:04d}"

        logger.info(
            "document_chunked",
            content_hash=content_hash[:12],
            total_chars=len(text),
            chunks=len(chunks),
            avg_tokens=sum(c.token_estimate for c in chunks) // max(len(chunks), 1),
        )

        return chunks

    def _split_paragraphs(self, text: str) -> list[str]:
        """Split text into paragraphs by double newlines and section headings."""
        # First split by double newlines
        raw_paragraphs = re.split(r"\n{2,}", text)

        result = []
        for para in raw_paragraphs:
            lines = para.strip().split("\n")
            current = []
            for line in lines:
                if _is_section_heading(line) and current:
                    result.append("\n".join(current))
                    current = [line]
                else:
                    current.append(line)
            if current:
                result.append("\n".join(current))

        return [p.strip() for p in result if p.strip()]

    def _merge_paragraphs(self, paragraphs: list[str]) -> list[str]:
        """Merge consecutive short paragraphs up to target_tokens."""
        merged = []
        current_parts: list[str] = []
        current_tokens = 0

        for para in paragraphs:
            para_tokens = _estimate_tokens(para)

            if current_tokens + para_tokens > self.target_tokens and current_parts:
                merged.append("\n\n".join(current_parts))
                current_parts = [para]
                current_tokens = para_tokens
            else:
                current_parts.append(para)
                current_tokens += para_tokens

        if current_parts:
            merged.append("\n\n".join(current_parts))

        return merged

    def _split_by_sentences(self, text: str) -> list[str]:
        """Split oversized text at sentence boundaries."""
        # Chinese sentence endings + common separators
        sentences = re.split(r"(?<=[。！？；\n])", text)
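        # The zero-width lookbehind keeps each delimiter attached to the sentence
        # that precedes it, so no characters are lost in the split.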

        result = []
        current = ""
        current_tokens = 0

        for sent in sentences:
            sent_tokens = _estimate_tokens(sent)
            if current_tokens + sent_tokens > self.target_tokens and current:
                result.append(current.strip())
                current = sent
                current_tokens = sent_tokens
            else:
                current += sent
                current_tokens += sent_tokens

        if current.strip():
            result.append(current.strip())

        return result

    def _create_chunks_with_overlap(
        self,
        segments: list[str],
        content_hash: str,
        doc_ids: list[str],
        metadata: dict[str, Any],
        original_text: str,
    ) -> list[Chunk]:
        """Create Chunk objects with overlapping context from adjacent segments."""
        chunks = []
        # Overlap char estimate: _estimate_tokens assumes ~1.5 tokens per Chinese
        # character, so chars ≈ tokens / 1.5; doubled as a safety margin.
        overlap_chars = int(self.overlap_tokens / 1.5) * 2
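        # Illustrative: with the default overlap_tokens=64 this comes to
        # int(64 / 1.5) * 2 = 84 characters of trailing context.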

        for i, segment in enumerate(segments):
            content = segment

            # Prepend tail of previous segment as overlap
            if i > 0 and overlap_chars > 0:
                prev = segments[i - 1]
                overlap_text = prev[-overlap_chars:] if len(prev) > overlap_chars else prev
                # Find last sentence boundary in overlap
                last_boundary = max(
                    overlap_text.rfind("。"),
                    overlap_text.rfind("；"),
                    overlap_text.rfind("\n"),
                    0,
                )
                if last_boundary > 0:
                    overlap_text = overlap_text[last_boundary + 1:]
                content = overlap_text.strip() + "\n" + content

            # Find position in original text
            search_key = segment[:50] if len(segment) >= 50 else segment
            char_start = original_text.find(search_key)
            char_end = char_start + len(segment) if char_start >= 0 else len(segment)

            chunk = Chunk(
                chunk_id=f"{content_hash}_chunk_{i:04d}",
                content_hash=content_hash,
                doc_ids=list(doc_ids),
                chunk_index=i,
                content=content,
                char_start=max(char_start, 0),
                char_end=char_end,
                token_estimate=_estimate_tokens(content),
                metadata=metadata.copy(),
            )
            chunks.append(chunk)

        return chunks


def chunks_from_docling(
    processed_chunks: list,
    content_hash: str,
    doc_ids: list[str],
    metadata: dict[str, Any] | None = None,
    file_type: str = "",
) -> list[Chunk]:
    """将 DoclingProcessor 输出的 ProcessedChunk 转换为标准 Chunk 对象，保留页码和标题层级等结构信息。

    Convert DoclingProcessor output into Chunk objects.

    Args:
        processed_chunks: List of ProcessedChunk from DoclingProcessor.
        content_hash: MD5 hash of the source file.
        doc_ids: List of document IDs referencing this content.
        metadata: Document-level metadata to attach to each chunk.
        file_type: Source file type (pdf, docx, etc.)

    Returns:
        List of Chunk objects ready for embedding and indexing.
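
    Example (illustrative; "processed" stands for whatever list of ProcessedChunk
    objects the upstream DoclingProcessor produced, and md5_hex for the file hash):

        chunks = chunks_from_docling(
            processed, content_hash=md5_hex, doc_ids=["doc-42"], file_type="pdf",
        )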
    """
    from app.core.docling_processor import ProcessedChunk

    metadata = metadata or {}
    if file_type:
        metadata["file_type"] = file_type

    chunks: list[Chunk] = []
    for pc in processed_chunks:
        # assert is stripped when Python runs with -O, so use an explicit type check
        if not isinstance(pc, ProcessedChunk):
            raise TypeError(f"Expected ProcessedChunk, got {type(pc).__name__}")
        chunk_meta = metadata.copy()

        chunk = Chunk(
            chunk_id=f"{content_hash}_chunk_{pc.chunk_index:04d}",
            content_hash=content_hash,
            doc_ids=list(doc_ids),
            chunk_index=pc.chunk_index,
            content=pc.content,
            char_start=0,
            char_end=len(pc.content),
            token_estimate=_estimate_tokens(pc.content),
            metadata=chunk_meta,
            page_number=pc.page_number,
            page_numbers=pc.page_numbers,
            heading_hierarchy=pc.heading_hierarchy,
            element_type=pc.element_type,
        )
        chunks.append(chunk)

    logger.info(
        "docling_chunks_created",
        content_hash=content_hash[:12],
        chunks=len(chunks),
        avg_tokens=sum(c.token_estimate for c in chunks) // max(len(chunks), 1),
    )

    return chunks
