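Document Ingestion Pipeline

The first stage of the RAG pipeline converts PDF documents into a searchable vector index. The NextPDF Enterprise ingestion pipeline consists of structure-aware parsing, intelligent chunking, GPU embedding, and dual-track indexing (vector + BM25).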


Endpoints

POST /v1/rag/ingest

Ingests a single PDF document. The system parses, chunks, embeds, and indexes it automatically.

Request

POST /v1/rag/ingest
Authorization: Bearer {jwt_token}
X-Tenant-ID: acme-corp-001
Content-Type: multipart/form-data

--boundary
Content-Disposition: form-data; name="document"; filename="annual-report-2024.pdf"
Content-Type: application/pdf

{pdf binary data}
--boundary
Content-Disposition: form-data; name="options"
Content-Type: application/json

{
  "document_id": "annual-report-2024",
  "title": "Annual Report 2024",
  "tags": ["financial", "2024", "annual"],
  "chunking_strategy": "structure_aware",
  "chunk_max_tokens": 512,
  "chunk_overlap_tokens": 64,
  "extract_tables": true,
  "extract_headers": true,
  "language_hint": "en"
}

Response (202 Accepted)

{
  "ingest_job_id": "job_01HX8K2N3P4Q5R6S7T8U9V0W",
  "document_id": "annual-report-2024",
  "status": "queued",
  "estimated_chunks": 247,
  "created_at": "2025-01-15T09:30:00Z",
  "status_url": "/v1/rag/ingest/jobs/job_01HX8K2N3P4Q5R6S7T8U9V0W"
}
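
The JSON carried in the "options" form part can be assembled and sanity-checked on the client before upload. The following is a minimal sketch, not part of the NextPDF client: `buildIngestOptions` is a hypothetical helper, the field names and strategy values are taken from this page, and the validation rules (overlap smaller than the maximum chunk size, defaults of 512 and 64 borrowed from the sample request) are assumptions rather than documented server behaviour.

```php
<?php
declare(strict_types=1);

/**
 * Build the JSON for the "options" part of POST /v1/rag/ingest.
 * Hypothetical helper: the checks below are assumed, not documented.
 */
function buildIngestOptions(
    string $documentId,
    string $strategy = 'structure_aware',
    int $chunkMaxTokens = 512,
    int $chunkOverlapTokens = 64,
    array $extra = []
): string {
    // Strategy names as listed in the chunking strategy table on this page.
    $strategies = ['structure_aware', 'fixed_token', 'sentence', 'paragraph', 'page'];
    if (!in_array($strategy, $strategies, true)) {
        throw new InvalidArgumentException("Unknown chunking strategy: {$strategy}");
    }
    if ($chunkMaxTokens < 1) {
        throw new InvalidArgumentException('chunk_max_tokens must be positive');
    }
    // Assumption: an overlap as large as the chunk itself makes no progress.
    if ($chunkOverlapTokens < 0 || $chunkOverlapTokens >= $chunkMaxTokens) {
        throw new InvalidArgumentException('chunk_overlap_tokens must be in [0, chunk_max_tokens)');
    }

    // Explicit arguments win over anything passed in $extra (title, tags, ...).
    $options = array_merge($extra, [
        'document_id' => $documentId,
        'chunking_strategy' => $strategy,
        'chunk_max_tokens' => $chunkMaxTokens,
        'chunk_overlap_tokens' => $chunkOverlapTokens,
    ]);

    return json_encode($options, JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES);
}
```

Calling `buildIngestOptions('annual-report-2024', extra: ['title' => 'Annual Report 2024'])` yields a JSON string that can be sent as the `options` part alongside the PDF bytes.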

POST /v1/rag/ingest-chunks

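Submits pre-chunked passages directly, bypassing NextPDF's automatic chunking (suited to cases where you already have custom chunking logic):

Request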

POST /v1/rag/ingest-chunks
Authorization: Bearer {jwt_token}
X-Tenant-ID: acme-corp-001
Content-Type: application/json

{
  "document_id": "contract-2025-001",
  "chunks": [
    {
      "chunk_id": "c001",
      "text": "This Service Agreement is entered into as of January 15, 2025...",
      "metadata": {
        "page_number": 1,
        "section": "Introduction",
        "document_type": "legal_contract",
        "headings": ["Service Agreement", "1. Definitions"]
      }
    },
    {
      "chunk_id": "c002",
      "text": "1.1 \"Service\" means the PDF generation and processing capabilities...",
      "metadata": {
        "page_number": 1,
        "section": "1. Definitions",
        "headings": ["1. Definitions"]
      }
    }
  ]
}
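
When the passages come from your own chunker, the request body can be generated rather than written by hand. This is a minimal sketch under stated assumptions: `buildChunkPayload` is a hypothetical helper, the `c001`-style identifiers simply mirror the sample above (the API may accept any unique string), and skipping empty passages is a client-side choice, not a documented requirement.

```php
<?php
declare(strict_types=1);

/**
 * Build the JSON body for POST /v1/rag/ingest-chunks.
 *
 * @param array<int, array{text?: string, metadata?: array}> $passages
 */
function buildChunkPayload(string $documentId, array $passages): string
{
    $chunks = [];
    foreach ($passages as $passage) {
        $text = trim((string) ($passage['text'] ?? ''));
        if ($text === '') {
            continue; // assumption: empty passages are not worth indexing
        }
        $chunks[] = [
            // Sequential ids in the style of the sample request (c001, c002, ...).
            'chunk_id' => sprintf('c%03d', count($chunks) + 1),
            'text' => $text,
            // Cast to object so that empty metadata encodes as {} rather than [].
            'metadata' => (object) ($passage['metadata'] ?? []),
        ];
    }
    if ($chunks === []) {
        throw new InvalidArgumentException('At least one non-empty passage is required');
    }

    return json_encode(
        ['document_id' => $documentId, 'chunks' => $chunks],
        JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE
    );
}
```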

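Structure-Aware Chunking

The structure_aware chunking strategy uses NextPDF's document structure parsing to split at semantic boundaries (headings, paragraphs, sections):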

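flowchart TD
    A[PDF document] --> B[Structure parsing]
    B --> C[Identify heading hierarchy H1/H2/H3]
    C --> D[Identify paragraph boundaries]
    D --> E[Identify tables / images]
    E --> F{Paragraph length > max_tokens?}
    F -->|Yes| G[Split at sentence boundaries]
    F -->|No| H[Keep paragraph intact]
    G --> I[Add overlap context]
    H --> I
    I --> J["Tag metadata (page number, heading path)"]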
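
The decision in the middle of the flowchart (keep short paragraphs whole, split long ones at sentence boundaries, then add overlap) can be illustrated with a short sketch. It is not NextPDF's implementation: the function names are hypothetical, tokens are approximated by whitespace-separated words, sentences are detected with a naive punctuation rule, and heading detection and metadata tagging are left out.

```php
<?php
declare(strict_types=1);

/** Rough token count: whitespace-separated words (a stand-in for a real tokenizer). */
function countTokens(string $text): int
{
    return count(preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY));
}

/** Pack whole sentences into segments of at most $maxTokens where possible. */
function splitBySentence(string $paragraph, int $maxTokens): array
{
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($paragraph), -1, PREG_SPLIT_NO_EMPTY);
    $segments = [];
    $current = '';
    foreach ($sentences as $sentence) {
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;
        if ($current !== '' && countTokens($candidate) > $maxTokens) {
            $segments[] = $current; // close the segment before it overflows
            $current = $sentence;
        } else {
            $current = $candidate;  // a single over-long sentence is kept as-is
        }
    }
    if ($current !== '') {
        $segments[] = $current;
    }
    return $segments;
}

/** Chunk a list of paragraphs, prefixing each chunk with the tail of the previous one. */
function chunkParagraphs(array $paragraphs, int $maxTokens, int $overlapTokens): array
{
    $segments = [];
    foreach ($paragraphs as $paragraph) {
        $paragraph = trim($paragraph);
        if ($paragraph === '') {
            continue;
        }
        if (countTokens($paragraph) > $maxTokens) {
            array_push($segments, ...splitBySentence($paragraph, $maxTokens));
        } else {
            $segments[] = $paragraph;
        }
    }

    $chunks = [];
    $previous = null;
    foreach ($segments as $segment) {
        if ($previous !== null && $overlapTokens > 0) {
            $words = preg_split('/\s+/', $previous, -1, PREG_SPLIT_NO_EMPTY);
            $tail = implode(' ', array_slice($words, -$overlapTokens));
            $chunks[] = $tail . ' ' . $segment; // overlap is added on top of the segment
        } else {
            $chunks[] = $segment;
        }
        $previous = $segment;
    }
    return $chunks;
}
```

In this sketch the overlap is prepended after sizing, so a chunk can hold up to `$maxTokens + $overlapTokens` words. Whether the service counts overlap inside or outside `chunk_max_tokens` is not stated on this page.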

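Chunking Strategy Options

| Strategy | Description | Best suited for |
| --- | --- | --- |
| structure_aware | Chunks along document structure boundaries (recommended) | Documents with a clear section structure |
| fixed_token | Chunks of a fixed token count (with overlap) | Scanned documents, unstructured PDFs |
| sentence | Chunks at sentence boundaries | Documents dense with short paragraphs |
| paragraph | Chunks at paragraph boundaries | Press releases, reports |
| page | Chunks at page boundaries (coarsest granularity) | Documents whose pages stand alone (forms, slide decks) |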
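
For comparison, fixed-size chunking with overlap is a plain sliding window. The sketch below shows the general technique under the assumption that consecutive windows share exactly `$overlapTokens` tokens; `fixedTokenChunks` is a hypothetical name, and how the service tokenizes text or handles the final short window is not documented here.

```php
<?php
declare(strict_types=1);

/**
 * Slide a window of $maxTokens over $tokens, advancing by
 * ($maxTokens - $overlapTokens) so neighbouring chunks share the overlap.
 */
function fixedTokenChunks(array $tokens, int $maxTokens, int $overlapTokens): array
{
    if ($maxTokens < 1 || $overlapTokens < 0 || $overlapTokens >= $maxTokens) {
        throw new InvalidArgumentException('Require 0 <= overlap < max and max >= 1');
    }

    $step = $maxTokens - $overlapTokens;
    $total = count($tokens);
    $chunks = [];
    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = array_slice($tokens, $start, $maxTokens);
        if ($start + $maxTokens >= $total) {
            break; // this window already reached the end of the input
        }
    }
    return $chunks;
}
```

With the values from the sample request (512 and 64), the window advances 448 tokens at a time.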

Ingestion Progress Tracking

GET /v1/rag/ingest/jobs/{job_id}

GET /v1/rag/ingest/jobs/job_01HX8K2N3P4Q5R6S7T8U9V0W
Authorization: Bearer {jwt_token}

Response

{
  "ingest_job_id": "job_01HX8K2N3P4Q5R6S7T8U9V0W",
  "document_id": "annual-report-2024",
  "status": "embedding",
  "progress": {
    "total_chunks": 247,
    "parsed_chunks": 247,
    "embedded_chunks": 183,
    "indexed_chunks": 183,
    "percent_complete": 74
  },
  "started_at": "2025-01-15T09:30:01Z",
  "estimated_completion": "2025-01-15T09:30:08Z"
}
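
A status response like the one above can be reduced to a one-line summary for logs or a progress bar. This is a minimal sketch: `summarizeProgress` is a hypothetical helper, and it recomputes the percentage from `indexed_chunks` and `total_chunks` with integer division. That matches the sample (183 of 247 is 74%), but the formula behind the server's `percent_complete` is an assumption.

```php
<?php
declare(strict_types=1);

/** Turn the JSON body of GET /v1/rag/ingest/jobs/{job_id} into a short summary line. */
function summarizeProgress(string $json): string
{
    $job = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
    $progress = $job['progress'];

    $total = (int) $progress['total_chunks'];
    $indexed = (int) $progress['indexed_chunks'];
    // Guard against division by zero before any chunks have been counted.
    $percent = $total > 0 ? intdiv($indexed * 100, $total) : 0;

    return sprintf(
        '%s: %s, %d/%d chunks indexed (%d%%)',
        $job['document_id'],
        $job['status'],
        $indexed,
        $total,
        $percent
    );
}
```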

Status Transitions

queued → parsing → chunking → embedding → indexing → completed
                                                    ↘ failed
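
Client code that reacts to status changes can encode this lifecycle as a small transition check. The sketch below is illustrative only. The stage names come from the diagram, while the function names are hypothetical and the rule that `failed` is reachable from every non-terminal stage is an assumption, since the diagram does not say which stages can fail.

```php
<?php
declare(strict_types=1);

// Happy-path order of ingest job statuses, as shown in the diagram.
const INGEST_STAGES = ['queued', 'parsing', 'chunking', 'embedding', 'indexing', 'completed'];

/** A job in a terminal status will not change again. */
function isTerminalStatus(string $status): bool
{
    return in_array($status, ['completed', 'failed'], true);
}

/** Whether a job may move from $from to $to under the assumptions above. */
function canTransition(string $from, string $to): bool
{
    $index = array_search($from, INGEST_STAGES, true);
    if ($index === false || isTerminalStatus($from)) {
        return false; // unknown or terminal source status
    }
    if ($to === 'failed') {
        return true;  // assumption: any running stage can fail
    }
    return INGEST_STAGES[$index + 1] === $to; // otherwise only the next stage
}
```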

PHP Client

use NextPDF\Enterprise\AiRag\RagClient;
use NextPDF\Enterprise\AiRag\IngestOptions;
use NextPDF\Enterprise\AiRag\ChunkingStrategy;

$client = RagClient::fromEnvironment();

// Asynchronous ingestion
$job = $client->ingest(
    documentId: 'annual-report-2024',
    pdfBytes: file_get_contents('annual-report-2024.pdf'),
    options: IngestOptions::create()
        ->withTitle('Annual Report 2024')
        ->withTags(['financial', '2024'])
        ->withChunkingStrategy(ChunkingStrategy::StructureAware)
        ->withChunkMaxTokens(512)
        ->withChunkOverlapTokens(64)
        ->withTableExtraction(true),
);

// Wait for completion (with a timeout)
$completedJob = $client->waitForIngest(
    jobId: $job->ingestJobId(),
    timeoutSeconds: 60,
    pollIntervalMs: 500,
);

echo 'Ingestion complete: ' . $completedJob->progress()->indexedChunks() . ' chunks indexed';

PHP Compatibility

This example uses PHP 8.5 syntax. If your environment runs PHP 8.1 or 7.4, use NextPDF Backport for a backward-compatible build.
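
If you call the REST endpoint directly instead of using `waitForIngest`, the same poll-until-done behaviour can be approximated with a small loop. This is a sketch of the general pattern, not the client's actual implementation. `waitForTerminalStatus` and the `$fetchStatus` callable (which would wrap a GET to the job's `status_url` and return its `status` string) are hypothetical, and the default timeout and interval simply mirror the example above.

```php
<?php
declare(strict_types=1);

/**
 * Poll $fetchStatus until the job reaches a terminal status or the timeout expires.
 *
 * @param callable(): string $fetchStatus returns the job's current status
 */
function waitForTerminalStatus(
    callable $fetchStatus,
    int $timeoutSeconds = 60,
    int $pollIntervalMs = 500
): string {
    $deadline = microtime(true) + $timeoutSeconds;

    while (true) {
        $status = $fetchStatus();
        if ($status === 'completed' || $status === 'failed') {
            return $status; // caller decides how to treat a failed job
        }
        if (microtime(true) >= $deadline) {
            throw new RuntimeException(
                "Ingest job still '{$status}' after {$timeoutSeconds}s"
            );
        }
        usleep($pollIntervalMs * 1000); // usleep() takes microseconds
    }
}
```

The status is always checked at least once, even with a zero timeout, so a job that has already finished is never reported as timed out.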


Document Removal

DELETE /v1/rag/documents/{document_id}
Authorization: Bearer {jwt_token}
X-Tenant-ID: acme-corp-001

// Remove the document and all of its vectors (supports the GDPR right to erasure)
$client->removeDocument(documentId: 'annual-report-2024');

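Performance Specifications

| Scenario | Metric |
| --- | --- |
| Single-document ingestion (50 pages) | |
| Batch ingestion throughput (GPU) | |
| Batch ingestion throughput (CPU) | |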

Further Reading