Instructions:Create/Source/Ingest: Difference between revisions
Source ingestion workflow — graph enrichment only, no article edits |
Update source ingest extraction guidance |
Revision as of 03:31, 8 May 2026
| Instruction Metadata | |
|---|---|
| id | create-source-ingest |
| type | workflow |
| applies_to | Sources |
| task_type | source_ingest |
| priority | high |
| status | active |
| canonical | true |
| include_by_default | no |
| requires | Instructions:World Bible,Instructions:Create/Source (Base Workflow),Instructions:Schema/Source Page,Instructions:Schema/Source Talk Page |
| tags | source,ingest,create,graph-enrichment |
Purpose
This workflow governs the creation of a new Sources: page from raw ingested material — a pasted document, transcript, article, report, or other primary text. It is the entry point for the source ingestion pipeline.
Core rule: Source ingestion enriches the wiki graph. It does not trigger article edits. Do not modify encyclopedia pages during ingestion. Do not create Project: queue tasks unless explicitly instructed. Create the Sources: page and its links, then stop.
Scope
This instruction applies when:
- The agent receives raw source text to process
- The task type is source_ingest
- The Maintenance tab Source Ingestion tool submits a document
It does not apply to:
- Editing existing Sources: pages
- Creating encyclopedia articles
- Running integration reviews
Step 1 — Classify the Source
Determine the source subtype from the content and any user-provided hint. Available subtypes:
- News Article — journalism, press coverage, media reporting
- Interview — Q&A, transcript, recorded conversation
- Personal Log — diary, journal, first-person account
- Official Statement — press release, public announcement, formal declaration
- Academic Paper — research, analysis, scholarly work
- Corporate Advertisement — marketing material, promotional content
- Government Resolution — legislation, policy, formal resolution
- Government Report — official findings, agency report, census
- Propaganda Broadcast — biased mass communication, state media
- Legal Document — contract, ruling, deposition, legal filing
If the subtype is ambiguous, choose the closest match. Record your reasoning in the source summary.
Step 2 — Extract Metadata
From the raw text, extract:
- Title — a descriptive title for the Sources: page. Format: Sources:Publication or Author – Subject. Example: Sources:New Troy Tribune – Yuèmin District Unrest
- Author — individual or organisation responsible for the document
- Affiliation — the author's employer, faction, or institutional context
- Date — publication or creation date. Use in-universe dates where applicable
- Location — where the document originates or was published
- Reliability — your assessment: high / medium / low / unknown
- Bias — brief characterisation of the author's likely perspective or agenda
- Canon status — primary / secondary / disputed / non-canon
For reliability and bias, reason from the affiliation and subtype. A corporate advertisement has inherent promotional bias. A government report from a faction with known interests has institutional bias. State this plainly.
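The Step 2 fields can be pictured as a small record with "unknown" as the fallback value. The sketch below is illustrative only: the class name SourceMetadata, its field names, and the page_title helper are assumptions made for this example, not part of the ingestion tool.

```python
from dataclasses import dataclass

# Allowed values as listed in Step 2.
RELIABILITY_LEVELS = {"high", "medium", "low", "unknown"}
CANON_STATUSES = {"primary", "secondary", "disputed", "non-canon"}


@dataclass
class SourceMetadata:
    """Hypothetical container for the Step 2 fields (not a real pipeline type)."""

    publication_or_author: str
    subject: str
    canon_status: str
    author: str = "unknown"
    affiliation: str = "unknown"
    date: str = "unknown"
    location: str = "unknown"
    reliability: str = "unknown"
    bias: str = "unknown"

    def __post_init__(self) -> None:
        # Reject values outside the documented vocabularies.
        if self.reliability not in RELIABILITY_LEVELS:
            raise ValueError(f"invalid reliability: {self.reliability!r}")
        if self.canon_status not in CANON_STATUSES:
            raise ValueError(f"invalid canon status: {self.canon_status!r}")

    def page_title(self) -> str:
        # "Sources:Publication or Author – Subject", with a spaced en dash.
        return f"Sources:{self.publication_or_author} – {self.subject}"
```

Calling SourceMetadata("New Troy Tribune", "Yuèmin District Unrest", "primary").page_title() yields the example title given above, and every field that was not supplied stays "unknown" rather than blank.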
Step 3 — Extract Entities
Identify named entities in the source text:
- People (named individuals)
- Places (locations, habitats, regions, stations)
- Organisations (corporations, governments, factions, institutions)
- Events (named incidents, treaties, conflicts, discoveries)
- Technologies or artefacts (named systems, ships, technologies)
These become the Related Pages links. For each entity, determine whether an encyclopedia page exists. Red links are expected and correct — they become stub generation candidates.
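The red-link and blue-link distinction amounts to a membership test against the set of existing page titles. The following is a minimal sketch; the function name split_links and the existing_pages argument are hypothetical, and how the wiki is actually queried for existing titles is outside what this page specifies.

```python
def split_links(entities, existing_pages):
    """Partition entity names into blue links (page exists) and red links.

    `existing_pages` is an assumed, caller-supplied set of page titles.
    Duplicates are dropped and first-seen order is preserved.
    """
    seen = set()
    blue, red = [], []
    for name in entities:
        if name in seen:
            continue
        seen.add(name)
        if name in existing_pages:
            blue.append(name)
        else:
            red.append(name)
    return blue, red
```

Both lists feed the Related Pages section; the red list is kept rather than discarded, since red links are the stub generation candidates.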
Step 4 — Write the Sources: Page
Create the page at the title determined in Step 2. Use this structure:
Template block
Place the Source template at the top of the page:
{{Source
|id=
|type=<subtype from Step 1>
|author=
|affiliation=
|date=
|location=
|canonical=true
|reliability=<high|medium|low|unknown>
|bias=
|canon_status=<primary|secondary|disputed|non-canon>
|related=<comma-separated entity names>
|tags=<comma-separated lowercase tags>
}}
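As an illustration of filling the template block above from extracted values, here is a minimal sketch. The function name render_source_template and the fallback of "unknown" for any missing field are assumptions of this example; the field order simply mirrors the block above.

```python
# Field order mirrors the template block on this page.
TEMPLATE_FIELDS = [
    "id", "type", "author", "affiliation", "date", "location",
    "canonical", "reliability", "bias", "canon_status", "related", "tags",
]
LIST_FIELDS = {"related", "tags"}


def render_source_template(values):
    """Render the Source template call as wikitext (illustrative helper)."""
    lines = ["{{Source"]
    for field in TEMPLATE_FIELDS:
        # Assumption: missing fields fall back to "unknown"; canonical is "true".
        default = "true" if field == "canonical" else "unknown"
        value = values.get(field, default)
        if field in LIST_FIELDS and not isinstance(value, str):
            items = [str(item).strip() for item in value]
            if field == "tags":
                # Tags are documented as comma-separated lowercase values.
                items = [item.lower() for item in items]
            value = ",".join(items) or "unknown"
        lines.append(f"|{field}={value}")
    lines.append("}}")
    return "\n".join(lines)
```

The sketch does not escape pipe characters or braces inside values; a real implementation would need to handle those before writing the page.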
Page sections
After the template, write these sections in order:
== Source Summary ==
A 2–4 sentence neutral description of what this document is, who created it, and what it covers. Written out-of-universe (editorially), not in-universe voice.

== Document Information ==
; Type: <subtype>
; Author: <name>
; Affiliation: <organisation>
; Date: <date>
; Location: <place of origin>
; Reliability: <assessment and brief reason>
; Bias: <characterisation of perspective>

== Related Pages ==
* [[Entity One]]
* [[Entity Two]]
* [[Entity Three]]
(List all named entities from Step 3. Red links are correct and expected.)

== Content ==
The source document text, reproduced faithfully. Preserve the in-universe voice and perspective of the original. Do not editorially correct the content — bias and inaccuracy are features, not errors.
Step 5 — Do Not Edit Encyclopedia Pages
After creating the Sources: page, stop. Do not:
- Edit or expand encyclopedia articles
- Create new encyclopedia stubs (unless separately instructed)
- Append citations to existing articles
- Create Talk page entries for integration tasks
These are Stage B operations handled by the integration review workflow (Instructions:Maintenance/Source Integration Review) after human or agent review of the candidate list.
Step 6 — Report Back
After page creation, return:
- The title of the created Sources: page
- The list of related pages extracted (distinguishing red links from blue links)
- The reliability and bias assessment
- Any ambiguities or decisions made during classification
This output feeds the deterministic candidate discovery step in the UI.
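One possible shape for the report is a small JSON object carrying the four items listed above. The key names below are hypothetical; this page does not define the report format, so treat this as a sketch of the idea rather than the format the UI expects.

```python
import json


def build_report(title, blue_links, red_links, reliability, bias, notes):
    """Assemble the Step 6 report as a JSON string (key names are assumed)."""
    report = {
        "title": title,
        "related_pages": {
            "blue_links": list(blue_links),
            "red_links": list(red_links),
        },
        "reliability": reliability,
        "bias": bias,
        "ambiguities": list(notes),
    }
    # ensure_ascii=False keeps names such as "Yuèmin" readable in the output.
    return json.dumps(report, ensure_ascii=False, indent=2)
```

Keeping red and blue links in separate lists preserves the distinction that the candidate discovery step relies on.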
Constraints
- Sources: pages are immutable records once created. Do not alter the Content section after initial creation. Corrections belong on the Talk page.
- Write the Content section in the in-universe voice of the original document. The source may be wrong, biased, or propaganda. Preserve this faithfully.
- The Source Summary and Document Information sections are written out-of-universe (editorially).
- Do not invent metadata not present in or clearly inferable from the source text. Use "unknown" where necessary.
- Related Pages must be real entity names from the text, not thematic associations.
Quality Check
Before submitting the page, verify:
- The Source template is populated with no empty required fields (use "unknown" not blank)
- Related Pages contains at least one link
- Content section reproduces the source faithfully without editorial correction
- Source Summary is written out-of-universe, not in the source's voice
- Page title follows the Sources:Author/Publication – Subject format
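The mechanical parts of this checklist can be automated; faithfulness of the Content section and the voice of the Source Summary still need a reader's judgement. The sketch below is one way to do the mechanical checks. The function name quality_issues is hypothetical, and it treats every template field as required, which is an assumption since this page does not say which fields are optional.

```python
import re


def quality_issues(title, wikitext):
    """Return a list of problems found by the mechanical checks (may be empty)."""
    issues = []
    # Title: "Sources:<author or publication> – <subject>" with a spaced en dash.
    if not re.fullmatch(r"Sources:.+ – .+", title):
        issues.append("title does not follow the Sources: naming format")
    # Template: flag any "|field=" line left blank inside {{Source ... }}.
    template = re.search(r"\{\{Source\n(.*?)\n\}\}", wikitext, re.DOTALL)
    if template is None:
        issues.append("Source template not found")
    else:
        for line in template.group(1).splitlines():
            field, _, value = line.lstrip("|").partition("=")
            if not value.strip():
                issues.append(f"template field '{field}' is blank")
    # Related Pages: at least one [[link]] before the next heading.
    section = re.search(
        r"== Related Pages ==\n(.*?)(?=\n== |\Z)", wikitext, re.DOTALL
    )
    if section is None or not re.search(r"\[\[[^\]]+\]\]", section.group(1)):
        issues.append("Related Pages contains no links")
    return issues
```

An empty list means only that the mechanical checks passed; the two editorial checks above remain manual.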
Machine Extraction Guidance
The Maintenance tab Source Ingestion tool reads this page before calling the planner model. Keep this section concrete and machine-actionable.
When extracting preview metadata, return structured JSON only. The server will supply the exact JSON schema; this page defines how to choose values.
Missing Metadata
Raw pasted documents may not contain a formal metadata block. Infer cautiously from headers, signatures, datelines, document titles, and institutional voice. Use unknown when the value is absent or uncertain. Do not fail because metadata is missing.
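The fallback rule can be applied as a final normalisation pass over whatever the extraction produced. This is a minimal sketch: the field list and the function name normalize_metadata are assumptions for illustration, since the server-supplied JSON schema is not reproduced on this page.

```python
# Assumed field names; the real schema is supplied by the server.
METADATA_FIELDS = ("author", "affiliation", "date", "location", "reliability", "bias")


def normalize_metadata(raw):
    """Replace absent or blank values with "unknown" instead of failing."""
    cleaned = {}
    for field in METADATA_FIELDS:
        value = raw.get(field)
        text = str(value).strip() if value is not None else ""
        cleaned[field] = text or "unknown"
    return cleaned
```

The pass never raises on a missing key, which matches the instruction not to fail when metadata is absent.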
Entity Extraction Rules
Only include named entities that could plausibly have encyclopedia pages:
- named people
- places, habitats, stations, colonies, regions, or celestial bodies
- organizations, institutions, governments, corporations, committees, bureaus, or polities
- named events, crises, treaties, conflicts, missions, programs, or projects
- named technologies, artifacts, ships, infrastructure systems, or facilities
Prefer complete canonical names exactly as they appear in the document. Include a short acronym only when it is likely to be a useful redirect or page title, such as IANA.
Do not include:
- citation wrappers such as From, Vol, Issue, Dateline, or bare months
- journal names, volume labels, issue labels, section labels, byline labels, or document-format labels unless they are meaningful in-world institutions
- Markdown syntax, heading text that is merely a title fragment, or prose fragments
- generic nouns such as Committee, Authority, Studies, Report, or Dispatch by themselves
- incomplete noun phrases, especially phrases ending in conjunctions, prepositions, or articles such as "and", "of", "the", or "beyond the"
- broad thematic associations not explicitly named in the source
If unsure whether a phrase is an entity, exclude it unless it is central to the source.
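Several of the exclusions above are mechanical and can be applied as a post-filter on candidate names. The sketch below covers only those mechanical cases (citation wrappers, bare months, lone generic nouns, Markdown residue, dangling endings). The function name keep_entity and the word lists are illustrative assumptions; whether a name could plausibly have an encyclopedia page, or is central to the source, remains a judgement call.

```python
CITATION_WRAPPERS = {"from", "vol", "issue", "dateline"}
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
GENERIC_NOUNS = {"committee", "authority", "studies", "report", "dispatch"}
DANGLING_ENDINGS = {"and", "of", "the"}


def keep_entity(candidate):
    """Return False for candidates matching a mechanical exclusion rule."""
    name = candidate.strip()
    lowered = name.lower().rstrip(".")
    if not lowered:
        return False
    # Citation wrappers, bare months, and generic nouns standing alone.
    if lowered in CITATION_WRAPPERS | MONTHS | GENERIC_NOUNS:
        return False
    # Markdown residue such as heading markers, emphasis, or code ticks.
    if name.startswith(("#", "*", ">")) or "**" in name or "`" in name:
        return False
    # Incomplete phrases: "Bureau of", "beyond the", and similar.
    if lowered.split()[-1] in DANGLING_ENDINGS:
        return False
    return True
```

A name such as IANA passes the filter, while "Vol.", "March", "Committee", and "beyond the" are rejected.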
Source Summary Rules
Write the summary in out-of-universe editorial voice. Summarize what the document is, who appears to have produced it, and what it covers. Do not simply copy the citation header or first paragraph.