Instructions:Create/Source/Ingest: Difference between revisions

Revision as of 03:31, 8 May 2026

Instruction Metadata
id	create-source-ingest
type	workflow
applies_to	Sources
task_type	source_ingest
priority	high
status	active
canonical	true
include_by_default	no
requires	Instructions:World Bible,Instructions:Create/Source (Base Workflow),Instructions:Schema/Source Page,Instructions:Schema/Source Talk Page
tags	source,ingest,create,graph-enrichment

Purpose

This workflow governs the creation of a new Sources: page from raw ingested material — a pasted document, transcript, article, report, or other primary text. It is the entry point for the source ingestion pipeline.

Core rule: Source ingestion enriches the wiki graph. It does not trigger article edits. Do not modify encyclopedia pages during ingestion. Do not create Project: queue tasks unless explicitly instructed. Create the Sources: page and its links, then stop.

Scope

This instruction applies when:

The agent receives raw source text to process
The task type is source_ingest
The Maintenance tab Source Ingestion tool submits a document

It does not apply to:

Editing existing Sources: pages
Creating encyclopedia articles
Running integration reviews

Step 1 — Classify the Source

Determine the source subtype from the content and any user-provided hint. Available subtypes:

News Article — journalism, press coverage, media reporting
Interview — Q&A, transcript, recorded conversation
Personal Log — diary, journal, first-person account
Official Statement — press release, public announcement, formal declaration
Academic Paper — research, analysis, scholarly work
Corporate Advertisement — marketing material, promotional content
Government Resolution — legislation, policy, formal resolution
Government Report — official findings, agency report, census
Propaganda Broadcast — biased mass communication, state media
Legal Document — contract, ruling, deposition, legal filing

If the subtype is ambiguous, choose the closest match. Record your reasoning in the source summary.
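
The subtype list is closed, so it can be held as an enumeration that a classifier cannot step outside. The sketch below is illustrative only: `SourceSubtype` and `resolve_subtype` are hypothetical names, and matching a user hint by case-insensitive label is an assumption, not a rule stated in this workflow.

```python
from enum import Enum
from typing import Optional


class SourceSubtype(Enum):
    """The ten subtypes listed in Step 1."""
    NEWS_ARTICLE = "News Article"
    INTERVIEW = "Interview"
    PERSONAL_LOG = "Personal Log"
    OFFICIAL_STATEMENT = "Official Statement"
    ACADEMIC_PAPER = "Academic Paper"
    CORPORATE_ADVERTISEMENT = "Corporate Advertisement"
    GOVERNMENT_RESOLUTION = "Government Resolution"
    GOVERNMENT_REPORT = "Government Report"
    PROPAGANDA_BROADCAST = "Propaganda Broadcast"
    LEGAL_DOCUMENT = "Legal Document"


def resolve_subtype(hint: Optional[str]) -> Optional[SourceSubtype]:
    """Return the subtype whose label matches the hint, ignoring case.

    Hypothetical helper: returns None when the hint is missing or does not
    name a listed subtype, so the caller falls back to reading the content.
    """
    if not hint:
        return None
    wanted = hint.strip().lower()
    for subtype in SourceSubtype:
        if subtype.value.lower() == wanted:
            return subtype
    return None
```

A `None` result means the hint did not settle the question; the closest-match decision and its reasoning still belong in the source summary.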

Step 2 — Extract Metadata

From the raw text, extract:

Title — a descriptive title for the Sources: page. Format: Sources:Publication or Author – Subject. Example: Sources:New Troy Tribune – Yuèmin District Unrest
Author — individual or organisation responsible for the document
Affiliation — the author's employer, faction, or institutional context
Date — publication or creation date. Use in-universe dates where applicable
Location — where the document originates or was published
Reliability — your assessment: high / medium / low / unknown
Bias — brief characterisation of the author's likely perspective or agenda
Canon status — primary / secondary / disputed / non-canon

For reliability and bias, reason from the affiliation and subtype. A corporate advertisement has inherent promotional bias. A government report from a faction with known interests has institutional bias. State this plainly.
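
One way to hold the Step 2 fields is a small record type that rejects values outside the allowed sets. This is a sketch under assumptions: the class name `SourceMetadata`, the helper `build_title`, and the choice to default free-text fields to "unknown" are illustrative, not part of the wiki's tooling.

```python
from dataclasses import dataclass

RELIABILITY_VALUES = ("high", "medium", "low", "unknown")
CANON_STATUS_VALUES = ("primary", "secondary", "disputed", "non-canon")


@dataclass
class SourceMetadata:
    """Fields extracted in Step 2; free-text fields default to "unknown"."""
    title: str
    canon_status: str
    author: str = "unknown"
    affiliation: str = "unknown"
    date: str = "unknown"
    location: str = "unknown"
    reliability: str = "unknown"
    bias: str = "unknown"

    def __post_init__(self) -> None:
        # Reject values outside the sets listed in Step 2.
        if self.reliability not in RELIABILITY_VALUES:
            raise ValueError(f"invalid reliability: {self.reliability!r}")
        if self.canon_status not in CANON_STATUS_VALUES:
            raise ValueError(f"invalid canon status: {self.canon_status!r}")


def build_title(publication_or_author: str, subject: str) -> str:
    """Format: Sources:Publication or Author – Subject (en dash separator)."""
    return f"Sources:{publication_or_author.strip()} – {subject.strip()}"
```

With the example from the list above, `build_title("New Troy Tribune", "Yuèmin District Unrest")` yields the title `Sources:New Troy Tribune – Yuèmin District Unrest`.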

Step 3 — Extract Entities

Identify named entities in the source text:

People (named individuals)
Places (locations, habitats, regions, stations)
Organisations (corporations, governments, factions, institutions)
Events (named incidents, treaties, conflicts, discoveries)
Technologies or artefacts (named systems, ships, technologies)

These become the Related Pages links. For each entity, determine whether an encyclopedia page exists. Red links are expected and correct — they become stub generation candidates.
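
The existence check can be sketched as a simple partition of entity names into blue links (page exists) and red links (page missing). The function name `partition_links` is hypothetical, and the set of existing page titles is assumed to be supplied by the caller; how it is fetched from the wiki is outside this sketch.

```python
from typing import Iterable, List, Set, Tuple


def partition_links(
    entities: Iterable[str], existing_pages: Set[str]
) -> Tuple[List[str], List[str]]:
    """Split entity names into (blue_links, red_links), preserving order.

    Duplicates and blank names are dropped. Red links are kept on purpose:
    they are the stub generation candidates.
    """
    blue: List[str] = []
    red: List[str] = []
    seen: Set[str] = set()
    for name in entities:
        name = name.strip()
        if not name or name in seen:
            continue
        seen.add(name)
        if name in existing_pages:
            blue.append(name)
        else:
            red.append(name)
    return blue, red
```

Both lists go on the page as Related Pages links; the split only matters for reporting and later stub generation.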

Step 4 — Write the Sources: Page

Create the page at the title determined in Step 2. Use this structure:

Template block

Place the {{Source}} template at the top of the page:

{{Source
|id=
|type=<subtype from Step 1>
|author=
|affiliation=
|date=
|location=
|canonical=true
|reliability=<high|medium|low|unknown>
|bias=
|canon_status=<primary|secondary|disputed|non-canon>
|related=<comma-separated entity names>
|tags=<comma-separated lowercase tags>
}}
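
The template block can be generated from extracted values rather than typed by hand. The following is a minimal sketch, assuming a hypothetical `render_source_template` helper; the field order follows the block above, and blank values are written as "unknown" in line with the Quality Check. Joining lists with a bare comma is an assumption based on the comma-separated placeholders.

```python
from typing import Dict, Sequence, Union

# Field order taken from the template block above.
TEMPLATE_FIELDS = (
    "id", "type", "author", "affiliation", "date", "location",
    "canonical", "reliability", "bias", "canon_status", "related", "tags",
)

FieldValue = Union[str, Sequence[str]]


def render_source_template(values: Dict[str, FieldValue]) -> str:
    """Render the {{Source}} block, writing "unknown" for blank fields."""
    merged: Dict[str, FieldValue] = {"canonical": "true"}
    merged.update(values)
    lines = ["{{Source"]
    for name in TEMPLATE_FIELDS:
        value = merged.get(name, "")
        if not isinstance(value, str):
            # Lists (related, tags) become comma-separated strings.
            value = ",".join(value)
        value = value.strip() or "unknown"
        lines.append(f"|{name}={value}")
    lines.append("}}")
    return "\n".join(lines)
```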

Page sections

After the template, write these sections in order:

== Source Summary ==
A 2–4 sentence neutral description of what this document is, who created it,
and what it covers. Written out-of-universe (editorially), not in-universe voice.

== Document Information ==
; Type: <subtype>
; Author: <name>
; Affiliation: <organisation>
; Date: <date>
; Location: <place of origin>
; Reliability: <assessment and brief reason>
; Bias: <characterisation of perspective>

== Related Pages ==
* [[Entity One]]
* [[Entity Two]]
* [[Entity Three]]
(List all named entities from Step 3. Red links are correct and expected.)

== Content ==
The source document text, reproduced faithfully.
Preserve the in-universe voice and perspective of the original.
Do not editorially correct the content — bias and inaccuracy are features, not errors.
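
The section layout above can be assembled mechanically once the pieces exist. This sketch assumes a hypothetical `render_sections` helper; the one property it is meant to illustrate is that the Content text is appended untouched, with no trimming or correction.

```python
from typing import Dict, List

# Labels of the Document Information definition list, in order.
INFO_LABELS = (
    "Type", "Author", "Affiliation", "Date",
    "Location", "Reliability", "Bias",
)


def render_sections(
    summary: str, info: Dict[str, str], related: List[str], content: str
) -> str:
    """Build the four page sections in the required order."""
    parts = ["== Source Summary ==", summary.strip(), ""]
    parts.append("== Document Information ==")
    for label in INFO_LABELS:
        parts.append(f"; {label}: {info.get(label, 'unknown')}")
    parts.append("")
    parts.append("== Related Pages ==")
    parts.extend(f"* [[{name}]]" for name in related)
    parts.append("")
    parts.append("== Content ==")
    # The source text is reproduced as given: no stripping, no edits.
    parts.append(content)
    return "\n".join(parts)
```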

Step 5 — Do Not Edit Encyclopedia Pages

After creating the Sources: page, stop. Do not:

Edit or expand encyclopedia articles
Create new encyclopedia stubs (unless separately instructed)
Append citations to existing articles
Create Talk page entries for integration tasks

These are Stage B operations handled by the integration review workflow (Instructions:Maintenance/Source Integration Review) after human or agent review of the candidate list.

Step 6 — Report Back

After page creation, return:

The title of the created Sources: page
The list of related pages extracted (distinguishing red links from blue links)
The reliability and bias assessment
Any ambiguities or decisions made during classification

This output feeds the deterministic candidate discovery step in the UI.
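
The four items above could be returned as one structured object. The shape below is purely illustrative: the key names and the `build_report` helper are assumptions, and the real format expected by the UI is not specified on this page.

```python
import json
from typing import Dict, List


def build_report(
    title: str,
    blue_links: List[str],
    red_links: List[str],
    reliability: str,
    bias: str,
    decisions: List[str],
) -> Dict[str, object]:
    """Collect the Step 6 items; key names are hypothetical."""
    return {
        "title": title,
        "related_pages": {
            "existing": list(blue_links),
            "red_links": list(red_links),
        },
        "reliability": reliability,
        "bias": bias,
        "decisions": list(decisions),
    }


if __name__ == "__main__":
    report = build_report(
        "Sources:New Troy Tribune – Yuèmin District Unrest",
        [],
        ["New Troy Tribune"],
        "unknown",
        "unknown",
        ["Subtype chosen as closest match: News Article"],
    )
    # ensure_ascii=False keeps names such as Yuèmin readable in the output.
    print(json.dumps(report, ensure_ascii=False, indent=2))
```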

Constraints

Sources: pages are immutable records once created. Do not alter the Content section after initial creation. Corrections belong on the Talk page.
Write the Content section in the in-universe voice of the original document. The source may be wrong, biased, or propaganda. Preserve this faithfully.
The Source Summary and Document Information sections are written out-of-universe (editorially).
Do not invent metadata not present in or clearly inferable from the source text. Use "unknown" where necessary.
Related Pages must be real entity names from the text, not thematic associations.

Quality Check

Before submitting the page, verify:

The {{Source}} template is populated with no empty required fields (use "unknown", not blank)

Related Pages contains at least one link
Content section reproduces the source faithfully without editorial correction
Source Summary is written out-of-universe, not in the source's voice
Page title follows the Sources:Publication or Author – Subject format
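
Several of these checks are mechanical and can be sketched as one function that returns a list of problems. The name `quality_problems` and its parameters are hypothetical, and comparing Content to the raw text after trimming outer whitespace is an assumption about what "faithfully" permits. Whether the Source Summary is written out-of-universe is a judgment call and is not checked here.

```python
import re
from typing import Dict, List

# "Sources:" + name, an en dash surrounded by spaces, then the subject.
TITLE_PATTERN = re.compile(r"^Sources:\S.* – \S.*$")
# A wiki link such as [[Entity One]] or [[Entity One|label]].
LINK_PATTERN = re.compile(r"\[\[[^\[\]|]+(?:\|[^\[\]]*)?\]\]")


def quality_problems(
    title: str,
    template_values: Dict[str, str],
    related_section: str,
    content_section: str,
    raw_text: str,
) -> List[str]:
    """Return a list of failed checks; an empty list means all passed."""
    problems: List[str] = []
    for name, value in template_values.items():
        if not str(value).strip():
            problems.append(f"template field '{name}' is blank; use unknown")
    if not LINK_PATTERN.search(related_section):
        problems.append("Related Pages contains no links")
    if content_section.strip() != raw_text.strip():
        problems.append("Content differs from the raw source text")
    if not TITLE_PATTERN.match(title):
        problems.append("title does not follow the Sources: title format")
    return problems
```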

Machine Extraction Guidance

The Maintenance tab Source Ingestion tool reads this page before calling the planner model. Keep this section concrete and machine-actionable.

When extracting preview metadata, return structured JSON only. The server will supply the exact JSON schema; this page defines how to choose values.

Missing Metadata

Raw pasted documents may not contain a formal metadata block. Infer cautiously from headers, signatures, datelines, document titles, and institutional voice. Use unknown when the value is absent or uncertain. Do not fail because metadata is missing.
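
The "use unknown, do not fail" rule can be sketched as a normalisation pass over the model's JSON output. The key list below is an assumption for illustration only, since the server supplies the real schema; `normalise_preview` is a hypothetical helper.

```python
import json
from typing import Dict

# Hypothetical key list; the server-supplied schema is authoritative.
PREVIEW_KEYS = (
    "title", "author", "affiliation", "date",
    "location", "reliability", "bias",
)


def normalise_preview(raw_json: str) -> Dict[str, str]:
    """Fill absent, null, or blank values with "unknown" instead of failing."""
    data = json.loads(raw_json)
    result: Dict[str, str] = {}
    for key in PREVIEW_KEYS:
        value = data.get(key)
        text = "" if value is None else str(value).strip()
        result[key] = text or "unknown"
    return result
```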

Entity Extraction Rules

Only include named entities that could plausibly have encyclopedia pages:

named people
places, habitats, stations, colonies, regions, or celestial bodies
organisations, institutions, governments, corporations, committees, bureaus, or polities
named events, crises, treaties, conflicts, missions, programs, or projects
named technologies, artefacts, ships, infrastructure systems, or facilities

Prefer complete canonical names exactly as they appear in the document. Include a short acronym only when it is likely to be a useful redirect or page title, such as IANA.

Do not include:

citation wrappers such as From, Vol, Issue, Dateline, or bare months
journal names, volume labels, issue labels, section labels, byline labels, or document-format labels unless they are meaningful in-world institutions
Markdown syntax, heading text that is merely a title fragment, or prose fragments
generic nouns such as Committee, Authority, Studies, Report, or Dispatch by themselves
incomplete noun phrases, especially phrases ending in conjunctions or articles such as and, of, the, or beyond the
broad thematic associations not explicitly named in the source

If unsure whether a phrase is an entity, exclude it unless it is central to the source.
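
Some of the exclusions above are mechanical enough to sketch as a filter. This is a partial illustration, not a complete implementation of the rules: `is_plausible_entity` is a hypothetical name, the word lists are taken from the examples above plus a few assumed additions (real-world month names, extra articles), and judgment-based rules such as "broad thematic associations" are not covered.

```python
import re

CITATION_WRAPPERS = {"from", "vol", "issue", "dateline"}
# Assumption: real-world month names; an in-universe calendar may differ.
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
GENERIC_NOUNS = {"committee", "authority", "studies", "report", "dispatch"}
DANGLING_ENDINGS = {"and", "of", "the", "or", "a", "an"}
MARKUP_CHARS = re.compile(r"[#*_`\[\]]")


def is_plausible_entity(phrase: str) -> bool:
    """Apply the mechanical exclusion rules to one candidate phrase."""
    text = phrase.strip()
    if not text or MARKUP_CHARS.search(text):
        return False
    words = text.lower().rstrip(".").split()
    if not words:
        return False
    # Single-word wrappers, bare months, and generic nouns are excluded.
    if len(words) == 1 and words[0] in CITATION_WRAPPERS | MONTHS | GENERIC_NOUNS:
        return False
    # Phrases that trail off on a conjunction or article are incomplete.
    if words[-1] in DANGLING_ENDINGS:
        return False
    return True
```

A filter like this only removes obvious noise; a phrase that passes still has to be a named entity that actually appears in the source.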

Source Summary Rules

Write the summary in out-of-universe editorial voice. Summarise what the document is, who appears to have produced it, and what it covers. Do not simply copy the citation header or first paragraph.

[[Category:Instructions/Workflows]]
[[Category:Instructions/Source]]