Instructions:Create/Source/Ingest: Difference between revisions

From Encyclopedia Ephemera
Source ingestion workflow — graph enrichment only, no article edits
 
Update source ingest extraction guidance

Revision as of 03:31, 8 May 2026

Instruction Metadata
id create-source-ingest
type workflow
applies_to Sources
task_type source_ingest
priority high
status active
canonical true
include_by_default no
requires Instructions:World Bible,Instructions:Create/Source (Base Workflow),Instructions:Schema/Source Page,Instructions:Schema/Source Talk Page
tags source,ingest,create,graph-enrichment


Purpose

This workflow governs the creation of a new Sources: page from raw ingested material — a pasted document, transcript, article, report, or other primary text. It is the entry point for the source ingestion pipeline.

Core rule: Source ingestion enriches the wiki graph. It does not trigger article edits. Do not modify encyclopedia pages during ingestion. Do not create Project: queue tasks unless explicitly instructed. Create the Sources: page and its links, then stop.

Scope

This instruction applies when:

  • The agent receives raw source text to process
  • The task type is source_ingest
  • The Maintenance tab Source Ingestion tool submits a document

It does not apply to:

  • Editing existing Sources: pages
  • Creating encyclopedia articles
  • Running integration reviews

Step 1 — Classify the Source

Determine the source subtype from the content and any user-provided hint. Available subtypes:

  • News Article — journalism, press coverage, media reporting
  • Interview — Q&A, transcript, recorded conversation
  • Personal Log — diary, journal, first-person account
  • Official Statement — press release, public announcement, formal declaration
  • Academic Paper — research, analysis, scholarly work
  • Corporate Advertisement — marketing material, promotional content
  • Government Resolution — legislation, policy, formal resolution
  • Government Report — official findings, agency report, census
  • Propaganda Broadcast — biased mass communication, state media
  • Legal Document — contract, ruling, deposition, legal filing

If the subtype is ambiguous, choose the closest match. Record your reasoning in the source summary.
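A minimal sketch of how a user-provided hint might be mapped onto the subtypes listed above. The names `SUBTYPES` and `normalize_subtype` are hypothetical and not part of the ingestion tool. Returning `None` for an unrecognised hint, so that classification falls back to the content itself, is an assumption.

```python
from typing import Optional

# The ten subtypes listed in Step 1.
SUBTYPES = (
    "News Article", "Interview", "Personal Log", "Official Statement",
    "Academic Paper", "Corporate Advertisement", "Government Resolution",
    "Government Report", "Propaganda Broadcast", "Legal Document",
)


def normalize_subtype(hint: Optional[str]) -> Optional[str]:
    """Map a free-text hint to a listed subtype, ignoring case and spacing.

    Returns None when the hint matches nothing, so the caller can fall
    back to classifying from the content (hypothetical behaviour).
    """
    if not hint:
        return None
    key = " ".join(hint.split()).casefold()
    for subtype in SUBTYPES:
        if subtype.casefold() == key:
            return subtype
    return None
```

A hint that does not match exactly is ignored rather than guessed at. Choosing the closest match for an ambiguous source remains a judgement recorded in the source summary.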

Step 2 — Extract Metadata

From the raw text, extract:

  • Title — a descriptive title for the Sources: page. Format: Sources:Publication or Author – Subject. Example: Sources:New Troy Tribune – Yuèmin District Unrest
  • Author — individual or organisation responsible for the document
  • Affiliation — the author's employer, faction, or institutional context
  • Date — publication or creation date. Use in-universe dates where applicable
  • Location — where the document originates or was published
  • Reliability — your assessment: high / medium / low / unknown
  • Bias — brief characterisation of the author's likely perspective or agenda
  • Canon status — primary / secondary / disputed / non-canon

For reliability and bias, reason from the affiliation and subtype. A corporate advertisement has inherent promotional bias. A government report from a faction with known interests has institutional bias. State this plainly.
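The fields above could be held in a small record like the following sketch. The class name `SourceMetadata`, its field names, and the choice to default absent values to `unknown` are illustrative assumptions. Only the allowed reliability and canon status values and the title format come from this page.

```python
from dataclasses import dataclass

# Allowed values as listed in Step 2.
RELIABILITY = {"high", "medium", "low", "unknown"}
CANON_STATUS = {"primary", "secondary", "disputed", "non-canon"}


@dataclass
class SourceMetadata:
    """Hypothetical container for the Step 2 fields."""
    publication_or_author: str
    subject: str
    canon_status: str
    author: str = "unknown"
    affiliation: str = "unknown"
    date: str = "unknown"
    location: str = "unknown"
    reliability: str = "unknown"
    bias: str = "unknown"

    def __post_init__(self) -> None:
        # Reject values outside the lists given in Step 2.
        if self.reliability not in RELIABILITY:
            raise ValueError(f"invalid reliability: {self.reliability!r}")
        if self.canon_status not in CANON_STATUS:
            raise ValueError(f"invalid canon status: {self.canon_status!r}")

    @property
    def title(self) -> str:
        # Format: Sources:Publication or Author – Subject (en dash).
        return f"Sources:{self.publication_or_author} – {self.subject}"
```

With the example from the list above, `SourceMetadata("New Troy Tribune", "Yuèmin District Unrest", "primary").title` gives `Sources:New Troy Tribune – Yuèmin District Unrest`.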

Step 3 — Extract Entities

Identify named entities in the source text:

  • People (named individuals)
  • Places (locations, habitats, regions, stations)
  • Organisations (corporations, governments, factions, institutions)
  • Events (named incidents, treaties, conflicts, discoveries)
  • Technologies or artefacts (named systems, ships, technologies)

These become the Related Pages links. For each entity, determine whether an encyclopedia page exists. Red links are expected and correct — they become stub generation candidates.

Step 4 — Write the Sources: Page

Create the page at the title determined in Step 2. Use this structure:

Template block

Place the {{Source}} template at the top of the page:
{{Source
|id=
|type=<subtype from Step 1>
|author=
|affiliation=
|date=
|location=
|canonical=true
|reliability=<high|medium|low|unknown>
|bias=
|canon_status=<primary|secondary|disputed|non-canon>
|related=<comma-separated entity names>
|tags=<comma-separated lowercase tags>
}}
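As a sketch, the template block could be rendered from extracted values as follows. The function name `render_source_template` and the input dictionary shape are hypothetical. The parameter names and their order are taken from the block above, and blank values are written as `unknown`.

```python
# Parameter names in the order shown in the template block above.
TEMPLATE_PARAMS = (
    "id", "type", "author", "affiliation", "date", "location",
    "canonical", "reliability", "bias", "canon_status", "related", "tags",
)


def render_source_template(values: dict) -> str:
    """Render the {{Source}} block; blank fields become 'unknown'."""
    lines = ["{{Source"]
    for name in TEMPLATE_PARAMS:
        value = values.get(name, "")
        if name == "canonical":
            value = "true"  # fixed value in the template above
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)  # comma-separated lists
        value = str(value).strip()
        if name == "tags":
            value = value.lower()  # tags are lowercase
        lines.append(f"|{name}={value or 'unknown'}")
    lines.append("}}")
    return "\n".join(lines)
```

This sketch does not escape `|` or `}}` inside values. A real implementation would need to handle those before writing wikitext.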

Page sections

After the template, write these sections in order:

== Source Summary ==
A 2–4 sentence neutral description of what this document is, who created it,
and what it covers. Written out-of-universe (editorially), not in-universe voice.

== Document Information ==
; Type: <subtype>
; Author: <name>
; Affiliation: <organisation>
; Date: <date>
; Location: <place of origin>
; Reliability: <assessment and brief reason>
; Bias: <characterisation of perspective>

== Related Pages ==
* [[Entity One]]
* [[Entity Two]]
* [[Entity Three]]
(List all named entities from Step 3. Red links are correct and expected.)

== Content ==
The source document text, reproduced faithfully.
Preserve the in-universe voice and perspective of the original.
Do not editorially correct the content — bias and inaccuracy are features, not errors.
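The section order above can be assembled mechanically. The following is a minimal sketch. `build_page_body` and its arguments are hypothetical names. The source text is appended unchanged so that the Content section stays faithful to the original.

```python
# Definition-list keys from the Document Information section above.
INFO_KEYS = (
    "Type", "Author", "Affiliation", "Date",
    "Location", "Reliability", "Bias",
)


def build_page_body(summary: str, info: dict, entities: list, content: str) -> str:
    """Assemble the four page sections in the required order."""
    lines = ["== Source Summary ==", summary.strip(), ""]
    lines.append("== Document Information ==")
    # Missing values are written as 'unknown' rather than left blank.
    lines += [f"; {key}: {info.get(key, 'unknown')}" for key in INFO_KEYS]
    lines += ["", "== Related Pages =="]
    lines += [f"* [[{name}]]" for name in entities]
    # The source text is not stripped, reflowed, or corrected.
    lines += ["", "== Content ==", content]
    return "\n".join(lines)
```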

Step 5 — Do Not Edit Encyclopedia Pages

After creating the Sources: page, stop. Do not:

  • Edit or expand encyclopedia articles
  • Create new encyclopedia stubs (unless separately instructed)
  • Append citations to existing articles
  • Create Talk page entries for integration tasks

These are Stage B operations handled by the integration review workflow (Instructions:Maintenance/Source Integration Review) after human or agent review of the candidate list.

Step 6 — Report Back

After page creation, return:

  • The title of the created Sources: page
  • The list of related pages extracted (distinguishing red links from blue links)
  • The reliability and bias assessment
  • Any ambiguities or decisions made during classification

This output feeds the deterministic candidate discovery step in the UI.
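This page does not define the report format. As an illustration only, the four items could be returned as JSON along these lines. The key names, the function `build_report`, and the `existing_pages` set (standing in for a lookup against the wiki) are all assumptions.

```python
import json


def build_report(title: str, entities: list, existing_pages: set,
                 reliability: str, bias: str, notes: list) -> str:
    """Serialise the Step 6 report; key names are hypothetical."""
    report = {
        "title": title,
        "related_pages": {
            # Blue links have an encyclopedia page; red links do not yet.
            "blue_links": [e for e in entities if e in existing_pages],
            "red_links": [e for e in entities if e not in existing_pages],
        },
        "reliability": reliability,
        "bias": bias,
        "classification_notes": notes,
    }
    # ensure_ascii=False keeps names such as "Yuèmin" readable.
    return json.dumps(report, ensure_ascii=False, indent=2)
```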

Constraints

  • Sources: pages are immutable records once created. Do not alter the Content section after initial creation. Corrections belong on the Talk page.
  • Reproduce the Content section in the in-universe voice of the original document. The source may be wrong, biased, or propaganda. Preserve this faithfully.
  • The Source Summary and Document Information sections are written out-of-universe (editorially).
  • Do not invent metadata not present in or clearly inferable from the source text. Use "unknown" where necessary.
  • Related Pages must be real entity names from the text, not thematic associations.

Quality Check

Before submitting the page, verify:

  • {{Source}} template is populated with no empty required fields (use "unknown", not blank)
  • Related Pages contains at least one link
  • Content section reproduces the source faithfully without editorial correction
  • Source Summary is written out-of-universe, not in the source's voice
  • Page title follows the Sources:Author/Publication – Subject format
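Some of these checks can be automated. The sketch below covers only the mechanical ones: title format, blank template fields, and at least one related link. Faithful reproduction and out-of-universe voice still need a reader. The function name `quality_issues` is hypothetical. Since this page does not say which fields are required, the sketch flags every blank template parameter.

```python
import re


def quality_issues(title: str, page_text: str) -> list:
    """Return a list of mechanical problems; an empty list means none found."""
    issues = []
    # Title: Sources:<publication or author> – <subject>, with an en dash.
    if not re.match(r"^Sources:.+ – .+$", title):
        issues.append("title does not follow the Sources: format")
    # Only inspect the template block, i.e. everything before the first "}}".
    template = page_text.split("}}", 1)[0]
    for match in re.finditer(r"^\|(\w+)=(.*)$", template, re.MULTILINE):
        if not match.group(2).strip():
            issues.append(f"empty template field: {match.group(1)}")
    # Related Pages must contain at least one [[link]].
    related = re.search(
        r"== Related Pages ==\n(.*?)(?=\n== |\Z)", page_text, re.DOTALL
    )
    if related is None or "[[" not in related.group(1):
        issues.append("Related Pages has no links")
    return issues
```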

Machine Extraction Guidance

The Maintenance tab Source Ingestion tool reads this page before calling the planner model. Keep this section concrete and machine-actionable.

When extracting preview metadata, return structured JSON only. The server will supply the exact JSON schema; this page defines how to choose values.

Missing Metadata

Raw pasted documents may not contain a formal metadata block. Infer cautiously from headers, signatures, datelines, document titles, and institutional voice. Use unknown when the value is absent or uncertain. Do not fail because metadata is missing.

Entity Extraction Rules

Only include named entities that could plausibly have encyclopedia pages:

  • named people
  • places, habitats, stations, colonies, regions, or celestial bodies
  • organizations, institutions, governments, corporations, committees, bureaus, or polities
  • named events, crises, treaties, conflicts, missions, programs, or projects
  • named technologies, artifacts, ships, infrastructure systems, or facilities

Prefer complete canonical names exactly as they appear in the document. Include a short acronym only when it is likely to be a useful redirect or page title, such as IANA.

Do not include:

  • citation wrappers such as From, Vol, Issue, Dateline, or bare months
  • journal names, volume labels, issue labels, section labels, byline labels, or document-format labels unless they are meaningful in-world institutions
  • Markdown syntax, heading text that is merely a title fragment, or prose fragments
  • generic nouns such as Committee, Authority, Studies, Report, or Dispatch by themselves
  • incomplete noun phrases, especially phrases ending in conjunctions or articles such as and, of, the, or beyond the
  • broad thematic associations not explicitly named in the source

If unsure whether a phrase is an entity, exclude it unless it is central to the source.
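Several of the exclusions above are mechanical and could be applied as a post-filter on candidate phrases. The following is a minimal sketch under that assumption. The function `keep_entity` and the word lists are illustrative, built from the examples on this page. Whether a journal name is a meaningful in-world institution, or a phrase is central to the source, cannot be decided this way.

```python
import re

CITATION_WRAPPERS = {"from", "vol", "issue", "dateline"}
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
GENERIC_NOUNS = {"committee", "authority", "studies", "report", "dispatch"}
DANGLING_ENDINGS = {"and", "of", "the"}


def keep_entity(candidate: str) -> bool:
    """Return False for candidates the exclusion list rules out."""
    text = candidate.strip()
    # Markdown syntax is never an entity name.
    if not text or re.search(r"[#*`\[\]]", text):
        return False
    words = text.casefold().rstrip(".").split()
    if not words:
        return False
    # Bare citation wrappers, months, and generic nouns by themselves.
    if len(words) == 1 and words[0] in CITATION_WRAPPERS | MONTHS | GENERIC_NOUNS:
        return False
    # Incomplete phrases ending in a conjunction or article.
    if words[-1] in DANGLING_ENDINGS:
        return False
    return True
```

A candidate that passes this filter is only a plausible entity. The rule to exclude uncertain phrases unless they are central to the source still applies.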

Source Summary Rules

Write the summary in out-of-universe editorial voice. Summarize what the document is, who appears to have produced it, and what it covers. Do not simply copy the citation header or first paragraph.