Instructions:Create/Source/Ingest: Difference between revisions

From Encyclopedia Ephemera
Document explicit wikilink ingestion signal
m 1 revision imported
 
(One intermediate revision by one other user not shown)
|id=create-source-ingest
|type=workflow
|applies_to=Sources
|applies_to=ingest
|task_type=source_ingest
|priority=high
|status=active
|canonical=true
|requires=Instructions:World Bible,Instructions:Create/Source (Base Workflow),Instructions:Schema/Source Page,Instructions:Schema/Source Talk Page
|include_by_default=yes
|tags=source,ingest,create,graph-enrichment
|tags=source,ingest,entities,pre-linking
}}
}}


== Purpose ==
= Summary =
Workflow for ingesting source documents into Encyclopedia Ephemera. Extracts named entities, creates a Sources: page, and links it to related encyclopedia pages. Entity extraction uses a staged pipeline: deterministic wikilink detection first, then optional LLM calibration guided by this page's configuration.


This workflow governs the creation of a new Sources: page from raw ingested material — a pasted document, transcript, article, report, or other primary text. It is the entry point for the source ingestion pipeline.
= Workflow =


'''Core rule:''' Source ingestion enriches the wiki graph. It does '''not''' trigger article edits. Do not modify encyclopedia pages during ingestion. Do not create Project: queue tasks unless explicitly instructed. Create the Sources: page and its links, then stop.
# User fills in source metadata (title, author, date, type, publisher, url, summary) in the ingestion form.
# User pastes the source document.
# System extracts explicit <code><nowiki>[[wikilinks]]</nowiki></code> from the document and validates each against the wiki.
# If AI extraction is enabled: LLM identifies additional named entities using the entity type schema below.
# User reviews the three-group entity list (Known / Suggested / Low Confidence), adds or removes entities.
# User confirms. System creates the Sources: page with metadata, Related Pages section, and provenance comment.


== Scope ==
= Integration Decisions =


This instruction applies when:
After source creation, the integration review workflow (see <code>Instructions:Maintenance/Source Integration Review</code>) evaluates how related encyclopedia pages should be updated. Valid decisions per candidate:
* The agent receives raw source text to process
* The task type is <code>source_ingest</code>
* The Maintenance tab Source Ingestion tool submits a document


It does '''not''' apply to:
; no_action : Article already covers this information adequately.
* Editing existing Sources: pages
; citation_only : Add a citation link only; no content change needed.
* Creating encyclopedia articles
; citation_with_note : Add citation plus a brief inline note.
* Running integration reviews
; expansion_needed : Article requires substantive expansion using this source.
; contradiction_review : Source conflicts with existing article content; flag for human review.
; new_page : Entity appears in source but has no encyclopedia article yet.
; defer : Insufficient information to decide now; revisit when more sources exist.


== Step 1 — Classify the Source ==
Most candidates should receive <code>no_action</code>. Only create integration tasks for clear, high-confidence needs.


Determine the source subtype from the content and any user-provided hint. Available subtypes:
== Pre-Linking Configuration ==


* '''News Article''' — journalism, press coverage, media reporting
PHP reads the values below at ingestion time. Edit these to tune behaviour for your deployment.
* '''Interview''' — Q&A, transcript, recorded conversation
* '''Personal Log''' — diary, journal, first-person account
* '''Official Statement''' — press release, public announcement, formal declaration
* '''Academic Paper''' — research, analysis, scholarly work
* '''Corporate Advertisement''' — marketing material, promotional content
* '''Government Resolution''' — legislation, policy, formal resolution
* '''Government Report''' — official findings, agency report, census
* '''Propaganda Broadcast''' — biased mass communication, state media
* '''Legal Document''' — contract, ruling, deposition, legal filing


If the subtype is ambiguous, choose the closest match. Record your reasoning in the source summary.
; source_subtypes : Available source type options for the ingestion form dropdown.
: News Article
: Interview
: Personal Log
: Official Statement
: Academic Paper
: Corporate Advertisement
: Government Resolution
: Government Report
: Propaganda Broadcast
: Legal Document


== Step 2 — Extract Metadata ==
; pre_link_min_title_length : Skip wiki titles shorter than this character count. Default: 4.
: 4


From the raw text, extract:
; pre_link_stoplist : Wiki page titles to skip even when they appear in a document. These are titles that match too broadly — real words that are also article names but shouldn't be auto-linked.
: Source
: Project
: Help
: Template
: Category
: Sol
: Earth
: Mars
: Energy
: Field
: Law
: Station


* '''Title''' — a descriptive title for the Sources: page. Format: <code>Sources:Publication or Author – Subject</code>. Example: <code>Sources:New Troy Tribune – Yuèmin District Unrest</code>
; pre_link_prefixes : Honorific and title prefixes to strip when matching entity names. One entry per line.
* '''Author''' — individual or organisation responsible for the document
: Dr.
* '''Affiliation''' — the author's employer, faction, or institutional context
: Prof.
* '''Date''' — publication or creation date. Use in-universe dates where applicable
: Cmdr.
* '''Location''' — where the document originates or was published
: Admiral
* '''Reliability''' — your assessment: high / medium / low / unknown
: Captain
* '''Bias''' — brief characterisation of the author's likely perspective or agenda
: Director
* '''Canon status''' — primary / secondary / disputed / non-canon
: Chief
: Minister
: Secretary
: Commissioner
: The
: A
: An


For reliability and bias, reason from the affiliation and subtype. A corporate advertisement has inherent promotional bias. A government report from a faction with known interests has institutional bias. State this plainly.
== Entity Ontology ==


== Step 3 — Extract Entities ==
These fields tell the LLM what kinds of named entities Encyclopedia Ephemera tracks, and provide examples to anchor its extraction. Edit examples as the wiki grows.


Identify named entities in the source text:
; entity_types : Type schema for LLM entity extraction. Format: <code>TypeName: short description</code>.
: People: named individuals — characters, officials, journalists, scientists, historical figures
: Places: locations, regions, habitats, stations, settlements, orbital structures, planetary bodies
: Organisations: factions, corporations, governments, institutions, fleets, unions, authorities
: Events: named incidents, treaties, conflicts, discoveries, programmes, missions, crises
: Technologies: named systems, vessel classes, devices, protocols, artefacts, programmes


* People (named individuals)
; example_entities : Hand-curated examples per type, used to anchor LLM extraction. Format: <code>TypeName: Example1, Example2, Example3</code>.
* Places (locations, habitats, regions, stations)
: People: Alex Chambers, Maya Sato, Director Chen Wei
* Organisations (corporations, governments, factions, institutions)
: Places: New Troy, AquaNebula, Arcadia, Yuemin District
* Events (named incidents, treaties, conflicts, discoveries)
: Organisations: Jovian Union, MercuryLink, Hegemony Worlds Authority
* Technologies or artefacts (named systems, ships, technologies)
: Events: Yuemin District Unrest
: Technologies: Asterion Protocol


These become the '''Related Pages''' links. For each entity, determine whether an encyclopedia page exists. Red links are expected and correct — they become stub generation candidates.
== LLM Extraction ==


== Step 4 — Write the Sources: Page ==
; boilerplate_filter_instruction : Appended verbatim to the LLM entity extraction prompt.
 
: Do NOT return: volume numbers, issue numbers, page numbers, journal names, publisher names, citation fragments, partial strings, dates, generic terms, or common English words that are not proper nouns. Do NOT return single letters, abbreviations without clear referents, or entries from the pre-link stoplist above.
Create the page at the title determined in Step 2. Use this structure:
 
=== Template block ===
 
Place the <code><nowiki>{{Source}}</nowiki></code> template at the top of the page:
 
<pre><nowiki>
{{Source
|id=
|type=<subtype from Step 1>
|author=
|affiliation=
|date=
|location=
|canonical=true
|reliability=<high|medium|low|unknown>
|bias=
|canon_status=<primary|secondary|disputed|non-canon>
|related=<comma-separated entity names>
|tags=<comma-separated lowercase tags>
}}
</nowiki></pre>
 
=== Page sections ===
 
After the template, write these sections in order:
 
<pre><nowiki>
== Source Summary ==
A 2–4 sentence neutral description of what this document is, who created it,
and what it covers. Written out-of-universe (editorially), not in-universe voice.
 
== Document Information ==
; Type: <subtype>
; Author: <name>
; Affiliation: <organisation>
; Date: <date>
; Location: <place of origin>
; Reliability: <assessment and brief reason>
; Bias: <characterisation of perspective>
 
== Related Pages ==
* [[Entity One]]
* [[Entity Two]]
* [[Entity Three]]
(List all named entities from Step 3. Red links are correct and expected.)
 
== Content ==
The source document text, reproduced faithfully.
Preserve the in-universe voice and perspective of the original.
Do not editorially correct the content — bias and inaccuracy are features, not errors.
</nowiki></pre>
 
== Step 5 — Do Not Edit Encyclopedia Pages ==
 
After creating the Sources: page, stop. Do not:
 
* Edit or expand encyclopedia articles
* Create new encyclopedia stubs (unless separately instructed)
* Append citations to existing articles
* Create Talk page entries for integration tasks
 
These are Stage B operations handled by the integration review workflow
(<code>Instructions:Maintenance/Source Integration Review</code>) after human or
agent review of the candidate list.
 
== Step 6 — Report Back ==
 
After page creation, return:
 
* The title of the created Sources: page
* The list of related pages extracted (distinguishing red links from blue links)
* The reliability and bias assessment
* Any ambiguities or decisions made during classification
 
This output feeds the deterministic candidate discovery step in the UI.
 
== Constraints ==
 
* Sources: pages are '''immutable records''' once created. Do not alter the Content section after initial creation. Corrections belong on the Talk page.
* Write the Content section in the '''in-universe voice''' of the original document. The source may be wrong, biased, or propaganda. Preserve this faithfully.
* The Source Summary and Document Information sections are written '''out-of-universe''' (editorially).
* Do not invent metadata not present in or clearly inferable from the source text. Use "unknown" where necessary.
* Related Pages must be real entity names from the text, not thematic associations.
 
== Quality Check ==
 
Before submitting the page, verify:
 
* <code><nowiki>{{Source}}</nowiki></code> template is populated with no empty required fields (use "unknown", not blank)
* Related Pages contains at least one link
* Content section reproduces the source faithfully without editorial correction
* Source Summary is written out-of-universe, not in the source's voice
* Page title follows the <code>Sources:Author/Publication – Subject</code> format


[[Category:Instructions]]
[[Category:Instructions/Workflows]]
[[Category:Workflow Instructions]]
[[Category:Instructions/Source]]
 
== Machine Extraction Guidance ==
 
The Maintenance tab Source Ingestion tool reads this page before calling the planner model. Keep this section concrete and machine-actionable.
 
When extracting preview metadata, return structured JSON only. The server will supply the exact JSON schema; this page defines how to choose values.
 
=== Missing Metadata ===
 
Raw pasted documents may not contain a formal metadata block. Infer cautiously from headers, signatures, datelines, document titles, and institutional voice. Use <code>unknown</code> when the value is absent or uncertain. Do not fail because metadata is missing.
 
=== Entity Extraction Rules ===
 
Only include named entities that could plausibly have encyclopedia pages:
 
* named people
* places, habitats, stations, colonies, regions, or celestial bodies
* organizations, institutions, governments, corporations, committees, bureaus, or polities
* named events, crises, treaties, conflicts, missions, programs, or projects
* named technologies, artifacts, ships, infrastructure systems, or facilities
 
Prefer complete canonical names exactly as they appear in the document. Include a short acronym only when it is likely to be a useful redirect or page title, such as <code>IANA</code>.
 
Do not include:
 
* citation wrappers such as <code>From</code>, <code>Vol</code>, <code>Issue</code>, <code>Dateline</code>, or bare months
* journal names, volume labels, issue labels, section labels, byline labels, or document-format labels unless they are meaningful in-world institutions
* Markdown syntax, heading text that is merely a title fragment, or prose fragments
* generic nouns such as <code>Committee</code>, <code>Authority</code>, <code>Studies</code>, <code>Report</code>, or <code>Dispatch</code> by themselves
* incomplete noun phrases, especially phrases ending in a conjunction, preposition, or article, such as <code>and</code>, <code>of</code>, <code>the</code>, or <code>beyond the</code>
* broad thematic associations not explicitly named in the source
 
If unsure whether a phrase is an entity, exclude it unless it is central to the source.
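
A minimal sketch of how the exclusion rules above could be applied as a deterministic pre-filter. It is written in Python for illustration only; the function name <code>is_plausible_entity</code> and the word sets are assumptions drawn from the examples in this section, not the ingestion tool's actual code.

```python
# Word lists are taken from the examples in this section; a real deployment
# would likely need longer lists. All names here are hypothetical.
GENERIC_NOUNS = {"committee", "authority", "studies", "report", "dispatch"}
CITATION_WRAPPERS = {"from", "vol", "issue", "dateline"}
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}
DANGLING_ENDINGS = {"and", "of", "the"}


def is_plausible_entity(phrase: str) -> bool:
    """Return False for phrases the exclusion rules would reject."""
    words = [w.lower().strip(".,:;") for w in phrase.split()]
    if not words:
        return False
    # A bare generic noun, citation wrapper, or month is not an entity.
    if len(words) == 1 and words[0] in GENERIC_NOUNS | CITATION_WRAPPERS | MONTHS:
        return False
    # Incomplete noun phrases end in a conjunction, preposition, or article.
    if words[-1] in DANGLING_ENDINGS:
        return False
    return True


for candidate in ["Hegemony Worlds Authority", "Authority", "Committee of", "Vol."]:
    print(candidate, is_plausible_entity(candidate))
```

A filter like this only removes obvious noise. The judgment call in the last rule (exclude unless central to the source) still needs a reviewer or the model.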
 
=== Source Summary Rules ===
 
Write the summary in out-of-universe editorial voice. Summarize what the document is, who appears to have produced it, and what it covers. Do not simply copy the citation header or first paragraph.


=== Explicit Editor-Authored Wikilinks ===
If the pasted source contains wikitext links such as <code><nowiki>[[Page Title]]</nowiki></code> or <code><nowiki>[[Page Title|display text]]</nowiki></code>, treat the link target as an intentional Related Pages entity. Include these linked page titles even if other extraction heuristics would omit them. This lets editors force known integration targets by adding links before previewing the source.
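
The rule above can be sketched as a small extraction helper. This is an illustrative Python sketch, not the tool's implementation; the function name <code>extract_wikilink_targets</code> and the choice to drop section anchors are assumptions.

```python
import re

# Matches [[Target]] and [[Target|display text]]; only the target is captured.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_wikilink_targets(text: str) -> list[str]:
    """Return unique link targets in order of first appearance."""
    seen = set()
    targets = []
    for match in WIKILINK_RE.finditer(text):
        # Assumption: a "#section" suffix is not part of the page title.
        target = match.group(1).split("#", 1)[0].strip()
        if target and target not in seen:
            seen.add(target)
            targets.append(target)
    return targets


sample = (
    "Unrest spread through [[Yuemin District]] after "
    "[[Jovian Union|the Union]] withdrew. See [[Yuemin District]]."
)
print(extract_wikilink_targets(sample))  # ['Yuemin District', 'Jovian Union']
```

Each returned target would then be added to Related Pages regardless of what the other heuristics decide.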

Latest revision as of 16:29, 12 May 2026

Instruction Metadata
id create-source-ingest
type workflow
applies_to ingest
task_type
priority high
status active
canonical true
include_by_default yes
requires
tags source,ingest,entities,pre-linking


Summary

Workflow for ingesting source documents into Encyclopedia Ephemera. Extracts named entities, creates a Sources: page, and links it to related encyclopedia pages. Entity extraction uses a staged pipeline: deterministic wikilink detection first, then optional LLM calibration guided by this page's configuration.

Workflow

  1. User fills in source metadata (title, author, date, type, publisher, url, summary) in the ingestion form.
  2. User pastes the source document.
  3. System extracts explicit wikilinks from the document and validates each against the wiki.
  4. If AI extraction is enabled: LLM identifies additional named entities using the entity type schema below.
  5. User reviews the three-group entity list (Known / Suggested / Low Confidence), adds or removes entities.
  6. User confirms. System creates the Sources: page with metadata, Related Pages section, and provenance comment.

Integration Decisions

After source creation, the integration review workflow (see Instructions:Maintenance/Source Integration Review) evaluates how related encyclopedia pages should be updated. Valid decisions per candidate:

no_action: Article already covers this information adequately.
citation_only: Add a citation link only; no content change needed.
citation_with_note: Add citation plus a brief inline note.
expansion_needed: Article requires substantive expansion using this source.
contradiction_review: Source conflicts with existing article content; flag for human review.
new_page: Entity appears in source but has no encyclopedia article yet.
defer: Insufficient information to decide now; revisit when more sources exist.

Most candidates should receive no_action. Only create integration tasks for clear, high-confidence needs.
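
One way to keep decision values consistent is to validate them against the fixed list above. The following Python sketch is illustrative only; the class and function names are hypothetical, and rejecting unknown values with an error is an assumption rather than documented behaviour.

```python
from enum import Enum


class IntegrationDecision(str, Enum):
    """The seven valid per-candidate decisions listed on this page."""
    NO_ACTION = "no_action"
    CITATION_ONLY = "citation_only"
    CITATION_WITH_NOTE = "citation_with_note"
    EXPANSION_NEEDED = "expansion_needed"
    CONTRADICTION_REVIEW = "contradiction_review"
    NEW_PAGE = "new_page"
    DEFER = "defer"


def parse_decision(raw: str) -> IntegrationDecision:
    """Normalise case and whitespace; raises ValueError for unknown values."""
    return IntegrationDecision(raw.strip().lower())


print(parse_decision(" Citation_Only ").value)
```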

Pre-Linking Configuration

PHP reads the values below at ingestion time. Edit these to tune behaviour for your deployment.

source_subtypes
Available source type options for the ingestion form dropdown.
News Article
Interview
Personal Log
Official Statement
Academic Paper
Corporate Advertisement
Government Resolution
Government Report
Propaganda Broadcast
Legal Document
pre_link_min_title_length
Skip wiki titles shorter than this character count. Default: 4.
4
pre_link_stoplist
Wiki page titles to skip even when they appear in a document. These are titles that match too broadly — real words that are also article names but shouldn't be auto-linked.
Source
Project
Help
Template
Category
Sol
Earth
Mars
Energy
Field
Law
Station
pre_link_prefixes
Honorific and title prefixes to strip when matching entity names. One entry per line.
Dr.
Prof.
Cmdr.
Admiral
Captain
Director
Chief
Minister
Secretary
Commissioner
The
A
An
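
A minimal sketch of how the three pre-linking settings above might interact, assuming case-sensitive comparison and a single leading prefix per name. The page says PHP reads these values; Python is used here purely for illustration, and the names <code>strip_prefix</code> and <code>is_linkable_title</code> are hypothetical.

```python
# Values copied from the configuration above.
MIN_TITLE_LENGTH = 4
STOPLIST = {
    "Source", "Project", "Help", "Template", "Category", "Sol",
    "Earth", "Mars", "Energy", "Field", "Law", "Station",
}
PREFIXES = [
    "Dr.", "Prof.", "Cmdr.", "Admiral", "Captain", "Director", "Chief",
    "Minister", "Secretary", "Commissioner", "The", "A", "An",
]


def strip_prefix(name: str) -> str:
    """Remove one leading honorific or article when more text follows it."""
    for prefix in PREFIXES:
        if name.startswith(prefix + " "):
            return name[len(prefix) + 1:].strip()
    return name


def is_linkable_title(title: str) -> bool:
    """Apply the minimum-length and stoplist rules to a wiki page title."""
    return len(title) >= MIN_TITLE_LENGTH and title not in STOPLIST


print(strip_prefix("Director Chen Wei"))  # Chen Wei
print(is_linkable_title("Mars"))          # False
print(is_linkable_title("New Troy"))      # True
```

Requiring a trailing space after the prefix keeps "A" from matching names such as "Arcadia" or "Admiral".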

Entity Ontology

These fields tell the LLM what kinds of named entities Encyclopedia Ephemera tracks, and provide examples to anchor its extraction. Edit examples as the wiki grows.

entity_types
Type schema for LLM entity extraction. Format: TypeName: short description.
People: named individuals — characters, officials, journalists, scientists, historical figures
Places: locations, regions, habitats, stations, settlements, orbital structures, planetary bodies
Organisations: factions, corporations, governments, institutions, fleets, unions, authorities
Events: named incidents, treaties, conflicts, discoveries, programmes, missions, crises
Technologies: named systems, vessel classes, devices, protocols, artefacts, programmes
example_entities
Hand-curated examples per type, used to anchor LLM extraction. Format: TypeName: Example1, Example2, Example3.
People: Alex Chambers, Maya Sato, Director Chen Wei
Places: New Troy, AquaNebula, Arcadia, Yuemin District
Organisations: Jovian Union, MercuryLink, Hegemony Worlds Authority
Events: Yuemin District Unrest
Technologies: Asterion Protocol
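
Both fields above use a <code>TypeName: value</code> line format. A hedged Python sketch of reading the <code>example_entities</code> lines into a lookup table follows; the function name is hypothetical and the actual parser may behave differently.

```python
def parse_example_entities(lines: list[str]) -> dict[str, list[str]]:
    """Parse 'TypeName: Example1, Example2' lines into a type-to-examples map."""
    parsed: dict[str, list[str]] = {}
    for line in lines:
        type_name, sep, rest = line.partition(":")
        if not sep:
            continue  # Skip lines that do not follow the documented format.
        parsed[type_name.strip()] = [
            item.strip() for item in rest.split(",") if item.strip()
        ]
    return parsed


config_lines = [
    "People: Alex Chambers, Maya Sato, Director Chen Wei",
    "Places: New Troy, AquaNebula, Arcadia, Yuemin District",
    "Events: Yuemin District Unrest",
]
print(parse_example_entities(config_lines)["People"])
```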

LLM Extraction

boilerplate_filter_instruction
Appended verbatim to the LLM entity extraction prompt.
Do NOT return: volume numbers, issue numbers, page numbers, journal names, publisher names, citation fragments, partial strings, dates, generic terms, or common English words that are not proper nouns. Do NOT return single letters, abbreviations without clear referents, or entries from the pre-link stoplist above.
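
A rough sketch of how these fields could be combined into an extraction prompt, with the filter instruction appended verbatim as described. The prompt wording, section order, and function name are all assumptions for illustration; only the "appended verbatim" behaviour comes from this page.

```python
def build_extraction_prompt(
    entity_types: list[str],
    example_entities: list[str],
    filter_instruction: str,
    document: str,
) -> str:
    """Assemble a prompt; the layout here is hypothetical."""
    parts = ["Extract named entities of the following types:"]
    parts += [f"- {line}" for line in entity_types]
    parts.append("Examples of each type:")
    parts += [f"- {line}" for line in example_entities]
    parts.append("Document:")
    parts.append(document)
    # The filter instruction goes last and is not modified in any way.
    parts.append(filter_instruction)
    return "\n".join(parts)


prompt = build_extraction_prompt(
    ["People: named individuals", "Places: locations, regions, habitats"],
    ["People: Alex Chambers, Maya Sato", "Places: New Troy, Arcadia"],
    "Do NOT return: volume numbers, issue numbers, page numbers.",
    "Maya Sato arrived in New Troy.",
)
print(prompt)
```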