Instructions:Create/Source/Ingest: Difference between revisions
Latest revision as of 16:29, 12 May 2026
| Instruction Metadata | |
|---|---|
| id | create-source-ingest |
| type | workflow |
| applies_to | ingest |
| task_type | |
| priority | high |
| status | active |
| canonical | true |
| include_by_default | yes |
| requires | |
| tags | source,ingest,entities,pre-linking |
Summary
Workflow for ingesting source documents into Encyclopedia Ephemera. Extracts named entities, creates a Sources: page, and links it to related encyclopedia pages. Entity extraction uses a staged pipeline: deterministic wikilink detection first, then optional LLM calibration guided by this page's configuration.
Workflow
- User fills in source metadata (title, author, date, type, publisher, url, summary) in the ingestion form.
- User pastes the source document.
- System extracts explicit [[wikilinks]] from the document and validates each against the wiki.
- If AI extraction is enabled: LLM identifies additional named entities using the entity type schema below.
- User reviews the three-group entity list (Known / Suggested / Low Confidence), adds or removes entities.
- User confirms. System creates the Sources: page with metadata, Related Pages section, and provenance comment.
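The deterministic first stage (explicit wikilink detection and validation) can be sketched as follows. This is an illustrative Python sketch, not the deployment's PHP code: the function names, the regular expression, and the use of a plain set of existing titles are all assumptions made for the example. Exact string matching is also a simplification, since a real wiki normalises title case and underscores.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]].
# Only the target (text before the first '|' or '#') is captured.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_wikilinks(text):
    """Return unique link targets in order of first appearance."""
    seen = []
    for match in WIKILINK_RE.finditer(text):
        target = " ".join(match.group(1).split())  # collapse stray whitespace
        if target and target not in seen:
            seen.append(target)
    return seen


def validate_targets(targets, existing_titles):
    """Split targets into those matching an existing page title and the rest."""
    known = [t for t in targets if t in existing_titles]
    missing = [t for t in targets if t not in existing_titles]
    return known, missing


# Hypothetical sample document and title set, for illustration only.
sample = (
    "Riots in [[Yuemin District]] spread; the [[Jovian Union|Union]] replied. "
    "See [[Yuemin District#History]] and [[Unknown Place]]."
)
targets = extract_wikilinks(sample)
known, missing = validate_targets(targets, {"Yuemin District", "Jovian Union"})
```

How validated and unvalidated targets map onto the Known / Suggested / Low Confidence groups is not specified on this page, so the sketch only separates matches from non-matches.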
Integration Decisions
After source creation, the integration review workflow (see Instructions:Maintenance/Source Integration Review) evaluates how related encyclopedia pages should be updated. Valid decisions per candidate:
- no_action
- Article already covers this information adequately.
- citation_only
- Add a citation link only; no content change needed.
- citation_with_note
- Add citation plus a brief inline note.
- expansion_needed
- Article requires substantive expansion using this source.
- contradiction_review
- Source conflicts with existing article content; flag for human review.
- new_page
- Entity appears in source but has no encyclopedia article yet.
- defer
- Insufficient information to decide now; revisit when more sources exist.
Most candidates should receive no_action. Only create integration tasks for clear, high-confidence needs.
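The decision list above maps naturally onto an enumeration, which lets a reviewer tool reject values outside the list. The following Python sketch is illustrative only. In particular, treating both no_action and defer as creating no integration task is an assumption; the page itself only says that most candidates should receive no_action.

```python
from enum import Enum


class IntegrationDecision(Enum):
    NO_ACTION = "no_action"
    CITATION_ONLY = "citation_only"
    CITATION_WITH_NOTE = "citation_with_note"
    EXPANSION_NEEDED = "expansion_needed"
    CONTRADICTION_REVIEW = "contradiction_review"
    NEW_PAGE = "new_page"
    DEFER = "defer"


# Assumption: decisions that leave the article untouched for now create no task.
NO_TASK = {IntegrationDecision.NO_ACTION, IntegrationDecision.DEFER}


def parse_decision(raw):
    """Validate a decision string; raises ValueError for anything outside the list."""
    return IntegrationDecision(raw.strip().lower())


def tasks_to_create(candidates):
    """candidates: iterable of (page_title, decision_string) pairs.

    Returns the (page_title, decision) pairs that need an integration task.
    """
    tasks = []
    for title, raw in candidates:
        decision = parse_decision(raw)
        if decision not in NO_TASK:
            tasks.append((title, decision))
    return tasks


# Hypothetical review outcome, for illustration only.
result = tasks_to_create([
    ("New Troy", "no_action"),
    ("Jovian Union", "citation_only"),
    ("Arcadia", "defer"),
])
```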
Pre-Linking Configuration
PHP reads the values below at ingestion time. Edit these to tune behaviour for your deployment.
- source_subtypes
- Available source type options for the ingestion form dropdown.
- News Article
- Interview
- Personal Log
- Official Statement
- Academic Paper
- Corporate Advertisement
- Government Resolution
- Government Report
- Propaganda Broadcast
- Legal Document
- pre_link_min_title_length
- Skip wiki titles shorter than this character count. Default: 4.
- 4
- pre_link_stoplist
- Wiki page titles to skip even when they appear in a document. These are titles that match too broadly — real words that are also article names but shouldn't be auto-linked.
- Source
- Project
- Help
- Template
- Category
- Sol
- Earth
- Mars
- Energy
- Field
- Law
- Station
- pre_link_prefixes
- Honorific and title prefixes to strip when matching entity names. One entry per line.
- Dr.
- Prof.
- Cmdr.
- Admiral
- Captain
- Director
- Chief
- Minister
- Secretary
- Commissioner
- The
- A
- An
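A minimal Python sketch of how the three pre-linking values above might be applied. The page states that PHP reads these values, so this is an illustration of the logic rather than the deployed code. The function names are hypothetical, and three behaviours are assumptions: prefixes are matched as whole, case-sensitive words; stripping repeats until no prefix remains; and a name is never reduced to nothing.

```python
MIN_TITLE_LENGTH = 4
STOPLIST = {
    "Source", "Project", "Help", "Template", "Category", "Sol",
    "Earth", "Mars", "Energy", "Field", "Law", "Station",
}
PREFIXES = [
    "Dr.", "Prof.", "Cmdr.", "Admiral", "Captain", "Director", "Chief",
    "Minister", "Secretary", "Commissioner", "The", "A", "An",
]


def is_linkable_title(title):
    """Apply pre_link_min_title_length and pre_link_stoplist to a page title."""
    return len(title) >= MIN_TITLE_LENGTH and title not in STOPLIST


def strip_prefixes(name):
    """Remove leading honorifics and articles from an entity name."""
    words = name.split()
    # Keep at least one word so a bare prefix is not reduced to an empty string.
    while len(words) > 1 and words[0] in PREFIXES:
        words = words[1:]
    return " ".join(words)
```

With these definitions, `strip_prefixes("Director Chen Wei")` yields `"Chen Wei"`, so the name can be matched against a page titled without the honorific, while `is_linkable_title("Mars")` is false because the title is on the stoplist.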
Entity Ontology
These fields tell the LLM what kinds of named entities Encyclopedia Ephemera tracks, and provide examples to anchor its extraction. Edit examples as the wiki grows.
- entity_types
- Type schema for LLM entity extraction. Format: TypeName: short description.
- People: named individuals, including characters, officials, journalists, scientists, historical figures
- Places: locations, regions, habitats, stations, settlements, orbital structures, planetary bodies
- Organisations: factions, corporations, governments, institutions, fleets, unions, authorities
- Events: named incidents, treaties, conflicts, discoveries, programmes, missions, crises
- Technologies: named systems, vessel classes, devices, protocols, artefacts, programmes
- example_entities
- Hand-curated examples per type, used to anchor LLM extraction. Format: TypeName: Example1, Example2, Example3.
- People: Alex Chambers, Maya Sato, Director Chen Wei
- Places: New Troy, AquaNebula, Arcadia, Yuemin District
- Organisations: Jovian Union, MercuryLink, Hegemony Worlds Authority
- Events: Yuemin District Unrest
- Technologies: Asterion Protocol
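Both fields above use a `TypeName: value` line format, so one parser can serve both. The Python sketch below shows one way to read the lines and render them as a prompt fragment. The function names and the rendered layout (examples in an "e.g." suffix) are assumptions for illustration; the page does not say how the values are presented to the LLM.

```python
def parse_typed_lines(lines):
    """Parse 'TypeName: value' lines into a dict, splitting on the first colon only."""
    parsed = {}
    for line in lines:
        if ":" not in line:
            continue  # skip malformed lines rather than guessing
        type_name, value = line.split(":", 1)
        parsed[type_name.strip()] = value.strip()
    return parsed


def parse_examples(lines):
    """Parse example_entities lines into {TypeName: [example, ...]}."""
    return {
        type_name: [e.strip() for e in value.split(",") if e.strip()]
        for type_name, value in parse_typed_lines(lines).items()
    }


def render_ontology(types, examples):
    """Render one line per type, adding examples where any are configured."""
    rendered = []
    for type_name, description in types.items():
        line = f"{type_name}: {description}"
        if examples.get(type_name):
            line += " (e.g. " + ", ".join(examples[type_name]) + ")"
        rendered.append(line)
    return "\n".join(rendered)


# Shortened inputs taken from the lists above, for illustration only.
types = parse_typed_lines([
    "Events: named incidents, treaties",
    "Technologies: named systems",
])
examples = parse_examples([
    "Events: Yuemin District Unrest",
    "People: Alex Chambers, Maya Sato, Director Chen Wei",
])
ontology_text = render_ontology(types, examples)
```

Splitting on the first colon only matters because descriptions and examples may themselves contain punctuation; a type with no configured examples is rendered with its description alone.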
LLM Extraction
- boilerplate_filter_instruction
- Appended verbatim to the LLM entity extraction prompt.
- Do NOT return: volume numbers, issue numbers, page numbers, journal names, publisher names, citation fragments, partial strings, dates, generic terms, or common English words that are not proper nouns. Do NOT return single letters, abbreviations without clear referents, or entries from the pre-link stoplist above.
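As a rough illustration of "appended verbatim", the Python sketch below assembles a prompt and places the filter instruction, unchanged, at the end. The function name, the section labels, and the ordering of the other parts are assumptions; the page only states that the instruction is appended to the extraction prompt.

```python
def build_extraction_prompt(base_instruction, ontology_text, document, boilerplate_filter):
    """Join the prompt parts; the filter instruction is appended without modification."""
    parts = [
        base_instruction,
        "Entity types:\n" + ontology_text,
        "Document:\n" + document,
        boilerplate_filter,  # appended as-is: no trimming, rewording, or templating
    ]
    return "\n\n".join(parts)


# Hypothetical inputs, for illustration only.
filter_text = "Do NOT return: volume numbers, issue numbers, page numbers."
prompt = build_extraction_prompt(
    "List the named entities in the document.",
    "Events: named incidents, treaties",
    "The Yuemin District Unrest began at dawn.",
    filter_text,
)
```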