# NihAI Ingestion Tools
This doc describes how to preload or grow the NihAI FAQ knowledge base from JSON or Markdown.
## Collections
- NihAI entries live in the `NihAI` collection (or a collection you specify).
- References live in the `Reference` collection and are denormalized as canonical strings in each entry's `references` list.
- Internal FAQ references use `nihai://<uuid>` with `source="NihAI"`.
## Entry Schema (properties)
- `question: str` (required)
- `answer: str` (required)
- `references: list[str]` or list of objects `{title, url, source, content_snippet?}` (optional)
- `is_verified: bool` (default false)
- `approvers: str` (default empty)
- `source: str` (e.g., "import", "md_ingest", "perplexity")
- `accuracy_confidence: float` (0.0–1.0; default 0.5)
- `is_active: bool` (default true)
- `condition_area: str` (optional; e.g., `ANC`)
- `topic: str` (optional; topic area above the question)
An embedding vector is computed from `question + "\n\n" + answer`.
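
For illustration, a minimal sketch of that concatenation, assuming a hypothetical `embed()` callable (the actual embedding model is not described in this doc):

```python
def build_embedding_input(entry: dict) -> str:
    """Build the text that gets embedded: question, a blank line, then answer."""
    return entry["question"] + "\n\n" + entry["answer"]

# `embed` is a placeholder for whatever text-embedding function the service uses.
def embed_entry(entry: dict, embed) -> list[float]:
    return embed(build_embedding_input(entry))
```
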
## JSON Ingestion
File format: a JSON array of entry objects matching the schema above. You can send these via the API at `POST /search/nihai/ingest-json` using the wrapper shape `{ "entries": [...], "collection": "NihAI" }`.
Example:

```json
[
  {
    "question": "What are warning signs of preeclampsia?",
    "answer": "Warning signs include severe headache, vision changes, ...",
    "references": [
      {"title": "ACOG Preeclampsia", "url": "https://www.acog.org/...", "source": "acog.org", "content_snippet": "Practice bulletin ..."}
    ],
    "source": "import",
    "accuracy_confidence": 0.7,
    "condition_area": "ANC",
    "topic": "Hypertensive disorders in pregnancy"
  }
]
```
Request body to the endpoint:

```json
{
  "entries": [
    {
      "question": "What are warning signs of preeclampsia?",
      "answer": "Warning signs include severe headache, vision changes, ...",
      "references": [
        {"title": "ACOG Preeclampsia", "url": "https://www.acog.org/...", "source": "acog.org", "content_snippet": "Practice bulletin ..."}
      ],
      "source": "import",
      "accuracy_confidence": 0.7,
      "condition_area": "ANC",
      "topic": "Hypertensive disorders in pregnancy"
    }
  ],
  "collection": "NihAI"
}
```
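
As a sketch, the same payload can be posted from Python with `requests`; the base URL below is an assumption (the backend dev server runs on port 8000 per the Related section) and any auth is omitted:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: local backend dev server

payload = {
    "entries": [
        {
            "question": "What are warning signs of preeclampsia?",
            "answer": "Warning signs include severe headache, vision changes, ...",
            "source": "import",
            "accuracy_confidence": 0.7,
        }
    ],
    "collection": "NihAI",
}

# POST the wrapper shape to the ingest endpoint described above.
resp = requests.post(f"{BASE_URL}/search/nihai/ingest-json", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```
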
Programmatic usage:

```python
from search.nihai.ingest import insert_nihai_entry, batch_insert_from_json

# Insert a single entry (question and answer are required; other fields use defaults).
insert_nihai_entry({"question": "...", "answer": "..."}, collection="NihAI")

# Bulk-insert every entry from a JSON seed file (same array format as above).
batch_insert_from_json("/abs/path/nihai_seed.json", collection="NihAI")
```
CLI: see the `md_ingest` CLI under "Markdown Ingestion via LLM" below, which auto-detects `.json` files by extension.
## Export to JSON (Git-friendly)
Use the export CLI to dump a collection into the same JSON format accepted by the importer.
What it does:

- Reads all objects via paging and converts stored reference JSON strings back into dicts/strings
- Sorts entries deterministically by `question` (and `topic`) for clean diffs (see the sketch after this list)
- Writes a single JSON array you can check into Git and re-import elsewhere
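
For illustration, the deterministic ordering could be implemented along these lines; the function name and file handling are assumptions, only the sort keys come from the list above:

```python
import json

def write_export(entries: list[dict], path: str) -> None:
    """Write entries as one JSON array in a stable order so re-exports diff cleanly."""
    # Sort by question, then topic, as described above.
    ordered = sorted(entries, key=lambda e: (e.get("question", ""), e.get("topic", "")))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ordered, f, ensure_ascii=False, indent=2)
```
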
CLI:
Virtual environment and env setup:

- macOS/Linux (bash/zsh): activate the project's virtual environment before running the export CLI.
- Ensure `NIHAI_COLLECTION` and the Weaviate connection environment variables are set if you omit `--collection`.
Notes:

- The exported file can be reloaded with `batch_insert_from_json` or the CLI above (see the round-trip sketch below).
- Vectors are recomputed on import; only your properties are stored in the file.
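
As a round-trip sketch, an exported file can be validated and re-imported with `batch_insert_from_json`; the path below is a placeholder:

```python
import json

from search.nihai.ingest import batch_insert_from_json

EXPORT_PATH = "/abs/path/nihai_export.json"  # placeholder: file produced by the export CLI

# Optional sanity check: the export is a plain JSON array of entry objects.
with open(EXPORT_PATH, encoding="utf-8") as f:
    assert isinstance(json.load(f), list)

# Re-import into a target collection; vectors are recomputed during insertion.
batch_insert_from_json(EXPORT_PATH, collection="NihAI")
```
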
## Markdown Ingestion via LLM
Use an LLM to parse a `.md` file and propose canonical Q&A pairs; it will either:

- Insert a new Q&A (action `new_qa`), or
- Link as a reference to an existing close match (action `added_reference`).
CLI (auto-detects by file extension):

```bash
python -m search.nihai.md_ingest ./notes.md --collection NihAI
python -m search.nihai.md_ingest ./seed.json --collection NihAI
```
Requirements:

- `OPENAI_API_KEY` set in the environment.
What gets logged:

- JSON list of actions: `{action: 'new_qa'|'added_reference'|'skipped', question, target_uuid?}`
Notes:

- The parser asks the model to return JSON with items of `{question, answer, references?}`.
- References extracted from markdown can be either strings or objects `{title, url, source, content_snippet?}`; they are normalized and upserted to the `Reference` collection.
- Similarity threshold for linking vs. new insert is currently 0.8 on vector similarity to the top hit (see the sketch after this list).
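
For illustration only, the threshold decision could be sketched as below; `search_similar` and its return shape are hypothetical stand-ins, not the module's actual API:

```python
SIMILARITY_THRESHOLD = 0.8  # vector similarity to the top hit, per the note above

def decide_action(candidate: dict, search_similar) -> dict:
    """Return an action record in the same shape as the logged output."""
    # `search_similar` is assumed to return (top_uuid, similarity) for the candidate text,
    # or (None, 0.0) when there is no close match in the collection.
    top_uuid, similarity = search_similar(candidate["question"] + "\n\n" + candidate["answer"])
    if top_uuid is not None and similarity >= SIMILARITY_THRESHOLD:
        return {"action": "added_reference", "question": candidate["question"], "target_uuid": top_uuid}
    return {"action": "new_qa", "question": candidate["question"]}
```
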
## Related
- NihAI UI — browse/edit collections locally (dev server on `5176`, backend on `8000`).