# NihAI Ingestion Tools
This doc describes how to preload or grow the NihAI FAQ knowledge base from JSON or Markdown.
## Collections
- NihAI entries live in the `NihAI` collection (or a collection you specify).
- References live in the `Reference` collection and are denormalized as canonical strings in each entry's `references` list.
- Internal FAQ references use `nihai://<uuid>` with `source="NihAI"`.
## Entry Schema (properties)
- `question: str` (required)
- `answer: str` (required)
- `references: list[str]` or list of objects `{title, url, source, content_snippet?}` (optional)
- `is_verified: bool` (default false)
- `approvers: str` (default empty)
- `source: str` (e.g., "import", "md_ingest", "perplexity")
- `accuracy_confidence: float` (0.0–1.0; default 0.5)
- `is_active: bool` (default true)
- `condition_area: str` (optional; e.g., `ANC`)
- `topic: str` (optional; topic area above the question)
An embedding vector is computed from `question + "\n\n" + answer`.
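
For illustration, a minimal sketch of that concatenation, assuming a hypothetical `embed()` callable (the actual embedding model is not described in this doc):

```python
def build_embedding_input(entry: dict) -> str:
    """Build the text that gets embedded: question, a blank line, then answer."""
    return entry["question"] + "\n\n" + entry["answer"]

# `embed` is a placeholder for whatever text-embedding function the service uses.
def embed_entry(entry: dict, embed) -> list[float]:
    return embed(build_embedding_input(entry))
```
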
## JSON Ingestion
File format: a JSON array of entry objects matching the schema above. You can send these via the API at `POST /search/nihai/ingest-json` using the wrapper shape `{ "entries": [...], "collection": "NihAI" }`.
Example:

```json
[
  {
    "question": "What are warning signs of preeclampsia?",
    "answer": "Warning signs include severe headache, vision changes, ...",
    "references": [
      {"title": "ACOG Preeclampsia", "url": "https://www.acog.org/...", "source": "acog.org", "content_snippet": "Practice bulletin ..."}
    ],
    "source": "import",
    "accuracy_confidence": 0.7,
    "condition_area": "ANC",
    "topic": "Hypertensive disorders in pregnancy"
  }
]
```
Request body to the endpoint:

```json
{
  "entries": [
    {
      "question": "What are warning signs of preeclampsia?",
      "answer": "Warning signs include severe headache, vision changes, ...",
      "references": [
        {"title": "ACOG Preeclampsia", "url": "https://www.acog.org/...", "source": "acog.org", "content_snippet": "Practice bulletin ..."}
      ],
      "source": "import",
      "accuracy_confidence": 0.7,
      "condition_area": "ANC",
      "topic": "Hypertensive disorders in pregnancy"
    }
  ],
  "collection": "NihAI"
}
```
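
As a sketch, the same payload can be posted from Python with `requests`; the base URL below is an assumption (the backend dev server runs on port 8000 per the Related section) and any auth is omitted:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumption: local backend dev server

payload = {
    "entries": [
        {
            "question": "What are warning signs of preeclampsia?",
            "answer": "Warning signs include severe headache, vision changes, ...",
            "source": "import",
            "accuracy_confidence": 0.7,
        }
    ],
    "collection": "NihAI",
}

# POST the wrapper shape to the ingest endpoint described above.
resp = requests.post(f"{BASE_URL}/search/nihai/ingest-json", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```
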
Programmatic usage:

```python
from search.nihai.ingest import insert_nihai_entry, batch_insert_from_json

# Insert a single entry (question and answer are required; other fields use defaults).
insert_nihai_entry({"question": "...", "answer": "..."}, collection="NihAI")

# Bulk-insert every entry from a JSON seed file (same array format as above).
batch_insert_from_json("/abs/path/nihai_seed.json", collection="NihAI")
```
CLI: see the `md_ingest` CLI under "Markdown Ingestion via LLM" below, which auto-detects `.json` files by extension.
## Export to JSON (Git-friendly)
Use the export CLI to dump a collection into the same JSON format accepted by the importer.
What it does:

- Reads all objects via paging and converts stored reference JSON strings back into dicts/strings
- Sorts entries deterministically by `question` (and `topic`) for clean diffs (see the sketch after this list)
- Writes a single JSON array you can check into Git and re-import elsewhere
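
For illustration, the deterministic ordering could be implemented along these lines; the function name and file handling are assumptions, only the sort keys come from the list above:

```python
import json

def write_export(entries: list[dict], path: str) -> None:
    """Write entries as one JSON array in a stable order so re-exports diff cleanly."""
    # Sort by question, then topic, as described above.
    ordered = sorted(entries, key=lambda e: (e.get("question", ""), e.get("topic", "")))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ordered, f, ensure_ascii=False, indent=2)
```
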
CLI:
Virtual environment and env setup:

- macOS/Linux (bash/zsh): activate the project's virtual environment before running the export CLI.
- Ensure `NIHAI_COLLECTION` and the Weaviate connection environment variables are set if you omit `--collection`.
Notes:

- The exported file can be reloaded with `batch_insert_from_json` or the CLI above (see the round-trip sketch below).
- Vectors are recomputed on import; only your properties are stored in the file.
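
As a round-trip sketch, an exported file can be validated and re-imported with `batch_insert_from_json`; the path below is a placeholder:

```python
import json

from search.nihai.ingest import batch_insert_from_json

EXPORT_PATH = "/abs/path/nihai_export.json"  # placeholder: file produced by the export CLI

# Optional sanity check: the export is a plain JSON array of entry objects.
with open(EXPORT_PATH, encoding="utf-8") as f:
    assert isinstance(json.load(f), list)

# Re-import into a target collection; vectors are recomputed during insertion.
batch_insert_from_json(EXPORT_PATH, collection="NihAI")
```
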
## Markdown Ingestion via LLM
Use an LLM to parse a `.md` file and propose canonical Q&A pairs; it will either:

- Insert a new Q&A (action `new_qa`), or
- Link as a reference to an existing close match (action `added_reference`).
CLI (auto-detects by file extension):

```bash
python -m search.nihai.md_ingest ./notes.md --collection NihAI
python -m search.nihai.md_ingest ./seed.json --collection NihAI
```
Requirements:

- `OPENAI_API_KEY` set in the environment.
What gets logged:

- JSON list of actions: `{action: 'new_qa'|'added_reference'|'skipped', question, target_uuid?}`
Notes:

- The parser asks the model to return JSON with items of `{question, answer, references?}`.
- References extracted from markdown can be either strings or objects `{title, url, source, content_snippet?}`; they are normalized and upserted to the `Reference` collection.
- Similarity threshold for linking vs. new insert is currently 0.8 on vector similarity to the top hit (see the sketch after this list).
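
For illustration only, the threshold decision could be sketched as below; `search_similar` and its return shape are hypothetical stand-ins, not the module's actual API:

```python
SIMILARITY_THRESHOLD = 0.8  # vector similarity to the top hit, per the note above

def decide_action(candidate: dict, search_similar) -> dict:
    """Return an action record in the same shape as the logged output."""
    # `search_similar` is assumed to return (top_uuid, similarity) for the candidate text,
    # or (None, 0.0) when there is no close match in the collection.
    top_uuid, similarity = search_similar(candidate["question"] + "\n\n" + candidate["answer"])
    if top_uuid is not None and similarity >= SIMILARITY_THRESHOLD:
        return {"action": "added_reference", "question": candidate["question"], "target_uuid": top_uuid}
    return {"action": "new_qa", "question": candidate["question"]}
```
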
## Related
- NihAI UI — browse/edit collections locally (dev server on `5176`, backend on `8000`).