Import Documents into the Knowledge Base

This guide explains how to bring existing materials (papers, clauses, research notes, reports, etc.) into Notez’s local knowledge base so they can be retrieved, cited, extended, and referenced in AI conversations.

1. Supported File Formats

Type	Extensions	Notes
Text	.md / .mdx / .txt	Prefer Markdown (clear structure)
Office	.docx / .doc	Body text only (complex styling simplified)
PDF	.pdf	Extracts selectable text layer; scanned PDFs need OCR first
Structured	.csv / .json (planned)	Coming soon for tabular / structured data

Note: Encrypted PDFs, image scans without a text layer, and DRM-protected files cannot be indexed; convert them to a supported format first.
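
As a rough illustration of how the table above could be applied before an import, the sketch below sorts file paths into supported, planned, and unsupported groups by extension. The function and constant names are hypothetical and are not part of Notez.

```python
from pathlib import Path

# Extension groups copied from the table above; the names are illustrative only.
SUPPORTED = {
    "text": {".md", ".mdx", ".txt"},
    "office": {".docx", ".doc"},
    "pdf": {".pdf"},
}
PLANNED = {".csv", ".json"}  # listed as "planned" in the table


def classify(path: str) -> str:
    """Return the format group for a path, or 'planned' / 'unsupported'."""
    ext = Path(path).suffix.lower()
    for kind, extensions in SUPPORTED.items():
        if ext in extensions:
            return kind
    if ext in PLANNED:
        return "planned"
    return "unsupported"
```

An extension check cannot detect encrypted or scan-only PDFs, so files that pass it can still fail during parsing.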

2. Three Import Methods

2.1 Drag & Drop (Fastest)

Open the Knowledge Base module
Drag files or folders into the window
A task queue pops up → shows Parse / Chunk / Embedding progress

Best for: Ad‑hoc bulk import of already organized desktop folders.

2.2 Button-Based Selection

Click “Upload File” or “Upload Folder”
Multi-select in the system picker (Shift-click for a range, Cmd-click for individual files)
Confirm to enqueue for parsing

Best for: Carefully choosing a small number of files.

2.3 Directory Sync (Continuous Updates)

Click “Add Sync Directory”
Choose a local folder
Once enabled, the folder is watched for added, modified, and deleted files, and changes are re-indexed after a delay of a few seconds

Best for: Long-lived project repos / research literature folders.
Tip: Moving or renaming many files may trigger a rebuild; schedule such changes during idle time.
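
To make the add / modify / delete detection concrete, here is a minimal sketch that compares two snapshots of a folder. It assumes a simple polling approach based on modification time and size; how Notez actually watches directories is not documented here, and all names are hypothetical.

```python
import os
from typing import Dict, List, Tuple

# path -> (modification time in ns, size in bytes)
Snapshot = Dict[str, Tuple[int, int]]


def take_snapshot(root: str) -> Snapshot:
    """Record mtime and size for every file under root."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # file disappeared between listing and stat
            snapshot[full] = (st.st_mtime_ns, st.st_size)
    return snapshot


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Classify paths as added, modified, or deleted between two snapshots."""
    return {
        "added": sorted(p for p in new if p not in old),
        "modified": sorted(p for p in new if p in old and new[p] != old[p]),
        "deleted": sorted(p for p in old if p not in new),
    }
```

In this sketch a renamed file shows up as one deletion plus one addition, which illustrates why moving or renaming many files at once can cause a lot of re-indexing work.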

3. Indexing Pipeline Overview

After import, each file goes through:

Parsing: Decode text, clean redundant formatting
Structure Extraction: Detect headings, lists, sections (best with Markdown/Docx)
Chunking: Segment by semantics or length (avoid overly long inputs)
Embedding: Generate vector representations (requires configured embedding model)
Inverted Index Build: Construct the keyword index
Additional Analysis (optional): Summary / topic tags (when deep search enabled)
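
The chunking and keyword-index stages can be sketched as follows. This is a simplified illustration, assuming length-based packing of paragraphs and a plain word-level inverted index; Notez's real chunker and index format are not specified in this guide, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in re.split(r"\n\s*\n", text)):
        if not para:
            continue
        # A paragraph longer than the limit is hard-split into fixed-size pieces.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks


def build_inverted_index(chunks: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercase word to the ids of the chunks that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for token in re.findall(r"\w+", chunk.lower()):
            index[token].add(chunk_id)
    return dict(index)
```

Splitting on paragraph boundaries first keeps related sentences together, which is one reason well-structured Markdown tends to chunk more cleanly than flat text.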

Status indicators:

Pending / Parsing / Indexing / Completed / Failed
Common failure causes: Corrupted file, no text layer, encoding errors.

6. Update & Deletion Policy

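Files imported by drag & drop or the upload buttons are never updated or deleted unless you do so manually. Sync directories (section 2.3) are the exception: changes in a watched folder are re-indexed automatically.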

7. Privacy & Locality

All original files, parse caches, and vectors are stored in the local app data directory
Data leaves the device only when an external LLM is called, and then only truncated, relevant context chunks are sent
Without a configured embedding model, only the keyword index is built (reduced features, fully offline)

8. Integration with AI Features

Feature	How Imported Data Is Used
Smart Continuation	Automatically retrieves similar chunks and merges
Chat Q&A	Vector recall + keyword filtering
Citation Tracing	Returns chunk + original filename + heading anchor
Selected Text Enhance	Reverse-retrieves surrounding context as supporting evidence

If a file is never cited, verify that its embedding build has completed.
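
The "vector recall + keyword filtering" combination used by Chat Q&A might look roughly like the sketch below. It assumes cosine similarity over precomputed vectors followed by a simple substring filter; the actual ranking logic in Notez is not documented here, and the data layout and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(
    query_vec: Sequence[float],
    query_terms: List[str],
    chunks: List[Dict],
    recall_k: int = 10,
    top_k: int = 3,
) -> List[Dict]:
    """Recall the nearest chunks by vector, then keep those matching a keyword."""
    # Step 1: vector recall, most similar first.
    ranked = sorted(
        chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )
    recalled = ranked[:recall_k]
    # Step 2: keyword filter, keep chunks containing at least one query term.
    terms = [t.lower() for t in query_terms]
    hits = [c for c in recalled if any(t in c["text"].lower() for t in terms)]
    return hits[:top_k]
```

Each returned chunk carries its source filename, which is the kind of information Citation Tracing would need to point back to the original document.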

9. FAQ

Q: PDF shows garbled text?
A: Likely an image scan or custom font. Run OCR first (e.g., ocrmypdf).
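
For batch preprocessing, the OCR step could be scripted along these lines. This is a sketch that assumes the ocrmypdf command-line tool is installed and on PATH; the helper names are hypothetical, and the default language is an assumption you should adjust to your documents.

```python
import shutil
import subprocess
from typing import List


def build_ocr_command(src: str, dst: str, language: str = "eng") -> List[str]:
    """Build an ocrmypdf invocation that adds a text layer to a scanned PDF."""
    # --skip-text leaves pages that already contain text untouched.
    return ["ocrmypdf", "--skip-text", "-l", language, src, dst]


def run_ocr(src: str, dst: str, language: str = "eng") -> None:
    """Run OCR on src and write the searchable PDF to dst."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(src, dst, language), check=True)
```

Import the output file rather than the original scan once OCR has finished.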

Q: New file appears very late?
A: Check queue backlog; large files / many concurrent tasks cause waiting. (Priority reordering planned.)

Q: Too much duplicate content harming retrieval?
A: Enable “Duplicate Chunk Folding” in settings or manually consolidate scattered notes.
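
To illustrate the idea behind folding duplicate chunks, the sketch below groups chunks whose text is identical after normalizing case and whitespace. How the "Duplicate Chunk Folding" setting is implemented in Notez is not described in this guide, so treat this purely as an assumption; it only catches exact duplicates, not near-duplicates.

```python
import hashlib
from typing import Dict, List, Tuple


def fold_duplicates(chunks: List[Tuple[str, str]]) -> List[Dict]:
    """Fold (source, text) pairs with identical normalized text into one entry."""
    folded: Dict[str, Dict] = {}
    for source, text in chunks:
        # Normalize so that case and spacing differences do not hide duplicates.
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in folded:
            folded[key]["sources"].append(source)
        else:
            folded[key] = {"text": text, "sources": [source]}
    # Dicts preserve insertion order, so first occurrences keep their order.
    return list(folded.values())
```

Keeping the list of sources for each folded chunk means a citation can still name every file in which the passage appears.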

Q: Usable without an embedding model?
A: Only keyword search; no semantic relevance or smart citation ordering.

Q: Citation persists after deleting the source file?
A: The old citation is marked invalid; click it to trigger cleanup.

10. Troubleshooting Quick Reference

Symptom	Steps
All imports fail	Check disk permissions (macOS System Settings > Privacy & Security > Files & Folders)
Single file fails	Inspect logs; try re-saving as UTF-8
Embedding stuck	Verify embedding model URL / Key / Model name
Chat ignores local data	Check chunk count > 0; ensure “Right Side References” is selected
Slow speed	Reduce concurrency; split huge PDFs; disable unneeded deep search temporarily
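
For the "try re-saving as UTF-8" step, a small script can help when many files are affected. The sketch below tries a short list of candidate encodings in order; the list is an assumption and may need adjusting for your files, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Tuple

# Tried in order; latin-1 accepts any byte sequence, so it must come last.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def decode_best_effort(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")  # not reachable with latin-1


def resave_as_utf8(src: str, dst: str) -> str:
    """Write a UTF-8 copy of src to dst and return the encoding that was detected."""
    text, used = decode_best_effort(Path(src).read_bytes())
    Path(dst).write_text(text, encoding="utf-8")
    return used
```

A successful decode does not guarantee the guess was right, so skim the converted file for odd characters before re-importing it.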

11. Best Practices Checklist

Before first bulk import: organize folders to avoid frequent later rebuilds
Prefer Markdown: strongest structural signals → more precise model citations
Standardize tag style: lowercase English words joined by hyphens, e.g. deep-learning, contract-law
Periodically prune obsolete versions to avoid semantic conflicts
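
If existing notes use mixed tag styles, a small helper along these lines could bring them into the recommended lowercase-hyphen form before import. The function is a hypothetical sketch, not a Notez feature.

```python
import re


def normalize_tag(raw: str) -> str:
    """Convert a free-form tag to lowercase words joined by single hyphens."""
    tag = raw.strip().lower()
    tag = re.sub(r"[\s_]+", "-", tag)     # spaces and underscores become hyphens
    tag = re.sub(r"[^a-z0-9-]", "", tag)  # drop anything else
    tag = re.sub(r"-{2,}", "-", tag)      # collapse repeated hyphens
    return tag.strip("-")
```

For example, "Deep Learning" and "deep_learning" both become deep-learning, so they are treated as the same tag.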

12. Next Steps

After importing you can:

Try intelligent search: ask a natural language question and inspect results
Use in-doc continuation and verify citation accuracy
Configure multiple models and compare response quality
