Import Documents into the Knowledge Base
This guide explains how to bring existing materials (papers, contract clauses, research notes, reports, etc.) into Notez’s local knowledge base so they can be retrieved, cited, extended, and referenced in AI conversations.
1. Supported File Formats
Type | Extensions | Notes |
---|---|---|
Text | .md / .mdx / .txt | Prefer Markdown (clear structure) |
Office | .docx / .doc | Body text only (complex styling simplified) |
PDF | .pdf | Extracts selectable text layer; scanned PDFs need OCR first |
Structured | .csv / .json (planned) | Coming soon for tabular / structured data |
Note: Encrypted PDFs, image scans without a text layer, and DRM-protected files cannot be indexed; convert them first.
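For illustration only, a pre-import check might filter candidate files against the extensions in the table above before enqueuing them. This is a minimal sketch, not Notez’s actual importer; the function and variable names are assumptions:

```typescript
// Hypothetical pre-import filter: keeps only files Notez can currently index.
// The extension list mirrors the table above; .csv/.json are omitted until
// structured-data support ships.
const SUPPORTED_EXTENSIONS = new Set([".md", ".mdx", ".txt", ".docx", ".doc", ".pdf"]);

function splitImportable(paths: string[]): { importable: string[]; skipped: string[] } {
  const importable: string[] = [];
  const skipped: string[] = [];
  for (const p of paths) {
    const dot = p.lastIndexOf(".");
    const ext = dot >= 0 ? p.slice(dot).toLowerCase() : "";
    (SUPPORTED_EXTENSIONS.has(ext) ? importable : skipped).push(p);
  }
  return { importable, skipped };
}
```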
2. Three Import Methods
2.1 Drag & Drop (Fastest)
- Open the Knowledge Base module
- Drag files or folders into the window
- A task queue pops up showing Parse / Chunk / Embedding progress
Best for: Ad‑hoc bulk import of already organized desktop folders.
2.2 Button-Based Selection
- Click “Upload File” or “Upload Folder”
- Multi-select in the system picker (Shift for a range, Cmd for discrete files)
- Confirm to enqueue for parsing
Best for: Carefully choosing a small number of files.
2.3 Directory Sync (Continuous Updates)
- Click “Add Sync Directory”
- Choose a local folder
- After enabling, add / modify / delete events in the folder are watched and re-indexed within a few seconds (see the watcher sketch below)
Best for: Long-lived project repos / research literature folders.
Tip: Moving or renaming many files may trigger rebuild; schedule during idle time.
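Conceptually, directory sync behaves like a recursive file watcher with a short debounce before re-indexing. The sketch below is an assumption-level illustration using Node’s built-in fs.watch; the reindexFile stub stands in for Notez’s real parse/chunk/embed pipeline:

```typescript
import { watch } from "node:fs";
import { join } from "node:path";

// Placeholder for the real pipeline (parse -> chunk -> embed -> index).
async function reindexFile(path: string): Promise<void> {
  console.log("re-indexing", path);
}

// Watch a sync directory and debounce bursts of change events before re-indexing.
// `recursive: true` is available on macOS/Windows (and recent Linux/Node versions).
function watchSyncDirectory(root: string, debounceMs = 2000): void {
  const pending = new Map<string, NodeJS.Timeout>();

  watch(root, { recursive: true }, (_eventType, filename) => {
    if (!filename) return;
    const full = join(root, filename.toString());

    clearTimeout(pending.get(full)); // editors often fire several events per save
    pending.set(
      full,
      setTimeout(() => {
        pending.delete(full);
        void reindexFile(full); // add / modify / delete all funnel into re-indexing
      }, debounceMs)
    );
  });
}
```

The debounce is why bulk moves or renames (see the tip above) can trigger a larger rebuild: every touched path is re-enqueued once the events settle.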
3. Indexing Pipeline Overview
After import each file goes through:
- Parsing: Decode text, clean redundant formatting
- Structure Extraction: Detect headings, lists, sections (best with Markdown/Docx)
- Chunking: Segment by semantics or length to avoid overly long inputs (see the sketch after this list)
- Embedding: Generate vector representations (requires configured embedding model)
- Inverted Index Build (Keyword Index)
- Additional Analysis (optional): Summary / topic tags (when deep search enabled)
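To make the chunking step concrete, here is a minimal length-based chunker with overlap. It is a sketch under stated assumptions: the real pipeline also splits on semantic boundaries, and the size and overlap values are illustrative, not Notez’s actual defaults:

```typescript
// Minimal length-based chunker with overlap. The real pipeline also splits on
// semantic boundaries (headings, paragraphs); sizes here are illustrative only.
interface Chunk {
  text: string;
  start: number; // character offset into the source document
}

function chunkByLength(text: string, maxChars = 1200, overlap = 150): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + maxChars, text.length);
    chunks.push({ text: text.slice(start, end), start });
    if (end === text.length) break;
    start = end - overlap; // overlap preserves context that straddles a boundary
  }
  return chunks;
}
```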
Status indicators:
- Pending / Parsing / Indexing / Completed / Failed
Common failure causes: Corrupted file, no text layer, encoding errors.
6. Update & Deletion Policy
Once uploaded, Notez does not update or delete anything unless you perform a manual action (directory sync is the exception: watched folders are re-indexed automatically, as described in 2.3).
7. Privacy & Locality
- All original files, parse caches, and vectors are stored in the local app data directory
- Relevant context chunks (truncated) leave the device only when an external LLM is called
- Without a configured embedding model, only the keyword index is built (reduced features, fully offline)
8. Integration with AI Features
Feature | How Imported Data Is Used |
---|---|
Smart Continuation | Automatically retrieves similar chunks and merges them into the continuation context |
Chat Q&A | Vector recall + keyword filtering |
Citation Tracing | Returns chunk + original filename + heading anchor |
Selected Text Enhance | Reverse-retrieves surrounding context as supporting evidence |
If a file is never cited, verify that its embedding build has completed (a rough sketch of the retrieval flow follows).
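The “vector recall + keyword filtering” combination in the table can be pictured roughly as follows. This is an illustrative sketch only; cosineSimilarity, the IndexedChunk shape, and the in-memory index are assumptions, not Notez’s actual internals:

```typescript
interface IndexedChunk {
  id: string;
  file: string;
  heading?: string;   // used for citation anchors
  text: string;
  vector: number[];   // produced by the embedding step
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Rough shape of "vector recall + keyword filtering":
// 1) rank all chunks by vector similarity to the query embedding,
// 2) keep those that also contain at least one query keyword,
// 3) return the top-k for the model's context window.
function retrieve(queryVector: number[], keywords: string[], index: IndexedChunk[], k = 5): IndexedChunk[] {
  return index
    .map(chunk => ({ chunk, score: cosineSimilarity(queryVector, chunk.vector) }))
    .filter(({ chunk }) =>
      keywords.length === 0 ||
      keywords.some(w => chunk.text.toLowerCase().includes(w.toLowerCase())))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ chunk }) => chunk);
}
```

A chunk that never makes it into this kind of ranking (for example because its embedding was never built) will never surface as a citation, which is why the check above matters.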
9. FAQ
Q: PDF shows garbled text?
A: Likely an image scan or a custom font. Run OCR first (e.g., `ocrmypdf`).
Q: New file appears very late?
A: Check queue backlog; large files / many concurrent tasks cause waiting. (Priority reordering planned.)
Q: Too much duplicate content harming retrieval?
A: Enable “Duplicate Chunk Folding” in settings or manually consolidate scattered notes.
Q: Usable without an embedding model?
A: Only keyword search; no semantic relevance or smart citation ordering.
Q: After deleting the source file, does the citation persist?
A: The old citation is marked invalid; click it to trigger cleanup.
10. Troubleshooting Quick Reference
Symptom | Steps |
---|---|
All imports fail | Check disk permissions (macOS System Settings > Privacy & Security > Files & Folders) |
Single file fails | Inspect logs; try re-saving as UTF-8 |
Embedding stuck | Verify embedding model URL / Key / Model name |
Chat ignores local data | Check chunk count > 0; ensure “Right Side References” is selected |
Slow speed | Reduce concurrency; split huge PDFs; disable unneeded deep search temporarily |
11. Best Practices Checklist
- Before first bulk import: organize folders to avoid frequent later rebuilds
- Prefer Markdown: strongest structural signals → more precise model citations
- Standardize tag style: lowercase English plus hyphens, e.g. `deep-learning`, `contract-law` (see the normalizer sketch after this checklist)
- Periodically prune obsolete versions to avoid semantic conflicts
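If you want to enforce the tag convention above programmatically, a tiny normalizer along these lines would do; the function name and exact rules are hypothetical, not a Notez API:

```typescript
// Hypothetical tag normalizer matching the convention above:
// lowercase English words joined by hyphens, e.g. "Deep Learning" -> "deep-learning".
function normalizeTag(raw: string): string {
  return raw
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse spaces/punctuation into single hyphens
    .replace(/^-+|-+$/g, "");    // strip leading/trailing hyphens
}
```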
12. Next Steps
After importing you can:
- Try intelligent search: ask a natural language question and inspect results
- Use in-doc continuation and verify citation accuracy
- Configure multiple models and compare response quality
— End of import & indexing guide