Import Documents into the Knowledge Base

Last updated: 2025-09-02

This guide explains how to bring existing materials (papers, clauses, research notes, reports, etc.) into Notez’s local knowledge base so they can be retrieved, cited, extended, and referenced in AI conversations.

1. Supported File Formats

Type       | Extensions               | Notes
-----------|--------------------------|------------------------------------------------------------
Text       | .md / .mdx / .txt        | Prefer Markdown (clear structure)
Office     | .docx / .doc             | Body text only (complex styling simplified)
PDF        | .pdf                     | Extracts selectable text layer; scanned PDFs need OCR first
Structured | .csv / .json (planned)   | Coming soon for tabular / structured data

Note: Encrypted PDFs, image scans without a text layer, and DRM-protected files cannot be indexed. Convert them to a supported, unprotected format first.
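
To pre-screen a folder before importing, the extension list above can be expressed as a small lookup. This is a minimal sketch, not Notez's own logic: the group names and the `classify` helper are hypothetical, and only the extensions come from this guide.

```python
from pathlib import Path

# Extensions copied from the format list above; group names are illustrative.
SUPPORTED = {
    "text": {".md", ".mdx", ".txt"},
    "office": {".docx", ".doc"},
    "pdf": {".pdf"},
}
PLANNED = {".csv", ".json"}  # listed as planned, so not importable yet


def classify(path: str) -> str:
    """Return the format group for a file, or 'planned' / 'unsupported'."""
    ext = Path(path).suffix.lower()  # compare case-insensitively
    for group, extensions in SUPPORTED.items():
        if ext in extensions:
            return group
    return "planned" if ext in PLANNED else "unsupported"


if __name__ == "__main__":
    for name in ["notes.md", "Contract.DOCX", "paper.pdf", "data.csv", "scan.png"]:
        print(f"{name}: {classify(name)}")
```

An extension check cannot detect encrypted or image-only PDFs; those still have to be opened to find out.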

2. Three Import Methods

2.1 Drag & Drop (Fastest)

  1. Open the Knowledge Base module
  2. Drag files or folders into the window
  3. A task queue opens and shows Parse / Chunk / Embedding progress

Best for: Ad‑hoc bulk import of already organized desktop folders.

2.2 Button-Based Selection

  1. Click “Upload File” or “Upload Folder”
  2. Multi-select in the system picker (Shift-click for a range, Cmd-click for individual files)
  3. Confirm to enqueue for parsing

Best for: Carefully choosing a small number of files.

2.3 Directory Sync (Continuous Updates)

  1. Click “Add Sync Directory”
  2. Choose a local folder
  3. Once enabled, Add / Modify / Delete events in the folder are watched and the affected files are re-indexed (with a delay of a few seconds)

Best for: Long-lived project repos / research literature folders.
Tip: Moving or renaming many files may trigger a rebuild; schedule such changes for idle time.
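
How a sync directory turns file changes into Add / Modify / Delete events can be illustrated with a snapshot comparison. This guide does not describe Notez's watcher, so the sketch below swaps in simple polling on modification times; `snapshot` and `diff` are hypothetical helper names.

```python
import os
from pathlib import Path


def snapshot(root: str) -> dict[str, float]:
    """Map each file under root (by relative path) to its last-modified time."""
    result = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = Path(dirpath) / name
            result[str(full.relative_to(root))] = full.stat().st_mtime
    return result


def diff(old: dict[str, float], new: dict[str, float]) -> dict[str, list[str]]:
    """Classify the changes between two snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        # Present in both snapshots but with a different timestamp.
        "modified": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

In a path-based scheme like this one, a rename shows up as one delete plus one add, which gives some intuition for why bulk moves and renames can be expensive to re-index.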

3. Indexing Pipeline Overview

After import, each file goes through:

  1. Parsing: Decode text, clean redundant formatting
  2. Structure Extraction: Detect headings, lists, sections (best with Markdown/Docx)
  3. Chunking: Segment by semantics or length (avoid overly long inputs)
  4. Embedding: Generate vector representations (requires configured embedding model)
  5. Inverted Index Build (Keyword Index)
  6. Additional Analysis (optional): Summary / topic tags (when deep search enabled)
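
The structure extraction and chunking steps above can be sketched as follows. This is an illustration under stated assumptions, not Notez's chunker: it assumes Markdown-style `#` headings, and the `chunk_markdown` name and the 800-character cap are hypothetical.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$")  # Markdown headings, levels 1 to 6


def chunk_markdown(text: str, max_chars: int = 800) -> list[dict]:
    """Split text into chunks that start at headings and respect a length cap."""
    chunks = []
    heading, buffer = "", []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"heading": heading, "text": body})
        buffer.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # a new heading closes the previous chunk
            heading = match.group(1).strip()
            continue
        # Length cap: close the chunk before this line would push it past max_chars.
        # A single line longer than max_chars is kept whole, not split.
        if buffer and sum(len(l) + 1 for l in buffer) + len(line) > max_chars:
            flush()
        buffer.append(line)
    flush()
    return chunks
```

Keeping the heading with each chunk is what makes a heading anchor available later when a chunk is cited.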

Status indicators:

  • Pending / Parsing / Indexing / Completed / Failed
    Common failure causes: Corrupted file, no text layer, encoding errors.

6. Update & Deletion Policy

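Files imported by drag & drop or the upload buttons are never updated or deleted unless the user performs a manual action. Folders added through Directory Sync (section 2.3) are the exception: changes there are re-indexed automatically.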

7. Privacy & Locality

  • All original files, parse caches, and vectors are stored in the local app data directory
  • Data leaves the device only when an external LLM is called, and then only the relevant context chunks (truncated) are sent
  • Without a configured model, only the keyword index is built (reduced features, fully offline)

8. Integration with AI Features

Feature               | How Imported Data Is Used
----------------------|-------------------------------------------------------------
Smart Continuation    | Automatically retrieves similar chunks and merges
Chat Q&A              | Vector recall + keyword filtering
Citation Tracing      | Returns chunk + original filename + heading anchor
Selected Text Enhance | Reverse-retrieves surrounding context as supporting evidence

If a file is never cited, verify that its embedding build has completed.
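
The "vector recall + keyword filtering" combination used by Chat Q&A can be pictured with a toy example. This is a sketch only: the `retrieve` function, the chunk fields, and the over-fetch factor are assumptions for illustration, and real embeddings have far more than two dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, query_terms, chunks, top_k=3):
    """Rank chunks by vector similarity, then keep those sharing a query keyword."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    recalled = ranked[: top_k * 2]  # over-fetch so filtering still leaves results
    terms = {t.lower() for t in query_terms}
    kept = [c for c in recalled if terms & set(c["text"].lower().split())]
    return kept[:top_k]


if __name__ == "__main__":
    chunks = [
        {"text": "force majeure clause applies", "vector": [1.0, 0.0]},
        {"text": "gradient descent converges", "vector": [0.0, 1.0]},
        {"text": "termination clause notice", "vector": [0.9, 0.1]},
    ]
    for hit in retrieve([1.0, 0.0], ["clause"], chunks, top_k=2):
        print(hit["text"])
```

A chunk without a vector can never be recalled in a scheme like this, which is why an incomplete embedding build leads to files that are never cited.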

9. FAQ

Q: A PDF shows garbled text?
A: It is likely an image scan or uses a custom font. Run OCR first (e.g., ocrmypdf).
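
A possible way to script the OCR step before import is shown below. It assumes ocrmypdf is installed and on PATH; the `build_ocr_command` and `run_ocr` helpers are hypothetical wrappers, and `--skip-text` asks ocrmypdf to leave pages that already contain text untouched.

```python
import shutil
import subprocess


def build_ocr_command(src: str, dst: str) -> list[str]:
    """Build an ocrmypdf invocation: options first, then input and output files."""
    return ["ocrmypdf", "--skip-text", src, dst]


def run_ocr(src: str, dst: str) -> bool:
    """Run OCR if ocrmypdf is available; return True on success."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not installed, nothing was run
    return subprocess.run(build_ocr_command(src, dst)).returncode == 0
```

Import the output file rather than the original scan, since only the output carries a text layer.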

Q: A new file takes a long time to appear?
A: Check the queue backlog; large files or many concurrent tasks cause waiting. (Priority reordering is planned.)

Q: Too much duplicate content is harming retrieval?
A: Enable “Duplicate Chunk Folding” in settings, or manually consolidate scattered notes.
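
This guide does not say how "Duplicate Chunk Folding" works internally. One plausible approach, shown here purely as an illustration, is to hash each chunk after normalizing case and whitespace and keep a single copy per hash; `fingerprint` and `fold_duplicates` are hypothetical names.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a chunk after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fold_duplicates(chunks: list[str]) -> list[dict]:
    """Keep the first copy of each chunk and count the copies folded into it."""
    folded: dict[str, dict] = {}
    for chunk in chunks:
        key = fingerprint(chunk)
        if key in folded:
            folded[key]["copies"] += 1
        else:
            folded[key] = {"text": chunk, "copies": 1}
    return list(folded.values())
```

Exact hashing only catches copies that differ in case or spacing; near-duplicates with edited wording still need manual consolidation.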

Q: Is it usable without an embedding model?
A: Yes, but with keyword search only; there is no semantic relevance or smart citation ordering.

Q: A citation persists after its source file is deleted?
A: The old citation is marked invalid; click it to trigger cleanup.

10. Troubleshooting Quick Reference

Symptom                 | Steps
------------------------|--------------------------------------------------------------------------------------
All imports fail        | Check disk permissions (macOS System Settings > Privacy & Security > Files & Folders)
Single file fails       | Inspect logs; try re-saving as UTF-8
Embedding stuck         | Verify embedding model URL / Key / Model name
Chat ignores local data | Check chunk count > 0; ensure “Right Side References” is selected
Slow speed              | Reduce concurrency; split huge PDFs; disable unneeded deep search temporarily
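
For the "re-save as UTF-8" step, a small script can stand in for a text editor. This is a sketch under assumptions: the candidate encoding list is a guess that should be adjusted to where your files come from, and both helper names are hypothetical.

```python
from pathlib import Path

# Encodings to try, in order. This list is an assumption; latin-1 accepts any
# byte sequence, so it acts as a last resort and may produce wrong characters.
CANDIDATES = ("utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(raw: bytes, candidates=CANDIDATES) -> tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


def resave_as_utf8(src: str, dst: str) -> str:
    """Rewrite src as UTF-8 at dst and report which encoding was detected."""
    text, encoding = decode_with_fallback(Path(src).read_bytes())
    Path(dst).write_text(text, encoding="utf-8")
    return encoding
```

Check the converted file by eye before importing, since a successful decode does not prove the guessed encoding was the right one.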

11. Best Practices Checklist

  • Before first bulk import: organize folders to avoid frequent later rebuilds
  • Prefer Markdown: strongest structural signals → more precise model citations
  • Standardize tag style: lowercase English + hyphen, e.g. deep-learning, contract-law
  • Periodically prune obsolete versions to avoid semantic conflicts
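
The tag convention above (lowercase English + hyphen) is easy to enforce with a small normalizer. The `normalize_tag` helper is hypothetical and only illustrates the rule; it is not a Notez function.

```python
import re


def normalize_tag(raw: str) -> str:
    """Lowercase a tag and join its words with single hyphens."""
    tag = raw.strip().lower()
    tag = re.sub(r"[\s_]+", "-", tag)      # spaces and underscores become hyphens
    tag = re.sub(r"[^a-z0-9-]", "", tag)   # drop anything outside a-z, 0-9, hyphen
    return re.sub(r"-{2,}", "-", tag).strip("-")  # collapse and trim hyphens


if __name__ == "__main__":
    for raw in ["Deep Learning", "contract_law", "  NLP -- Basics "]:
        print(f"{raw!r} -> {normalize_tag(raw)}")
```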

12. Next Steps

After importing you can:

  • Try intelligent search: ask a natural language question and inspect results
  • Use in-doc continuation and verify citation accuracy
  • Configure multiple models and compare response quality
