Documentation

Sources

Sources are the data inputs that populate your knowledge bases with up-to-date information. Seclai automatically fetches, processes, and indexes content from various sources, keeping your knowledge bases current without manual intervention.

What are Sources?

A source is a connection to external content that you want to make searchable. When you add a source to a knowledge base:

  1. Initial Polling: Seclai fetches content based on your seeding preferences
  2. Content Processing: Text is extracted, chunked, and embedded for vector search
  3. Automatic Updates: Content is refreshed based on your polling schedule
  4. Multi-Phase Processing: Audio/video is transcribed, and all content is indexed

Sources can be shared across multiple knowledge bases or organizations, with each connection maintaining independent settings for polling, retention, and indexing.

Source Types

RSS Feeds

RSS feeds automatically pull content from blogs, podcasts, news sites, and other syndicated content.

Best For:

  • Blog posts and articles
  • Podcast episodes
  • News feeds
  • YouTube channels (via RSS)

Features:

  • Automatic detection of RSS/Atom feeds
  • Content metadata extraction (title, author, date)
  • Support for full content or summary feeds
  • Historical data seeding options

Example Use Cases:

  • Monitor industry news and trends
  • Track competitor blog posts
  • Index podcast transcripts for searchability
  • Aggregate content from multiple sources

Websites

Website sources crawl and index web pages automatically, following links and respecting crawl limits.

Best For:

  • Documentation sites
  • Help centers
  • Product pages
  • Knowledge bases

Features:

  • Configurable crawl depth (1-5 levels)
  • Sitemap support for efficient crawling
  • URL filtering with regex patterns
  • Main content extraction (removes headers, footers, navigation)
  • AI-powered content filtering to skip ads

Crawl Configuration:

  • Crawl Depth: How many levels deep to follow links
  • Crawl Limit: Maximum number of pages to crawl
  • Path Regular Expressions: Include/exclude URL patterns
  • Sitemap: Automatically detect and use sitemap.xml
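
As a sketch, these settings are passed when creating a website source. The crawl_depth and crawl_limit field names appear elsewhere on this page; the path filter field name used here (path_regex) is an assumption:

// Minimal sketch of a website source with crawl settings.
// crawl_depth and crawl_limit are referenced elsewhere on this page;
// path_regex is a hypothetical name for the URL include/exclude patterns.
const docsSource = await seclai.sources.create({
  name: "Product Docs",
  source_type: "website",
  url: "https://docs.example.com",
  polling: "weekly",
  crawl_depth: 3, // follow links up to 3 levels deep
  crawl_limit: 500, // stop after 500 pages
  path_regex: ["^/docs/"], // hypothetical: only crawl documentation paths
});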

Example Use Cases:

  • Index your product documentation
  • Monitor competitor websites
  • Create searchable help center
  • Track regulatory or compliance documentation

File Uploads

Upload documents and files directly for indexing.

Best For:

  • Internal documents
  • Custom content
  • One-time uploads

Supported Formats:

  • Text: .txt, .md, .html, .csv, .json, .xml
  • Documents: .pdf, .doc/.docx, .ppt/.pptx, .xls/.xlsx, .epub, .msg
  • Images: .png, .jpg, .gif, .bmp, .tiff, .webp
  • Audio (with transcription): .mp3, .wav, .m4a, .flac, .ogg
  • Video (with transcription): .mp4, .mov, .avi
  • Archives: .zip

Features:

  • Content filtering options
  • Custom embedding configuration
  • Direct file storage integration

Custom Index

Build a fully custom index, choosing the embedding model and vector dimensions that deliver the best recall for your content.

Best For:

  • Optimizing recall
  • Evaluating the performance of different embedding models and vector dimensions

Features:

  • API-driven content creation
  • Full control over content and metadata
  • Custom chunking and embedding

Adding a Source

Basic Setup

// Using the TypeScript SDK
const source = await seclai.sources.create({
  name: "Company Blog",
  source_type: "rss_feed",
  url: "https://example.com/blog/rss.xml",
  polling: "daily",
  retention_days: 90, // Keep content for 90 days
});

URL Detection

Seclai automatically detects whether a URL is an RSS feed or a website:

// Detect source type
const detection = await seclai.sources.detect({
  url: "https://example.com",
});

console.log(detection.source_type); // "website" or "rss_feed"
console.log(detection.final_url); // Resolved URL after redirects

The detection process:

  1. Fetches the URL and checks content type
  2. Analyzes response to identify RSS/Atom feeds
  3. Returns metadata about the source
  4. Checks for existing sources at the same URL
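
Detection pairs naturally with creation; a minimal sketch, assuming create accepts the detected type and resolved URL:

// Detect the source type, then create the source from the result
const detection = await seclai.sources.detect({
  url: "https://example.com/blog",
});

const source = await seclai.sources.create({
  name: "Example Blog",
  source_type: detection.source_type, // "website" or "rss_feed"
  url: detection.final_url, // redirect-resolved URL
  polling: "daily",
});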

Seeding Options

When you first add a source, you can control which content is initially indexed:

Full History

Index all available content from the source.

{
  seed: "full_history"
}

Best For:

  • Small content sources
  • When you need complete historical context
  • Documentation sites with limited pages

Note: Large sources may incur significant embedding costs.

Latest N Items

Index only the most recent items.

{
  seed: "latest_n",
  latest_n: 10  // Last 10 items (default: 5)
}

Best For:

  • Large RSS feeds with frequent updates
  • When historical content isn't relevant
  • Testing new sources before full import

Selected Items

Choose specific items to index from RSS feeds.

{
  seed: "selected_items",
  selected_items: ["item-guid-1", "item-guid-2"]
}

Best For:

  • Cherry-picking specific episodes or articles
  • Manual content curation
  • Importing only relevant historical content
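
Seeding options are passed alongside the other settings at creation time; a minimal sketch using latest_n for a test import:

// Start small while testing: index only the 10 most recent items
const podcast = await seclai.sources.create({
  name: "Industry Podcast",
  source_type: "rss_feed",
  url: "https://example.com/podcast/rss.xml",
  polling: "daily",
  seed: "latest_n",
  latest_n: 10,
});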

Polling Schedules

Control how often Seclai checks for new content:

Manually

No automatic polling. Trigger updates via API or UI.

{
  polling: "manually"
}

Use When:

  • Content rarely changes
  • You want manual control
  • File uploads or custom indexes

Once

Poll immediately when added, then never again.

{
  polling: "once"
}

Use When:

  • One-time content import
  • Archived content sources
  • Testing source configuration

Hourly

Check for new content every hour.

{
  polling: "hourly"
}

Use When:

  • Breaking news feeds
  • Real-time monitoring
  • Frequently updated sources

Note: Higher polling frequency increases costs.

Daily

Check for new content once per day.

{
  polling: "daily"
}

Use When:

  • Blog posts and articles
  • Most RSS feeds
  • Documentation sites
  • Recommended default for most sources

Weekly

Check for new content once per week.

{
  polling: "weekly"
}

Use When:

  • Infrequently updated content
  • Newsletters or weekly publications
  • Cost optimization for low-priority sources

Polling Actions

Control what happens when Seclai finds existing content during polling:

New Only

Only index new content items.

{
  polling_action: "new"
}

Behavior:

  • Skips items that were previously indexed
  • Lower costs (no reprocessing)
  • Doesn't catch content updates

Default for most source types.

New and Updated

Index new content and reprocess updated items.

{
  polling_action: "new_and_updated"
}

Behavior:

  • Detects content changes via ETag and Last-Modified headers
  • Reprocesses updated content
  • Higher accuracy for dynamic content
  • Increased costs for reprocessing

Use When:

  • Content frequently changes (edited posts, updated documentation)
  • Accuracy is more important than cost
  • You need the latest version of content

Polling Limits

Control how many items to fetch per poll:

{
  polling_max_items: 50 // Limit to 50 new items per poll
}

Benefits:

  • Prevents cost spikes from large batches
  • Gradual processing of backlog
  • Predictable resource usage
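
Taken together, a source's polling behavior combines the schedule, action, and item limit; a minimal sketch using the update call shown later on this page:

// Check hourly for new and updated items, capped at 50 per poll
await seclai.sources.update(sourceId, {
  polling: "hourly",
  polling_action: "new_and_updated",
  polling_max_items: 50,
});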

Content Retention

Automatically remove old content from your knowledge base:

{
  retention_days: 90 // Delete content older than 90 days
}

Settings:

  • Null (default): Keep content indefinitely
  • Number: Delete content after N days

Use Cases:

  • GDPR compliance (automatic data deletion)
  • Keep only recent, relevant content
  • Control storage and search costs
  • Maintain fresh results

Example: Set to 30 days for news feeds, 365 days for documentation.
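
A sketch applying those two retention windows, plus the indefinite default:

// News expires quickly; documentation stays relevant much longer
await seclai.sources.update(newsFeedId, { retention_days: 30 });
await seclai.sources.update(docsSiteId, { retention_days: 365 });
await seclai.sources.update(archiveId, { retention_days: null }); // keep forever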

Content Processing Pipeline

Phase 1: Fetching

Content is fetched from the source URL.

For RSS Feeds:

  • Parses feed XML
  • Extracts item metadata (title, author, date, GUID)
  • Fetches full article content from item URLs

For Websites:

  • Crawls pages following configured rules
  • Extracts main content
  • Removes navigation, ads, headers/footers
  • Stores HTML for processing

Phase 2: Transcription (Audio/Video)

Audio and video content is transcribed to text.

Supported Formats:

  • MP3, WAV, M4A (audio)
  • MP4, MOV, AVI (video)

Features:

  • Automatic language detection
  • Speaker diarization
  • Timestamp preservation
  • Word-level confidence scores

Costs: Based on audio duration (minutes).

Phase 3: Indexing

Text content is chunked and embedded for vector search.

Process:

  1. Text Extraction: Clean and prepare text
  2. Chunking: Split into optimal segments (default: 1000 characters, 200 overlap)
  3. Embedding: Generate vector embeddings using configured model
  4. Storage: Store in PostgreSQL with pgvector

Costs: Text extraction is billed at 1 credit per second of processing time.

Configuration:

  • Chunk Size: Length of text segments (default: 1000 characters)
  • Chunk Overlap: Overlap between chunks (default: 200 characters)
  • Embedding Model: Model used for vector generation
  • Dimensions: Vector size (256-3584 based on model)
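
For intuition, a minimal sketch of the sliding-window chunking described above, using the default sizes (illustrative only; Seclai chunks content server-side):

// Each chunk starts (chunkSize - overlap) characters after the previous one,
// so consecutive chunks share `overlap` characters of context.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap; // 800 characters by default
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final chunk reached the end
  }
  return chunks;
}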

Phase 4: Linking

Content is linked to knowledge bases and triggers agents.

Actions:

  1. Links content to all knowledge bases using the source
  2. Triggers agents with "content_added" or "content_updated" triggers
  3. Records usage costs for embeddings and processing
  4. Updates content counts and metadata

Content Filtering

Control what content is extracted from websites and documents:

None

Extract all content without filtering.

{
  content_filter: "none"
}

Header, Footer, Navigation

Remove common page elements.

{
  content_filter: "header_footer_navigation"
}

Removes:

  • Navigation menus
  • Headers and footers
  • Sidebars
  • Copyright notices

Best For: Most website sources.

AI Main Content

Use AI to extract only the main content.

{
  content_filter: "ai_main_content"
}

Features:

  • Removes all non-content elements
  • Preserves article structure
  • Keeps images and media references

Best For: News sites, blogs, documentation.

AI Main Content (Skip Ads)

AI extraction with aggressive ad removal.

{
  content_filter: "ai_main_content_skip_ads"
}

Features:

  • Everything from AI Main Content
  • Removes advertisements
  • Removes promotional content
  • Removes sponsored sections

Best For: Sites with heavy advertising, news sites.

Note: Content filtering applies to website sources and file uploads.

Embedding Configuration

Each source can use different embedding models:

{
  embedding_model: "text-embedding-3-small",
  dimensions: 1536,
  chunk_size: 1000,
  chunk_overlap: 200
}

Available Models:

  • text-embedding-3-small: 512 or 1536 dimensions (recommended)
  • text-embedding-3-large: 256 or 3072 dimensions (highest quality)
  • amazon-nova-2-multimodal: 256 or 1024 dimensions
  • voyage-3: 512 or 1024 dimensions
  • And 50+ other models from OpenAI, Cohere, Voyage, Jina, and Mistral

Considerations:

  • Higher dimensions: Better accuracy, higher costs
  • Lower dimensions: Faster search, lower costs
  • Model consistency: Use the same model across sources in a knowledge base for best results
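
For intuition on the dimension trade-off: search ranks chunks by the similarity between the query embedding and each chunk embedding. A minimal cosine-similarity sketch (illustrative only; Seclai performs this comparison server-side via pgvector):

// Cosine similarity: ~1.0 means near-identical direction, ~0 means unrelated.
// More dimensions carry more information per vector, but each comparison
// costs proportionally more compute and storage.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}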

Managing Sources

Update Source Settings

await seclai.sources.update(sourceId, {
  name: "Updated Name",
  polling: "weekly",
  retention_days: 180,
  polling_action: "new_and_updated",
});

Manual Polling

Trigger an immediate content update:

await seclai.sources.pull(sourceId, {
  seed: "latest_n",
  latest_n: 5,
});

View Source Content

const content = await seclai.sources.listContent(sourceId, {
  page: 1,
  limit: 20,
  sort: "created_at",
  order: "desc",
});

Monitor Processing Progress

const progress = await seclai.sources.getProgress(sourceId);

console.log(progress.total); // Total items
console.log(progress.completed); // Completed items
console.log(progress.phases); // Phase-by-phase breakdown

Phases:

  • fetching: Content retrieval from source
  • transcribing: Audio/video transcription
  • indexing: Embedding generation

Delete a Source

await seclai.sources.delete(sourceId);

Note: Deleting a source removes it from all knowledge bases and deletes all associated content.

Best Practices

Choose the Right Polling Frequency

  • Hourly: Only for critical, frequently updated sources
  • Daily: Default for most RSS feeds and blogs
  • Weekly: Documentation, newsletters, low-priority sources
  • Manually: File uploads, one-time imports

Why: Higher polling frequency increases costs and processing overhead.

Use Retention Policies

  • Set retention periods for time-sensitive content
  • Keep 30-90 days for news feeds
  • Keep 365+ days for documentation
  • Null (infinite) for permanent content

Why: Reduces storage costs and keeps results relevant.

Optimize Seeding

  • Start with "latest_n" (5-10 items) for testing
  • Expand to "full_history" once verified
  • Use "selected_items" for specific content

Why: Prevents expensive initial indexing of unwanted content.

Monitor Processing Costs

  • Check embedding costs before full import
  • Use smaller embedding dimensions for large sources
  • Limit polling_max_items for high-volume feeds

Why: Large sources can quickly accumulate costs.

Content Filtering

  • Use "ai_main_content_skip_ads" for news sites
  • Use "header_footer_navigation" for most websites
  • Use "none" only when you need full page context

Why: Improves search quality and reduces noise.

Crawl Configuration (Websites)

  • Start with crawl_depth: 2 or 3
  • Use URL filters to include/exclude paths
  • Enable sitemap crawling when available
  • Set reasonable crawl_limit (100-1000 pages)

Why: Prevents over-crawling and focuses on relevant content.

Limits and Quotas

  • Maximum sources per account: Plan-dependent
  • Polling interval: One hour minimum (no real-time updates)
  • Crawl depth: Maximum 5 levels
  • Content size: Maximum 10MB per item
  • Retention: Minimum 1 day

Check your plan limits under Settings → Usage.

Source Connection Sharing

Public sources (RSS feeds, websites) are shared across accounts to optimize resources:

  • Source: Shared base entity with URL and type
  • Source Connection: Your personal connection with unique settings
  • Benefits: Reduced redundant polling, shared storage costs, faster initial setup

Private sources (file uploads, custom indexes) are never shared between accounts.

Agent Integration

Sources can trigger agents automatically when content is added or updated:

// Create agent with content trigger
const agent = await seclai.agents.create({
  name: "Summarize New Posts",
  triggers: [
    {
      type: "content_added",
      knowledge_base_id: kbId,
    },
  ],
  steps: [
    /* ... */
  ],
});

Trigger Types:

  • content_added: Runs when new content is indexed
  • content_updated: Runs when content is reprocessed
  • content_added_or_updated: Runs on either event

Use Cases:

  • Generate summaries of new articles
  • Send notifications for important updates
  • Extract structured data from content
  • Categorize and tag content

Troubleshooting

Source Not Polling

Check:

  • Polling schedule is not set to "manually"
  • Source hasn't failed recently (check error status)
  • Account has active subscription
  • No maintenance mode active

Content Not Appearing

Possible Causes:

  • Still processing (check progress endpoint)
  • Content filtered out by seeding rules
  • Duplicate content (already indexed)
  • Content failed processing (check errors)

High Costs

Solutions:

  • Reduce polling frequency
  • Lower embedding dimensions
  • Set retention policies
  • Use polling_max_items limits
  • Switch to smaller embedding models

Crawl Not Finding Pages

Solutions:

  • Increase crawl_depth
  • Check URL filters aren't too restrictive
  • Enable sitemap crawling
  • Verify robots.txt isn't blocking
  • Check crawl_limit isn't too low

API Reference

For complete API documentation, see API Reference.