Documentation

Sources

Sources are the data inputs that populate your knowledge bases with up-to-date information. Seclai automatically fetches, processes, and indexes content from various sources, keeping your knowledge bases current without manual intervention.

What are Sources?

A source is a connection to external content that you want to make searchable. When you add a source to a knowledge base:

  1. Initial Polling: Seclai fetches content based on your seeding preferences
  2. Content Processing: Text is extracted, chunked, and embedded for vector search
  3. Automatic Updates: Content is refreshed based on your polling schedule
  4. Multi-Phase Processing: Audio/video is transcribed, and all content is indexed

Sources can be shared across multiple knowledge bases or organizations, with each connection maintaining independent settings for polling, retention, and indexing.

Source Types

RSS Feeds

RSS feeds automatically pull content from blogs, podcasts, news sites, and other syndicated content.

Best For:

  • Blog posts and articles
  • Podcast episodes
  • News feeds
  • YouTube channels (via RSS)

Features:

  • Automatic detection of RSS/Atom feeds
  • Content metadata extraction (title, author, date)
  • Support for full content or summary feeds
  • Historical data seeding options

Example Use Cases:

  • Monitor industry news and trends
  • Track competitor blog posts
  • Index podcast transcripts for searchability
  • Aggregate content from multiple sources

Websites

Website sources crawl and index web pages automatically, following links and respecting crawl limits.

Best For:

  • Documentation sites
  • Help centers
  • Product pages
  • Knowledge bases

Features:

  • Configurable crawl depth (1-5 levels)
  • Sitemap support for efficient crawling
  • URL filtering with regex patterns
  • Main content extraction (removes headers, footers, navigation)
  • AI-powered content filtering to skip ads

Crawl Configuration:

  • Crawl Depth: How many levels deep to follow links
  • Crawl Limit: Maximum number of pages to crawl
  • Path Regular Expressions: Include/exclude URL patterns
  • Sitemap: Automatically detect and use sitemap.xml
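
As a sketch, these settings are passed when creating a website source. The crawl_depth and crawl_limit field names appear elsewhere on this page; the path filter field name used here (path_regex) is an assumption:

// Minimal sketch of a website source with crawl settings.
// crawl_depth and crawl_limit are referenced elsewhere on this page;
// path_regex is a hypothetical name for the URL include/exclude patterns.
const docsSource = await seclai.sources.create({
  name: "Product Docs",
  source_type: "website",
  url: "https://docs.example.com",
  polling: "weekly",
  crawl_depth: 3, // follow links up to 3 levels deep
  crawl_limit: 500, // stop after 500 pages
  path_regex: ["^/docs/"], // hypothetical: only crawl documentation paths
});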

Example Use Cases:

  • Index your product documentation
  • Monitor competitor websites
  • Create searchable help center
  • Track regulatory or compliance documentation

File Uploads

Upload documents and files directly for indexing.

Best For:

  • Internal documents
  • Custom content
  • One-time uploads

Supported Formats:

  • Text: .txt, .md, .html, .csv, .json, .xml
  • Documents: .pdf, .doc/.docx, .ppt/.pptx, .xls/.xlsx, .epub, .msg
  • Images: .png, .jpg, .gif, .bmp, .tiff, .webp
  • Audio (with transcription): .mp3, .wav, .m4a, .flac, .ogg
  • Video (with transcription): .mp4, .mov, .avi
  • Archives: .zip

Features:

  • Content filtering options
  • Custom embedding configuration
  • Direct file storage integration

Custom Index

Build a fully custom index, choosing the embedding model and vector dimensions that deliver the best recall for your content.

Best For:

  • Optimizing recall
  • Evaluating the performance of different embedding models and vector dimensions

Features:

  • API-driven content creation
  • Full control over content and metadata
  • Custom chunking and embedding

Adding a Source

Basic Setup

// Using the TypeScript SDK
const source = await seclai.sources.create({
  name: "Company Blog",
  source_type: "rss_feed",
  url: "https://example.com/blog/rss.xml",
  polling: "daily",
  retention_days: 90, // Keep content for 90 days
});

URL Detection

Seclai automatically detects whether a URL is an RSS feed or a website:

// Detect source type
const detection = await seclai.sources.detect({
  url: "https://example.com",
});

console.log(detection.source_type); // "website" or "rss_feed"
console.log(detection.final_url); // Resolved URL after redirects

The detection process:

  1. Fetches the URL and checks content type
  2. Analyzes response to identify RSS/Atom feeds
  3. Returns metadata about the source
  4. Checks for existing sources at the same URL
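
Detection pairs naturally with creation; a minimal sketch, assuming create accepts the detected type and resolved URL:

// Detect the source type, then create the source from the result
const detection = await seclai.sources.detect({
  url: "https://example.com/blog",
});

const source = await seclai.sources.create({
  name: "Example Blog",
  source_type: detection.source_type, // "website" or "rss_feed"
  url: detection.final_url, // redirect-resolved URL
  polling: "daily",
});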

Seeding Options

When you first add a source, you can control which content is initially indexed:

Full History

Index all available content from the source.

{
  seed: "full_history"
}

Best For:

  • Small content sources
  • When you need complete historical context
  • Documentation sites with limited pages

Note: Large sources may incur significant embedding costs.

Latest N Items

Index only the most recent items.

{
  seed: "latest_n",
  latest_n: 10  // Last 10 items (default: 5)
}

Best For:

  • Large RSS feeds with frequent updates
  • When historical content isn't relevant
  • Testing new sources before full import

Selected Items

Choose specific items to index from RSS feeds.

{
  seed: "selected_items",
  selected_items: ["item-guid-1", "item-guid-2"]
}

Best For:

  • Cherry-picking specific episodes or articles
  • Manual content curation
  • Importing only relevant historical content
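
Seeding options are passed alongside the other settings at creation time; a minimal sketch using latest_n for a test import:

// Start small while testing: index only the 10 most recent items
const podcast = await seclai.sources.create({
  name: "Industry Podcast",
  source_type: "rss_feed",
  url: "https://example.com/podcast/rss.xml",
  polling: "daily",
  seed: "latest_n",
  latest_n: 10,
});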

Polling Schedules

Control how often Seclai checks for new content:

Manually

No automatic polling. Trigger updates via API or UI.

{
  polling: "manually"
}

Use When:

  • Content rarely changes
  • You want manual control
  • File uploads or custom indexes

Once

Poll immediately when added, then never again.

{
  polling: "once"
}

Use When:

  • One-time content import
  • Archived content sources
  • Testing source configuration

Hourly

Check for new content every hour.

{
  polling: "hourly"
}

Use When:

  • Breaking news feeds
  • Real-time monitoring
  • Frequently updated sources

Note: Higher polling frequency increases costs.

Daily

Check for new content once per day.

{
  polling: "daily"
}

Use When:

  • Blog posts and articles
  • Most RSS feeds
  • Documentation sites
  • Recommended default for most sources

Weekly

Check for new content once per week.

{
  polling: "weekly"
}

Use When:

  • Infrequently updated content
  • Newsletters or weekly publications
  • Cost optimization for low-priority sources

Polling Actions

Control what happens when Seclai finds existing content during polling:

New Only

Only index new content items.

{
  polling_action: "new"
}

Behavior:

  • Skips items that were previously indexed
  • Lower costs (no reprocessing)
  • Doesn't catch content updates

Default for most source types.

New and Updated

Index new content and reprocess updated items.

{
  polling_action: "new_and_updated"
}

Behavior:

  • Detects content changes via ETag and Last-Modified headers
  • Reprocesses updated content
  • Higher accuracy for dynamic content
  • Increased costs for reprocessing

Use When:

  • Content frequently changes (edited posts, updated documentation)
  • Accuracy is more important than cost
  • You need the latest version of content

Polling Limits

Control how many items to fetch per poll:

{
  polling_max_items: 50 // Limit to 50 new items per poll
}

Benefits:

  • Prevents cost spikes from large batches
  • Gradual processing of backlog
  • Predictable resource usage
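
Taken together, a source's polling behavior combines the schedule, action, and item limit; a minimal sketch using the update call shown later on this page:

// Check hourly for new and updated items, capped at 50 per poll
await seclai.sources.update(sourceId, {
  polling: "hourly",
  polling_action: "new_and_updated",
  polling_max_items: 50,
});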

Content Retention

Automatically remove old content from your knowledge base:

{
  retention_days: 90 // Delete content older than 90 days
}

Settings:

  • Null (default): Keep content indefinitely
  • Number: Delete content after N days

Use Cases:

  • GDPR compliance (automatic data deletion)
  • Keep only recent, relevant content
  • Control storage and search costs
  • Maintain fresh results

Example: Set to 30 days for news feeds, 365 days for documentation.
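
A sketch applying those two retention windows, plus the indefinite default:

// News expires quickly; documentation stays relevant much longer
await seclai.sources.update(newsFeedId, { retention_days: 30 });
await seclai.sources.update(docsSiteId, { retention_days: 365 });
await seclai.sources.update(archiveId, { retention_days: null }); // keep forever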

Content Processing Pipeline

Phase 1: Fetching

Content is fetched from the source URL.

For RSS Feeds:

  • Parses feed XML
  • Extracts item metadata (title, author, date, GUID)
  • Fetches full article content from item URLs

For Websites:

  • Crawls pages following configured rules
  • Extracts main content
  • Removes navigation, ads, headers/footers
  • Stores HTML for processing

Phase 2: Transcription (Audio/Video)

Audio and video content is transcribed to text.

Supported Formats:

  • MP3, WAV, M4A (audio)
  • MP4, MOV, AVI (video)

Features:

  • Automatic language detection
  • Speaker diarization
  • Timestamp preservation
  • Word-level confidence scores

Costs: Based on audio duration (minutes).

Phase 3: Indexing

Text content is chunked and embedded for vector search.

Process:

  1. Text Extraction: Clean and prepare text
  2. Chunking: Split into optimal segments (default: 1000 characters, 200 overlap)
  3. Embedding: Generate vector embeddings using configured model
  4. Storage: Store in PostgreSQL with pgvector

Costs: Text extraction is billed at 1 credit per second of processing time.

Configuration:

  • Chunk Size: Length of text segments (default: 1000 characters)
  • Chunk Overlap: Overlap between chunks (default: 200 characters)
  • Embedding Model: Model used for vector generation
  • Dimensions: Vector size (256-3584 based on model)
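
For intuition, a minimal sketch of the sliding-window chunking described above, using the default sizes (illustrative only; Seclai chunks content server-side):

// Each chunk starts (chunkSize - overlap) characters after the previous one,
// so consecutive chunks share `overlap` characters of context.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap; // 800 characters by default
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // final chunk reached the end
  }
  return chunks;
}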

Phase 4: Linking

Content is linked to knowledge bases and triggers agents.

Actions:

  1. Links content to all knowledge bases using the source
  2. Triggers agents with "content_added" or "content_updated" triggers
  3. Records usage costs for embeddings and processing
  4. Updates content counts and metadata

Content Filtering

Control what content is extracted from websites and documents:

None

Extract all content without filtering.

{
  content_filter: "none"
}

Header, Footer, Navigation

Remove common page elements.

{
  content_filter: "header_footer_navigation"
}

Removes:

  • Navigation menus
  • Headers and footers
  • Sidebars
  • Copyright notices

Best For: Most website sources.

AI Main Content

Use AI to extract only the main content.

{
  content_filter: "ai_main_content"
}

Features:

  • Removes all non-content elements
  • Preserves article structure
  • Keeps images and media references

Best For: News sites, blogs, documentation.

AI Main Content (Skip Ads)

AI extraction with aggressive ad removal.

{
  content_filter: "ai_main_content_skip_ads"
}

Features:

  • Everything from AI Main Content
  • Removes advertisements
  • Removes promotional content
  • Removes sponsored sections

Best For: Sites with heavy advertising, news sites.

Note: Content filtering applies to website sources and file uploads.

Embedding Configuration

Each source can use different embedding models:

{
  embedding_model: "text-embedding-3-small",
  dimensions: 1536,
  chunk_size: 1000,
  chunk_overlap: 200
}

Available Models:

  • text-embedding-3-small: 512 or 1536 dimensions (recommended)
  • text-embedding-3-large: 256 or 3072 dimensions (highest quality)
  • amazon-nova-2-multimodal: 256 or 1024 dimensions
  • voyage-3: 512 or 1024 dimensions
  • And 50+ other models from OpenAI, Cohere, Voyage, Jina, and Mistral

Considerations:

  • Higher dimensions: Better accuracy, higher costs
  • Lower dimensions: Faster search, lower costs
  • Model consistency: Use the same model across sources in a knowledge base for best results
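
For intuition on the dimension trade-off: search ranks chunks by the similarity between the query embedding and each chunk embedding. A minimal cosine-similarity sketch (illustrative only; Seclai performs this comparison server-side via pgvector):

// Cosine similarity: ~1.0 means near-identical direction, ~0 means unrelated.
// More dimensions carry more information per vector, but each comparison
// costs proportionally more compute and storage.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}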

Managing Sources

Update Source Settings

await seclai.sources.update(sourceId, {
  name: "Updated Name",
  polling: "weekly",
  retention_days: 180,
  polling_action: "new_and_updated",
});

Manual Polling

Trigger an immediate content update:

await seclai.sources.pull(sourceId, {
  seed: "latest_n",
  latest_n: 5,
});

View Source Content

const content = await seclai.sources.listContent(sourceId, {
  page: 1,
  limit: 20,
  sort: "created_at",
  order: "desc",
});

Monitor Processing Progress

const progress = await seclai.sources.getProgress(sourceId);

console.log(progress.total); // Total items
console.log(progress.completed); // Completed items
console.log(progress.phases); // Phase-by-phase breakdown

Phases:

  • fetching: Content retrieval from source
  • transcribing: Audio/video transcription
  • indexing: Embedding generation

Delete a Source

await seclai.sources.delete(sourceId);

Note: Deleting a source removes it from all knowledge bases and deletes all associated content.

Best Practices

Choose the Right Polling Frequency

  • Hourly: Only for critical, frequently updated sources
  • Daily: Default for most RSS feeds and blogs
  • Weekly: Documentation, newsletters, low-priority sources
  • Manually: File uploads, one-time imports

Why: Higher polling frequency increases costs and processing overhead.

Use Retention Policies

  • Set retention periods for time-sensitive content
  • Keep 30-90 days for news feeds
  • Keep 365+ days for documentation
  • Null (infinite) for permanent content

Why: Reduces storage costs and keeps results relevant.

Optimize Seeding

  • Start with "latest_n" (5-10 items) for testing
  • Expand to "full_history" once verified
  • Use "selected_items" for specific content

Why: Prevents expensive initial indexing of unwanted content.

Monitor Processing Costs

  • Check embedding costs before full import
  • Use smaller embedding dimensions for large sources
  • Limit polling_max_items for high-volume feeds

Why: Large sources can quickly accumulate costs.

Content Filtering

  • Use "ai_main_content_skip_ads" for news sites
  • Use "header_footer_navigation" for most websites
  • Use "none" only when you need full page context

Why: Improves search quality and reduces noise.

Crawl Configuration (Websites)

  • Start with crawl_depth: 2 or 3
  • Use URL filters to include/exclude paths
  • Enable sitemap crawling when available
  • Set reasonable crawl_limit (100-1000 pages)

Why: Prevents over-crawling and focuses on relevant content.

Limits and Quotas

  • Maximum sources per account: Plan-dependent
  • Polling interval: One hour minimum (no real-time updates)
  • Crawl depth: Maximum 5 levels
  • Content size: Maximum 10MB per item
  • Retention: Minimum 1 day

Check your plan limits under Settings → Usage.

Source Connection Sharing

Public sources (RSS feeds, websites) are shared across accounts to optimize resources:

  • Source: Shared base entity with URL and type
  • Source Connection: Your personal connection with unique settings
  • Benefits: Reduced redundant polling, shared storage costs, faster initial setup

Private sources (file uploads, custom indexes) are never shared between accounts.

Agent Integration

Sources can trigger agents automatically when content is added or updated:

// Create agent with content trigger
const agent = await seclai.agents.create({
  name: "Summarize New Posts",
  triggers: [
    {
      type: "content_added",
      knowledge_base_id: kbId,
    },
  ],
  steps: [
    /* ... */
  ],
});

Trigger Types:

  • content_added: Runs when new content is indexed
  • content_updated: Runs when content is reprocessed
  • content_added_or_updated: Runs on either event

Use Cases:

  • Generate summaries of new articles
  • Send notifications for important updates
  • Extract structured data from content
  • Categorize and tag content

Troubleshooting

Source Not Polling

Check:

  • Polling schedule is not set to "manually"
  • Source hasn't failed recently (check error status)
  • Account has active subscription
  • No maintenance mode active

Content Not Appearing

Possible Causes:

  • Still processing (check progress endpoint)
  • Content filtered out by seeding rules
  • Duplicate content (already indexed)
  • Content failed processing (check errors)

High Costs

Solutions:

  • Reduce polling frequency
  • Lower embedding dimensions
  • Set retention policies
  • Use polling_max_items limits
  • Switch to smaller embedding models

Crawl Not Finding Pages

Solutions:

  • Increase crawl_depth
  • Check URL filters aren't too restrictive
  • Enable sitemap crawling
  • Verify robots.txt isn't blocking
  • Check crawl_limit isn't too low

API Reference

For complete API documentation, see API Reference.