Sources
Sources are the data inputs that populate your knowledge bases with up-to-date information. Seclai automatically fetches, processes, and indexes content from various sources, keeping your knowledge bases current without manual intervention.
What are Sources?
A source is a connection to external content that you want to make searchable. When you add a source to a knowledge base:
- Initial Polling: Seclai fetches content based on your seeding preferences
- Content Processing: Text is extracted, chunked, and embedded for vector search
- Automatic Updates: Content is refreshed based on your polling schedule
- Multi-Phase Processing: Audio/video is transcribed, and all content is indexed
Sources can be shared across multiple knowledge bases or organizations, with each connection maintaining independent settings for polling, retention, and indexing.
Source Types
RSS Feeds
RSS feeds automatically pull content from blogs, podcasts, news sites, and other syndicated content.
Best For:
- Blog posts and articles
- Podcast episodes
- News feeds
- YouTube channels (via RSS)
Features:
- Automatic detection of RSS/Atom feeds
- Content metadata extraction (title, author, date)
- Support for full content or summary feeds
- Historical data seeding options
Example Use Cases:
- Monitor industry news and trends
- Track competitor blog posts
- Index podcast transcripts for searchability
- Aggregate content from multiple sources
Websites
Website sources crawl and index web pages automatically, following links and respecting crawl limits.
Best For:
- Documentation sites
- Help centers
- Product pages
- Knowledge bases
Features:
- Configurable crawl depth (1-5 levels)
- Sitemap support for efficient crawling
- URL filtering with regex patterns
- Main content extraction (removes headers, footers, navigation)
- AI-powered content filtering to skip ads
Crawl Configuration:
- Crawl Depth: How many levels deep to follow links
- Crawl Limit: Maximum number of pages to crawl
- Path Regular Expressions: Include/exclude URL patterns
- Sitemap: Automatically detect and use sitemap.xml
Example Use Cases:
- Index your product documentation
- Monitor competitor websites
- Create a searchable help center
- Track regulatory or compliance documentation
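The include/exclude path patterns above can be sketched as a small predicate. This is an illustrative helper, not part of the SDK; the function name and signature are assumptions:

```typescript
// Decide whether a crawled URL should be indexed, given optional
// include/exclude regular expressions applied to the URL path.
// Exclude patterns take precedence over include patterns.
function shouldCrawl(
  url: string,
  include?: RegExp,
  exclude?: RegExp,
): boolean {
  const path = new URL(url).pathname;
  if (exclude && exclude.test(path)) return false; // excludes win
  if (include && !include.test(path)) return false; // must match include
  return true;
}
```

For example, `shouldCrawl("https://example.com/docs/intro", /^\/docs/)` keeps documentation pages while pages outside `/docs` are skipped.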
File Uploads
Upload documents and files directly for indexing.
Best For:
- Internal documents
- Custom content
- One-time uploads
Supported Formats:
- Text: .txt, .md, .html, .csv, .json, .xml
- Documents: .pdf, .doc/.docx, .ppt/.pptx, .xls/.xlsx, .epub, .msg
- Images: .png, .jpg, .gif, .bmp, .tiff, .webp
- Audio (with transcription): .mp3, .wav, .m4a, .flac, .ogg
- Video (with transcription): .mp4, .mov, .avi
- Archives: .zip
Features:
- Content filtering options
- Custom embedding configuration
- Direct file storage integration
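The supported-format list above can be expressed as a simple lookup from file extension to processing category. The mapping below is a sketch built from that list; the helper itself is illustrative, not an SDK function:

```typescript
// Processing category for each supported extension, per the format list.
const CATEGORIES: Record<string, string[]> = {
  text: ["txt", "md", "html", "csv", "json", "xml"],
  document: ["pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx", "epub", "msg"],
  image: ["png", "jpg", "gif", "bmp", "tiff", "webp"],
  audio: ["mp3", "wav", "m4a", "flac", "ogg"],
  video: ["mp4", "mov", "avi"],
  archive: ["zip"],
};

// Map a filename to its processing category, or null if unsupported.
function fileCategory(filename: string): string | null {
  const ext = filename.split(".").pop()?.toLowerCase() ?? "";
  for (const [category, exts] of Object.entries(CATEGORIES)) {
    if (exts.includes(ext)) return category;
  }
  return null;
}
```

Audio and video categories go through the transcription phase before indexing, while text and document categories are indexed directly.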
Custom Index
Customize indexing with your choice of embedding model and vector dimensions for the best recall performance.
Best For:
- Optimizing recall
- Evaluating the performance of different embedding models and vector dimensions
Features:
- API-driven content creation
- Full control over content and metadata
- Custom chunking and embedding
Adding a Source
Basic Setup
// Using the TypeScript SDK
const source = await seclai.sources.create({
name: "Company Blog",
source_type: "rss_feed",
url: "https://example.com/blog/rss.xml",
polling: "daily",
retention_days: 90, // Keep content for 90 days
});
URL Detection
Seclai automatically detects whether a URL is an RSS feed or a website:
// Detect source type
const detection = await seclai.sources.detect({
url: "https://example.com",
});
console.log(detection.source_type); // "website" or "rss_feed"
console.log(detection.final_url); // Resolved URL after redirects
The detection process:
- Fetches the URL and checks content type
- Analyzes response to identify RSS/Atom feeds
- Returns metadata about the source
- Checks for existing sources at the same URL
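The first two detection steps can be sketched as a content-type and body heuristic. This is illustrative only; the real detector does more (redirect resolution, existing-source lookup):

```typescript
// Rough sketch of the feed-vs-website classification described above.
function classifySource(
  contentType: string,
  body: string,
): "rss_feed" | "website" {
  // Explicit feed media types are conclusive.
  if (/(rss|atom)\+xml/.test(contentType)) return "rss_feed";
  // Otherwise sniff the body for an XML document with an <rss> or <feed> root.
  if (/^\s*<\?xml/.test(body) && /<(rss|feed)[\s>]/.test(body)) {
    return "rss_feed";
  }
  return "website";
}
```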
Seeding Options
When you first add a source, you can control which content is initially indexed:
Full History
Index all available content from the source.
{
seed: "full_history"
}
Best For:
- Small content sources
- When you need complete historical context
- Documentation sites with limited pages
Note: Large sources may incur significant embedding costs.
Latest N Items
Index only the most recent items.
{
seed: "latest_n",
latest_n: 10 // Last 10 items (default: 5)
}
Best For:
- Large RSS feeds with frequent updates
- When historical content isn't relevant
- Testing new sources before full import
Selected Items
Choose specific items to index from RSS feeds.
{
seed: "selected_items",
selected_items: ["item-guid-1", "item-guid-2"]
}
Best For:
- Cherry-picking specific episodes or articles
- Manual content curation
- Importing only relevant historical content
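The three seeding modes above can be sketched as a single selection function over feed items. The types and function are illustrative, not SDK API; the `latest_n` default of 5 comes from the docs above:

```typescript
type Seed =
  | { seed: "full_history" }
  | { seed: "latest_n"; latest_n?: number }
  | { seed: "selected_items"; selected_items: string[] };

interface FeedItem {
  guid: string;
  publishedAt: Date;
}

// Pick which feed items to index for each seeding mode.
function itemsToSeed(items: FeedItem[], opts: Seed): FeedItem[] {
  switch (opts.seed) {
    case "full_history":
      return items;
    case "latest_n": {
      const n = opts.latest_n ?? 5; // documented default: 5
      return [...items]
        .sort((a, b) => b.publishedAt.getTime() - a.publishedAt.getTime())
        .slice(0, n);
    }
    case "selected_items":
      return items.filter((i) => opts.selected_items.includes(i.guid));
  }
}
```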
Polling Schedules
Control how often Seclai checks for new content:
Manually
No automatic polling. Trigger updates via API or UI.
{
polling: "manually"
}
Use When:
- Content rarely changes
- You want manual control
- File uploads or custom indexes
Once
Poll immediately when added, then never again.
{
polling: "once"
}
Use When:
- One-time content import
- Archived content sources
- Testing source configuration
Hourly
Check for new content every hour.
{
polling: "hourly"
}
Use When:
- Breaking news feeds
- Real-time monitoring
- Frequently updated sources
Note: Higher polling frequency increases costs.
Daily
Check for new content once per day.
{
polling: "daily"
}
Use When:
- Blog posts and articles
- Most RSS feeds
- Documentation sites
- Recommended default for most sources
Weekly
Check for new content once per week.
{
polling: "weekly"
}
Use When:
- Infrequently updated content
- Newsletters or weekly publications
- Cost optimization for low-priority sources
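The five schedules above map to recurrence intervals as sketched below. This is an illustrative client-side model; the real scheduler runs server-side:

```typescript
type Polling = "manually" | "once" | "hourly" | "daily" | "weekly";

// Compute the next poll time for a schedule, given the last poll.
// "manually" and "once" have no recurring polls.
function nextPollAt(last: Date, polling: Polling): Date | null {
  const hour = 60 * 60 * 1000;
  switch (polling) {
    case "manually":
    case "once":
      return null;
    case "hourly":
      return new Date(last.getTime() + hour);
    case "daily":
      return new Date(last.getTime() + 24 * hour);
    case "weekly":
      return new Date(last.getTime() + 7 * 24 * hour);
  }
}
```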
Polling Actions
Control what happens when Seclai finds existing content during polling:
New Only
Only index new content items.
{
polling_action: "new"
}
Behavior:
- Skips items that were previously indexed
- Lower costs (no reprocessing)
- Doesn't catch content updates
Default for most source types.
New and Updated
Index new content and reprocess updated items.
{
polling_action: "new_and_updated"
}
Behavior:
- Detects content changes via ETag and Last-Modified headers
- Reprocesses updated content
- Higher accuracy for dynamic content
- Increased costs for reprocessing
Use When:
- Content frequently changes (edited posts, updated documentation)
- Accuracy is more important than cost
- You need the latest version of content
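The ETag/Last-Modified change detection mentioned above can be sketched as a comparison of stored and freshly fetched validators. The `Snapshot` shape and `hasChanged` helper are illustrative assumptions:

```typescript
interface Snapshot {
  etag?: string;
  lastModified?: string;
}

// Decide whether an already-indexed item should be reprocessed under
// "new_and_updated": compare validators, preferring ETag. If neither
// side provides a validator, conservatively assume the content changed.
function hasChanged(prev: Snapshot, curr: Snapshot): boolean {
  if (prev.etag && curr.etag) return prev.etag !== curr.etag;
  if (prev.lastModified && curr.lastModified) {
    return prev.lastModified !== curr.lastModified;
  }
  return true;
}
```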
Polling Limits
Control how many items to fetch per poll:
{
polling_max_items: 50 // Limit to 50 new items per poll
}
Benefits:
- Prevents cost spikes from large batches
- Gradual processing of backlog
- Predictable resource usage
Content Retention
Automatically remove old content from your knowledge base:
{
retention_days: 90 // Delete content older than 90 days
}
Settings:
- Null (default): Keep content indefinitely
- Number: Delete content after N days
Use Cases:
- GDPR compliance (automatic data deletion)
- Keep only recent, relevant content
- Control storage and search costs
- Maintain fresh results
Example: Set to 30 days for news feeds, 365 days for documentation.
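The retention rule above reduces to a simple age check. This is an illustrative sketch of the policy, not the deletion job itself:

```typescript
// Check whether a content item has expired under a retention policy.
// A null retention (the default) keeps content indefinitely.
function isExpired(
  createdAt: Date,
  retentionDays: number | null,
  now: Date = new Date(),
): boolean {
  if (retentionDays === null) return false;
  const ageDays = (now.getTime() - createdAt.getTime()) / 86_400_000;
  return ageDays > retentionDays;
}
```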
Content Processing Pipeline
Phase 1: Fetching
Content is fetched from the source URL.
For RSS Feeds:
- Parses feed XML
- Extracts item metadata (title, author, date, GUID)
- Fetches full article content from item URLs
For Websites:
- Crawls pages following configured rules
- Extracts main content
- Removes navigation, ads, headers/footers
- Stores HTML for processing
Phase 2: Transcription (Audio/Video)
Audio and video content is transcribed to text.
Supported Formats:
- MP3, WAV, M4A (audio)
- MP4, MOV, AVI (video)
Features:
- Automatic language detection
- Speaker diarization
- Timestamp preservation
- Word-level confidence scores
Costs: Based on audio duration (minutes).
Phase 3: Indexing
Text content is chunked and embedded for vector search.
Process:
- Text Extraction: Clean and prepare text
- Chunking: Split into optimal segments (default: 1000 characters, 200 overlap)
- Embedding: Generate vector embeddings using configured model
- Storage: Store in PostgreSQL with pgvector
Costs: Text extraction is billed at 1 credit per second of processing time.
Configuration:
- Chunk Size: Length of text segments (default: 1000 characters)
- Chunk Overlap: Overlap between chunks (default: 200 characters)
- Embedding Model: Model used for vector generation
- Dimensions: Vector size (256-3584 based on model)
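The chunking step above, with its defaults of 1000 characters and 200-character overlap, can be sketched as a sliding window. This is a minimal illustration; the real chunker is presumably smarter about sentence and paragraph boundaries:

```typescript
// Split text into fixed-size chunks with overlap, mirroring the
// documented defaults (size 1000, overlap 200). Consecutive chunks
// share `overlap` characters so context is not lost at boundaries.
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // window advances by size minus overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // final chunk reached the end
  }
  return chunks;
}
```

A 1,800-character document therefore produces two chunks whose middle 200 characters overlap.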
Phase 4: Linking
Content is linked to knowledge bases and triggers agents.
Actions:
- Links content to all knowledge bases using the source
- Triggers agents with "content_added" or "content_updated" triggers
- Records usage costs for embeddings and processing
- Updates content counts and metadata
Content Filtering
Control what content is extracted from websites and documents:
None
Extract all content without filtering.
{
content_filter: "none"
}
Header, Footer, Navigation
Remove common page elements.
{
content_filter: "header_footer_navigation"
}
Removes:
- Navigation menus
- Headers and footers
- Sidebars
- Copyright notices
Best For: Most website sources.
AI Main Content
Use AI to extract only the main content.
{
content_filter: "ai_main_content"
}
Features:
- Removes all non-content elements
- Preserves article structure
- Keeps images and media references
Best For: News sites, blogs, documentation.
AI Main Content (Skip Ads)
AI extraction with aggressive ad removal.
{
content_filter: "ai_main_content_skip_ads"
}
Features:
- Everything from AI Main Content
- Removes advertisements
- Removes promotional content
- Removes sponsored sections
Best For: Sites with heavy advertising, news sites.
Note: Content filtering applies to website sources and uploaded documents.
Embedding Configuration
Each source can use different embedding models:
{
embedding_model: "text-embedding-3-small",
dimensions: 1536,
chunk_size: 1000,
chunk_overlap: 200
}
Available Models:
- text-embedding-3-small: 512 or 1536 dimensions (recommended)
- text-embedding-3-large: 256 or 3072 dimensions (highest quality)
- amazon-nova-2-multimodal: 256 or 1024 dimensions
- voyage-3: 512 or 1024 dimensions
- 50+ other models from OpenAI, Cohere, Voyage, Jina, and Mistral
Considerations:
- Higher dimensions: Better accuracy, higher costs
- Lower dimensions: Faster search, lower costs
- Model consistency: Use the same model across sources in a knowledge base for best results
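The model list above pairs each model with its allowed dimension values, which suggests a simple pre-flight check before creating a source. The table below only covers the models named explicitly in these docs, and the helper itself is an illustrative sketch, not an SDK function:

```typescript
// Allowed dimensions per model, per the list above (subset only).
const MODEL_DIMENSIONS: Record<string, number[]> = {
  "text-embedding-3-small": [512, 1536],
  "text-embedding-3-large": [256, 3072],
  "amazon-nova-2-multimodal": [256, 1024],
  "voyage-3": [512, 1024],
};

// Validate an embedding configuration before creating a source.
function validateEmbedding(model: string, dimensions: number): boolean {
  const allowed = MODEL_DIMENSIONS[model];
  return allowed !== undefined && allowed.includes(dimensions);
}
```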
Managing Sources
Update Source Settings
await seclai.sources.update(sourceId, {
name: "Updated Name",
polling: "weekly",
retention_days: 180,
polling_action: "new_and_updated",
});
Manual Polling
Trigger an immediate content update:
await seclai.sources.pull(sourceId, {
seed: "latest_n",
latest_n: 5,
});
View Source Content
const content = await seclai.sources.listContent(sourceId, {
page: 1,
limit: 20,
sort: "created_at",
order: "desc",
});
Monitor Processing Progress
const progress = await seclai.sources.getProgress(sourceId);
console.log(progress.total); // Total items
console.log(progress.completed); // Completed items
console.log(progress.phases); // Phase-by-phase breakdown
Phases:
- fetching: Content retrieval from source
- transcribing: Audio/video transcription
- indexing: Embedding generation
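When you need to block until indexing finishes (for example, before querying a freshly added source), the progress endpoint above can be polled in a loop. The helper below is an illustrative sketch; pass it a closure such as `() => seclai.sources.getProgress(sourceId)`:

```typescript
// Poll a progress check until it reports completion, or give up after
// maxAttempts. Returns true on completion, false on timeout.
async function waitForIndexing(
  getProgress: () => Promise<{ total: number; completed: number }>,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<boolean> {
  for (let i = 0; i < maxAttempts; i++) {
    const p = await getProgress();
    if (p.total > 0 && p.completed >= p.total) return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```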
Delete a Source
await seclai.sources.delete(sourceId);
Note: Deleting a source removes it from all knowledge bases and deletes all associated content.
Best Practices
Choose the Right Polling Frequency
- Hourly: Only for critical, frequently updated sources
- Daily: Default for most RSS feeds and blogs
- Weekly: Documentation, newsletters, low-priority sources
- Manually: File uploads, one-time imports
Why: Higher polling frequency increases costs and processing overhead.
Use Retention Policies
- Set retention periods for time-sensitive content
- Keep 30-90 days for news feeds
- Keep 365+ days for documentation
- Null (infinite) for permanent content
Why: Reduces storage costs and keeps results relevant.
Optimize Seeding
- Start with "latest_n" (5-10 items) for testing
- Expand to "full_history" once verified
- Use "selected_items" for specific content
Why: Prevents expensive initial indexing of unwanted content.
Monitor Processing Costs
- Check embedding costs before full import
- Use smaller embedding dimensions for large sources
- Limit polling_max_items for high-volume feeds
Why: Large sources can quickly accumulate costs.
Content Filtering
- Use "ai_main_content_skip_ads" for news sites
- Use "header_footer_navigation" for most websites
- Use "none" only when you need full page context
Why: Improves search quality and reduces noise.
Crawl Configuration (Websites)
- Start with crawl_depth: 2 or 3
- Use URL filters to include/exclude paths
- Enable sitemap crawling when available
- Set reasonable crawl_limit (100-1000 pages)
Why: Prevents over-crawling and focuses on relevant content.
Limits and Quotas
- Maximum sources per account: Plan-dependent
- Polling frequency: Hourly at most (no real-time polling)
- Crawl depth: Maximum 5 levels
- Content size: Maximum 10MB per item
- Retention: Minimum 1 day
Check your plan limits under Settings → Usage.
Source Connection Sharing
Public sources (RSS feeds, websites) are shared across accounts to optimize resources:
- Source: Shared base entity with URL and type
- Source Connection: Your personal connection with unique settings
- Benefits: Reduced redundant polling, shared storage costs, faster initial setup
Private sources (file uploads, custom indexes) are never shared between accounts.
Agent Integration
Sources can trigger agents automatically when content is added or updated:
// Create agent with content trigger
const agent = await seclai.agents.create({
name: "Summarize New Posts",
triggers: [
{
type: "content_added",
knowledge_base_id: kbId,
},
],
steps: [
/* ... */
],
});
Trigger Types:
- content_added: Runs when new content is indexed
- content_updated: Runs when content is reprocessed
- content_added_or_updated: Runs on either event
Use Cases:
- Generate summaries of new articles
- Send notifications for important updates
- Extract structured data from content
- Categorize and tag content
Troubleshooting
Source Not Polling
Check:
- Polling schedule is not set to "manually"
- Source hasn't failed recently (check error status)
- Account has active subscription
- No maintenance mode active
Content Not Appearing
Possible Causes:
- Still processing (check progress endpoint)
- Content filtered out by seeding rules
- Duplicate content (already indexed)
- Content failed processing (check errors)
High Costs
Solutions:
- Reduce polling frequency
- Lower embedding dimensions
- Set retention policies
- Use polling_max_items limits
- Switch to smaller embedding models
Crawl Not Finding Pages
Solutions:
- Increase crawl_depth
- Check URL filters aren't too restrictive
- Enable sitemap crawling
- Verify robots.txt isn't blocking
- Check crawl_limit isn't too low
API Reference
For complete API documentation, see API Reference.