Content Sources
Content sources are the data inputs that populate your knowledge bases with up-to-date information. Seclai automatically fetches, processes, and indexes content from various sources, keeping your knowledge bases current without manual intervention.
What are Sources?
A source is a connection to external content that you want to make searchable. When you add a source to a knowledge base:
- Initial Polling: Seclai fetches content based on your seeding preferences
- Content Processing: Text is extracted, chunked, and embedded for vector search
- Automatic Updates: Content is refreshed based on your polling schedule
- Multi-Phase Processing: Audio/video is transcribed, and all content is indexed
Sources can be shared across multiple knowledge bases or organizations, with each connection maintaining independent settings for polling, retention, and indexing.
Source Types
RSS Feeds
RSS feeds automatically pull content from blogs, podcasts, news sites, and other syndicated content.
Best For:
- Blog posts and articles
- Podcast episodes
- News feeds
- YouTube channels (via RSS)
Features:
- Automatic detection of RSS/Atom feeds
- Content metadata extraction (title, author, date)
- Support for full content or summary feeds
- Historical data seeding options
Example Use Cases:
- Monitor industry news and trends
- Track competitor blog posts
- Index podcast transcripts for searchability
- Aggregate content from multiple sources
Websites
Website sources crawl and index web pages automatically, following links and respecting crawl limits.
Best For:
- Documentation sites
- Help centers
- Product pages
- Knowledge bases
Features:
- Configurable crawl depth (1-5 levels)
- Sitemap support for efficient crawling
- URL filtering with regex patterns
- Main content extraction (removes headers, footers, navigation)
- AI-powered content filtering to skip ads
Crawl Configuration:
- Crawl Depth: How many levels deep to follow links
- Crawl Limit: Maximum number of pages to crawl
- Path Regular Expressions: Include/exclude URL patterns
- Sitemap: Automatically detect and use sitemap.xml
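As a rough illustration of how the crawl settings above interact, the sketch below decides whether a discovered page should be fetched. The CrawlConfig shape, its field names, and the shouldCrawl helper are hypothetical and are not part of the Seclai API. The sketch also assumes include/exclude patterns are matched against the URL path.

```typescript
// Hypothetical config shape for illustration only; not the Seclai API.
interface CrawlConfig {
  crawlDepth: number;     // how many levels deep to follow links
  crawlLimit: number;     // maximum number of pages to crawl
  includePaths: RegExp[]; // if non-empty, a path must match at least one
  excludePaths: RegExp[]; // a path must match none of these
}

function shouldCrawl(
  path: string,
  depth: number,
  pagesCrawled: number,
  config: CrawlConfig,
): boolean {
  // Stop once the depth or page budget is exhausted.
  if (depth > config.crawlDepth) return false;
  if (pagesCrawled >= config.crawlLimit) return false;

  // Exclusions win over inclusions.
  if (config.excludePaths.some((re) => re.test(path))) return false;
  if (config.includePaths.length === 0) return true;
  return config.includePaths.some((re) => re.test(path));
}

// Example: crawl only /docs/, skip changelog pages, stay within 2 levels.
const docsOnly: CrawlConfig = {
  crawlDepth: 2,
  crawlLimit: 100,
  includePaths: [/^\/docs\//],
  excludePaths: [/\/changelog\//],
};

console.log(shouldCrawl("/docs/intro", 1, 0, docsOnly)); // true
console.log(shouldCrawl("/blog/post", 1, 0, docsOnly));  // false
```

Testing patterns locally in this way before saving a source can help catch filters that are too restrictive.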
Example Use Cases:
- Index your product documentation
- Monitor competitor websites
- Create a searchable help center
- Track regulatory or compliance documentation
Content Store
Content stores (API type: custom_index) let you upload documents, push content via API, or build your own index programmatically. Each content store uses an index mode that controls the trade-off between embedding quality, search speed, and cost.
Index Modes:
| Mode | Dimensions | Chunk Size | Chunk Overlap | Best For |
|---|---|---|---|---|
| Fast & Cheap (default) | 256 | 3,000 chars | 500 chars | High volume, cost-sensitive workloads |
| Balanced | 512 | 1,500 chars | 300 chars | General-purpose use |
| Slow & Thorough | 1,024 | 1,000 chars | 200 chars | Maximum retrieval quality |
| Custom | You choose | You choose | You choose | Full control over embedding model and chunking |
All preset modes use the default embedding model. The Custom mode lets you pick any available embedding model, dimension count, chunk size, overlap, and language-specific or custom separators.
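To get a feel for how chunk size and overlap affect the number of embeddings a document produces, here is a minimal sketch. The preset values come from the table above; the sliding character window is a simplifying assumption and is not Seclai's actual chunker, which may split on separators instead.

```typescript
// Preset values are copied from the Index Modes table. The chunker is a
// simplified character-window sketch, not Seclai's actual implementation.
const PRESETS = {
  fast_and_cheap: { chunkSize: 3000, chunkOverlap: 500 },
  balanced: { chunkSize: 1500, chunkOverlap: 300 },
  slow_and_thorough: { chunkSize: 1000, chunkOverlap: 200 },
};

type PresetMode = keyof typeof PRESETS;

function chunkText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Expected 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap; // each window starts this far after the previous one
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return chunks;
}

function chunkCount(text: string, mode: PresetMode): number {
  const { chunkSize, chunkOverlap } = PRESETS[mode];
  return chunkText(text, chunkSize, chunkOverlap).length;
}

const sample = "x".repeat(10000); // stand-in for a 10,000-character document
console.log(chunkCount(sample, "fast_and_cheap"));    // 4
console.log(chunkCount(sample, "balanced"));          // 9
console.log(chunkCount(sample, "slow_and_thorough")); // 13
```

Under these assumptions, the smaller chunks of Slow & Thorough produce roughly three times as many vectors as Fast & Cheap for the same document.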
Best For:
- Internal documents and custom data
- API-driven content from external systems
- Programmatic content management
- Optimizing retrieval quality with custom embeddings
Supported File Formats:
- Text: .txt, .md, .html, .csv, .json, .xml
- Documents: .pdf, .docx, .ppt/.pptx, .xls/.xlsx, .epub, .msg
- Images: .png, .jpg, .gif, .bmp, .tiff, .webp
- Audio (with transcription): .mp3, .wav, .m4a, .flac, .ogg
- Video (with transcription): .mp4, .mov, .avi
- Archives: .zip
Legacy Word .doc files: Microsoft Word 97–2003 binary documents (.doc) are not supported. Open the file in Microsoft Word, Google Docs, LibreOffice, or Apple Pages, save it as .docx, and upload the new file.
Features:
- Content filtering options
- Custom or preset embedding configuration
- API-driven content creation with full control over metadata
- Embedding migration to change models or dimensions after creation
- Direct file storage integration
Note: The legacy file_uploads source type is accepted as an alias for custom_index with index_mode=fast_and_cheap.
Adding a Source
Basic Setup
// Using the TypeScript SDK
const source = await seclai.sources.create({
  name: "Company Blog",
  source_type: "rss_feed",
  url: "https://example.com/blog/rss.xml",
  polling: "daily",
  retention_days: 90, // Keep content for 90 days
});

// Create a Content Store with a preset index mode
const contentStore = await seclai.sources.create({
  name: "Internal Docs",
  source_type: "custom_index",
  index_mode: "balanced", // or "fast_and_cheap", "slow_and_thorough", "custom"
});
URL Detection
Seclai automatically detects whether a URL is an RSS feed or a website:
// Detect source type
const detection = await seclai.sources.detect({
  url: "https://example.com",
});
console.log(detection.source_type); // "website" or "rss_feed"
console.log(detection.final_url); // Resolved URL after redirects
The detection process:
- Fetches the URL and checks content type
- Analyzes response to identify RSS/Atom feeds
- Returns metadata about the source
- Checks for existing sources at the same URL
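The first two detection steps can be approximated with a small heuristic. The function below is an illustrative sketch only: classifyResponse is a hypothetical name, and the real detection logic may use additional signals.

```typescript
type DetectedType = "rss_feed" | "website";

// Hypothetical heuristic: classify a fetched response from its Content-Type
// header and the first bytes of its body. Not Seclai's actual detector.
function classifyResponse(contentType: string, body: string): DetectedType {
  const type = contentType.toLowerCase();
  if (type.includes("application/rss+xml") || type.includes("application/atom+xml")) {
    return "rss_feed";
  }
  // Feeds are often served as generic XML, so peek at the start of the body.
  const head = body.slice(0, 512).toLowerCase();
  if (type.includes("xml") && (head.includes("<rss") || head.includes("<feed"))) {
    return "rss_feed";
  }
  return "website";
}

console.log(classifyResponse("application/rss+xml; charset=utf-8", "")); // "rss_feed"
console.log(classifyResponse("text/html", "<html><body>Hello</body></html>")); // "website"
```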
Seeding Options
When you first add a source, you can control which content is initially indexed:
Full History
Index all available content from the source.
{
  seed: "full_history"
}
Best For:
- Small content sources
- When you need complete historical context
- Documentation sites with limited pages
Note: Large sources may incur significant embedding costs.
Latest N Items
Index only the most recent items.
{
  seed: "latest_n",
  latest_n: 10 // Last 10 items (default: 5)
}
Best For:
- Large RSS feeds with frequent updates
- When historical content isn't relevant
- Testing new sources before full import
Selected Items
Choose specific items to index from RSS feeds.
{
  seed: "selected_items",
  selected_items: ["item-guid-1", "item-guid-2"]
}
Best For:
- Cherry-picking specific episodes or articles
- Manual content curation
- Importing only relevant historical content
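The three seeding modes can be thought of as filters over the items a source exposes. The sketch below models that idea; the FeedItem and SeedOptions types and the selectSeedItems helper are hypothetical, and only the option names (seed, latest_n, selected_items) and the default of 5 come from this page.

```typescript
// Hypothetical item shape: a GUID plus an ISO 8601 publication date.
interface FeedItem {
  guid: string;
  publishedAt: string;
}

type SeedOptions =
  | { seed: "full_history" }
  | { seed: "latest_n"; latest_n?: number }
  | { seed: "selected_items"; selected_items: string[] };

function selectSeedItems(items: FeedItem[], options: SeedOptions): FeedItem[] {
  switch (options.seed) {
    case "full_history":
      return items;
    case "latest_n": {
      const n = options.latest_n ?? 5; // documented default
      // ISO 8601 dates in the same format sort correctly as strings.
      return [...items]
        .sort((a, b) => b.publishedAt.localeCompare(a.publishedAt))
        .slice(0, n);
    }
    case "selected_items": {
      const wanted = new Set(options.selected_items);
      return items.filter((item) => wanted.has(item.guid));
    }
  }
}

const feed: FeedItem[] = [
  { guid: "a", publishedAt: "2024-01-01" },
  { guid: "b", publishedAt: "2024-03-01" },
  { guid: "c", publishedAt: "2024-02-01" },
];
console.log(selectSeedItems(feed, { seed: "latest_n", latest_n: 2 }).map((i) => i.guid)); // ["b", "c"]
```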
Polling Schedules
Control how often Seclai checks for new content:
Manually
No automatic polling. Trigger updates via API or UI.
{
  polling: "manually"
}
Use When:
- Content rarely changes
- You want manual control
- Content stores
Once
Poll immediately when added, then never again.
{
  polling: "once"
}
Use When:
- One-time content import
- Archived content sources
- Testing source configuration
Hourly
Check for new content every hour.
{
  polling: "hourly"
}
Use When:
- Breaking news feeds
- Real-time monitoring
- Frequently updated sources
Note: Higher polling frequency increases costs.
Daily
Check for new content once per day.
{
  polling: "daily"
}
Use When:
- Blog posts and articles
- Most RSS feeds
- Documentation sites
- Recommended default for most sources
Weekly
Check for new content once per week.
{
  polling: "weekly"
}
Use When:
- Infrequently updated content
- Newsletters or weekly publications
- Cost optimization for low-priority sources
Polling Actions
Control what happens when Seclai finds existing content during polling:
New Only
Only index new content items.
{
  polling_action: "new"
}
Behavior:
- Skips items that were previously indexed
- Lower costs (no reprocessing)
- Doesn't catch content updates
Default for most source types.
New and Updated
Index new content and reprocess updated items.
{
  polling_action: "new_and_updated"
}
Behavior:
- Detects content changes via ETag and Last-Modified headers
- Reprocesses updated content
- Higher accuracy for dynamic content
- Increased costs for reprocessing
Use When:
- Content frequently changes (edited posts, updated documentation)
- Accuracy is more important than cost
- You need the latest version of content
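Change detection via ETag and Last-Modified can be sketched as a comparison between the validators stored from the previous poll and those returned by the current one. The Validators type and hasChanged helper are hypothetical; how Seclai combines the two headers is an assumption here (ETag is preferred when both sides have one).

```typescript
// Validators captured from HTTP response headers on each poll.
interface Validators {
  etag?: string;         // value of the ETag header
  lastModified?: string; // value of the Last-Modified header (HTTP date)
}

function hasChanged(stored: Validators, fresh: Validators): boolean {
  // Prefer ETag when both polls provided one.
  if (stored.etag !== undefined && fresh.etag !== undefined) {
    return stored.etag !== fresh.etag;
  }
  // Otherwise fall back to comparing Last-Modified timestamps.
  if (stored.lastModified !== undefined && fresh.lastModified !== undefined) {
    return Date.parse(fresh.lastModified) > Date.parse(stored.lastModified);
  }
  // Nothing to compare: treat the item as changed so it is reprocessed.
  return true;
}

console.log(hasChanged({ etag: '"v1"' }, { etag: '"v1"' })); // false
console.log(hasChanged({ etag: '"v1"' }, { etag: '"v2"' })); // true
```

Sources that send neither header are always treated as changed in this sketch, which would reprocess them on every poll.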
Polling Limits
Control how many items to fetch per poll:
{
  polling_max_items: 50 // Limit to 50 new items per poll
}
Benefits:
- Prevents cost spikes from large batches
- Gradual processing of backlog
- Predictable resource usage
Pull Status & Errors
Every time Seclai polls a source it creates a pull record describing what happened. The Pulls tab on the source detail page (and the list_source_pulls MCP tool / GET /authenticated/sources/{id}/pulls endpoint) shows the most recent pulls along with a Status column.
Pull Statuses
| Status | Meaning |
|---|---|
| pending | The pull has been scheduled but has not finished yet. |
| completed | At least one URL was scraped successfully. Some individual URLs may still have failed; see errors_count. |
| failed | Every URL attempted in this pull failed. No new content was indexed. |
Each pull also carries:
- items_attempted: how many URLs the pull tried to fetch.
- source_connection_content_count: how many of those produced indexed content for this source connection.
- errors_count: how many URLs failed during this pull.
- error: a short aggregated summary of up to ten failing URLs (each line is url: message), with a … marker when more errors were truncated. The per-pull errors endpoint returns a per-URL breakdown parsed from this same stored summary, so it reflects the same truncation and is not guaranteed to include every failed URL when the pull had more than ten failures.
When the most recent pull failed, the source detail page also surfaces a red banner with a link to the Pulls tab so failures are not missed.
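If you process the error summary yourself, a parser along the following lines may help. It assumes each failure is on its own line in the form url: message and that truncation is signalled by a … at the end of the summary. The exact placement of the marker is an assumption, and parseErrorSummary is a hypothetical helper.

```typescript
interface PullError {
  url: string;
  message: string;
}

interface ParsedSummary {
  errors: PullError[];
  truncated: boolean; // true when the summary ended with the … marker
}

function parseErrorSummary(summary: string): ParsedSummary {
  const errors: PullError[] = [];
  const truncated = summary.trim().endsWith("…");

  for (const rawLine of summary.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line === "…") continue;
    // Split on the first ": " so the "://" inside the URL is left intact.
    const sep = line.indexOf(": ");
    if (sep === -1) continue;
    errors.push({ url: line.slice(0, sep), message: line.slice(sep + 2) });
  }
  return { errors, truncated };
}

const parsed = parseErrorSummary(
  "https://example.com/a: 403 forbidden\nhttps://example.com/b: request timeout\n…",
);
console.log(parsed.errors.length); // 2
console.log(parsed.truncated);     // true
```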
Reading the Status Column
- Green badge (completed with errors_count = 0): All URLs in the pull succeeded.
- Yellow badge (completed with errors_count > 0): Partial success. Hover the badge for an error summary, or click the row to see per-URL details.
- Red badge (failed): No URLs succeeded. Inspect the error summary and the per-URL details to decide whether to retry, change the source URL, or remove the source.
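The badge rules above map directly onto the status and errors_count fields of a pull. A minimal sketch, useful if you render pull results in your own dashboard (the Pull type here only includes the two fields needed, and badgeFor is a hypothetical helper):

```typescript
type PullStatus = "pending" | "completed" | "failed";

interface Pull {
  status: PullStatus;
  errors_count: number;
}

type Badge = "green" | "yellow" | "red" | "none";

function badgeFor(pull: Pull): Badge {
  if (pull.status === "failed") return "red";
  if (pull.status === "completed") {
    return pull.errors_count > 0 ? "yellow" : "green";
  }
  // Pending pulls: no badge colour is described, so none is assumed here.
  return "none";
}

console.log(badgeFor({ status: "completed", errors_count: 0 })); // "green"
console.log(badgeFor({ status: "completed", errors_count: 3 })); // "yellow"
console.log(badgeFor({ status: "failed", errors_count: 12 }));   // "red"
```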
Common Pull Errors
| Error message contains | Cause | Recommended action |
|---|---|---|
| not supported by the scraper / 403 | The site explicitly blocks our scraper, or the URL is on a blocklist. | Use a different URL (an RSS/Atom feed, sitemap, or alternate page). Some sites can be added on request; contact support. |
| unauthorized / 401 | The page requires authentication that we cannot provide. | Either expose a public version of the page, or remove the URL from the source. |
| payment required / 402 | The page is behind a hard paywall. | Use a public summary feed instead. |
| bad request / 400 | The URL is malformed or returns an unexpected response. | Double-check the URL works in a browser. Fix or remove it from the source. |
| request timeout / 408 | The remote server took too long to respond. | Usually transient. Wait for the next scheduled pull, or trigger a manual pull. If it keeps failing, the site may be down. |
| HTTP 4xx / HTTP 5xx | The remote server returned an error. | 4xx usually indicates a permanent issue (page moved or removed). 5xx is usually transient; retry on the next pull. |
| rate limited / 429 | We were throttled by the scraper or the upstream site. | We retry rate-limited requests automatically. If you see this repeatedly, reduce the polling frequency. |
If a pull keeps failing for the same reason, consider:
- Editing the source URL or filters.
- Lowering the polling frequency.
- Removing the source if the underlying site no longer exists.
- Contacting support so the site can be reviewed for compatibility.
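For automated triage, the table above can be condensed into a rough classifier keyed on the HTTP status code embedded in the error message. This is a sketch under the assumption that the message contains the numeric status; the Advice labels and adviseFor helper are hypothetical.

```typescript
type Advice =
  | "retry_on_next_pull"
  | "reduce_polling_if_repeated"
  | "try_different_url"
  | "fix_or_remove_url"
  | "inspect_manually";

function adviseFor(message: string): Advice {
  // Look for a standalone 4xx or 5xx status code in the message.
  const match = /\b([45]\d\d)\b/.exec(message);
  if (match === null) return "inspect_manually";

  const status = Number(match[1]);
  if (status === 429) return "reduce_polling_if_repeated";
  if (status === 408 || status >= 500) return "retry_on_next_pull"; // usually transient
  if (status === 402 || status === 403) return "try_different_url"; // blocked or paywalled
  return "fix_or_remove_url"; // other 4xx errors are usually permanent
}

console.log(adviseFor("HTTP 503 service unavailable")); // "retry_on_next_pull"
console.log(adviseFor("rate limited (429)"));           // "reduce_polling_if_repeated"
```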
File Upload Processing Errors
When you upload a file to a Content Store, the file is accepted immediately and then processed asynchronously. If processing fails, the content item shows a failed status on the Content tab. Hover over the red status badge to see the specific error message.
| Error message | Cause | Recommended action |
|---|---|---|
| "This file could not be converted…" | The file format is not supported by the text extraction engine, or the file is corrupted. | Re-save the file in a supported format (PDF, DOCX, PPTX, XLSX, EPUB, MSG, HTML, CSV, JSON, XML, TXT, MD). For scanned PDFs, ensure text is selectable. |
| "No text could be extracted…" | The file was processed but contained no extractable text (e.g. an image-only PDF without OCR-readable text, or an empty document). | Open the file locally to verify it has readable text. For image-based PDFs, re-export with OCR enabled. |
| "An unexpected error occurred…" | An internal processing error. | Try uploading again. If the error persists, contact support. |
Tips for successful uploads:
- Ensure PDFs contain selectable text (not just scanned images).
- Use standard document formats — exotic or proprietary formats may not be supported.
- Keep individual files under 200 MB for reliable processing.
- For image files (.png, .jpg, etc.), OCR is attempted automatically but works best with clear, high-contrast text.
Content Retention
Automatically remove old content from your knowledge base:
{
  retention_days: 90 // Delete content older than 90 days
}
Settings:
- Null (default): Keep content indefinitely
- Number: Delete content after N days
Use Cases:
- GDPR compliance (automatic data deletion)
- Keep only recent, relevant content
- Control storage and search costs
- Maintain fresh results
Example: Set to 30 days for news feeds, 365 days for documentation.
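The retention rule can be expressed as a simple age check. The sketch below assumes age is measured from the item's creation time, which this page does not confirm; isExpired is a hypothetical helper.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// retentionDays === null means "keep indefinitely" (the default).
function isExpired(createdAt: Date, retentionDays: number | null, now: Date): boolean {
  if (retentionDays === null) return false;
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs > retentionDays * MS_PER_DAY;
}

const created = new Date("2024-01-01T00:00:00Z");
console.log(isExpired(created, 90, new Date("2024-03-01T00:00:00Z")));   // false (60 days old)
console.log(isExpired(created, 90, new Date("2024-04-15T00:00:00Z")));   // true (105 days old)
console.log(isExpired(created, null, new Date("2030-01-01T00:00:00Z"))); // false
```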
Content Processing Pipeline
Phase 1: Fetching
Content is fetched from the source URL.
For RSS Feeds:
- Parses feed XML
- Extracts item metadata (title, author, date, GUID)
- Fetches full article content from item URLs
For Websites:
- Crawls pages following configured rules
- Extracts main content
- Removes navigation, ads, headers/footers
- Stores HTML for processing
Phase 2: Transcription (Audio/Video)
Audio and video content is transcribed to text.
Supported Formats:
- MP3, WAV, M4A (audio)
- MP4, MOV, AVI (video)
Features:
- Automatic language detection
- Speaker diarization
- Timestamp preservation
- Word-level confidence scores
Costs: Based on audio duration (minutes).
Phase 3: Indexing
Text content is chunked and embedded for vector search.
Process:
- Text Extraction: Clean and prepare text
- Chunking: Split into optimal segments (default: 1000 characters, 200 overlap)
- Embedding: Generate vector embeddings using configured model
- Storage: Store in PostgreSQL with pgvector
Costs: Text extraction is billed at 1 credit per second of processing time.
Configuration:
- Chunk Size: Length of text segments (default: 1000 characters)
- Chunk Overlap: Overlap between chunks (default: 200 characters)
- Embedding Model: Model used for vector generation
- Dimensions: Vector size (256-3584 based on model)
Phase 4: Linking
Content is linked to knowledge bases and triggers agents.
Actions:
- Links content to all knowledge bases using the source
- Triggers agents with "content_added" or "content_updated" triggers
- Records usage costs for embeddings and processing
- Updates content counts and metadata
Content Filtering
Control what content is extracted from websites and documents:
None
Extract all content without filtering.
{
  content_filter: "none"
}
Header, Footer, Navigation
Remove common page elements.
{
  content_filter: "header_footer_navigation"
}
Removes:
- Navigation menus
- Headers and footers
- Sidebars
- Copyright notices
Best For: Most website sources.
AI Main Content
Use AI to extract only the main content.
{
  content_filter: "ai_main_content"
}
Features:
- Removes all non-content elements
- Preserves article structure
- Keeps images and media references
Best For: News sites, blogs, documentation.
AI Main Content (Skip Ads)
AI extraction with aggressive ad removal.
{
  content_filter: "ai_main_content_skip_ads"
}
Features:
- Everything from AI Main Content
- Removes advertisements
- Removes promotional content
- Removes sponsored sections
Best For: Sites with heavy advertising, news sites.
Note: Content filtering is only available for content stores.
Embedding Configuration
Each source can use different embedding models:
{
  embedding_model: "text-embedding-3-small",
  dimensions: 1536,
  chunk_size: 1000,
  chunk_overlap: 200
}
Available Models:
- text-embedding-3-small: 512 or 1536 dimensions (recommended)
- text-embedding-3-large: 256 or 3072 dimensions (highest quality)
- amazon-nova-2-multimodal: 256 or 1024 dimensions
- voyage-3: 512 or 1024 dimensions
- And 90+ other models from OpenAI, Cohere, Voyage, Jina, Mistral
Considerations:
- Higher dimensions: Better accuracy, higher costs
- Lower dimensions: Faster search, lower costs
- Model consistency: Use the same model across sources in a knowledge base for best results
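To make the dimension trade-off concrete, the sketch below estimates raw vector storage. It assumes each dimension is stored as a 4-byte float and ignores index and row overhead, so treat the numbers as a lower bound rather than a billing figure. The estimateVectorBytes helper is hypothetical.

```typescript
const BYTES_PER_FLOAT32 = 4; // assumption: vectors are stored as 32-bit floats

function estimateVectorBytes(chunkCount: number, dimensions: number): number {
  return chunkCount * dimensions * BYTES_PER_FLOAT32;
}

const chunks = 100000; // e.g. a large source split into 100,000 chunks
console.log(estimateVectorBytes(chunks, 512) / 1e6);  // 204.8 (MB)
console.log(estimateVectorBytes(chunks, 1536) / 1e6); // 614.4 (MB)
```

Tripling the dimension count triples raw storage for the same content, which is one reason lower dimensions suit large sources.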
Managing Sources
Update Source Settings
await seclai.sources.update(sourceId, {
  name: "Updated Name",
  polling: "weekly",
  retention_days: 180,
  polling_action: "new_and_updated",
});
Manual Polling
Trigger an immediate content update:
await seclai.sources.pull(sourceId, {
  seed: "latest_n",
  latest_n: 5,
});
View Source Content
Each source produces indexed content items (articles, files, documents). You can list, inspect, replace, and delete individual items via the API. See Contents for the full content management API.
const content = await seclai.sources.listContent(sourceId, {
  page: 1,
  limit: 20,
  sort: "created_at",
  order: "desc",
});
Monitor Processing Progress
const progress = await seclai.sources.getProgress(sourceId);
console.log(progress.total); // Total items
console.log(progress.completed); // Completed items
console.log(progress.phases); // Phase-by-phase breakdown
Phases:
- fetching: Content retrieval from source
- transcribing: Audio/video transcription
- indexing: Embedding generation
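If you need to block until a source has finished processing, one option is to poll the progress call at an interval. The helper below is a sketch: waitForCompletion is hypothetical, the interval and attempt limit are arbitrary choices, and it takes the progress function as a parameter so it does not depend on any particular SDK shape.

```typescript
interface Progress {
  total: number;     // total items
  completed: number; // completed items
}

async function waitForCompletion(
  getProgress: () => Promise<Progress>,
  intervalMs: number = 2000,
  maxAttempts: number = 30,
): Promise<Progress> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const progress = await getProgress();
    // Done once every known item has been processed.
    if (progress.total > 0 && progress.completed >= progress.total) {
      return progress;
    }
    // Wait before asking again to avoid hammering the API.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for source processing to finish");
}

// Possible usage with the SDK call shown above:
// const done = await waitForCompletion(() => seclai.sources.getProgress(sourceId));
```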
Delete a Source
await seclai.sources.delete(sourceId);
Note: Deleting a source removes it from all knowledge bases and deletes all associated content.
Best Practices
Choose the Right Polling Frequency
- Hourly: Only for critical, frequently updated sources
- Daily: Default for most RSS feeds and blogs
- Weekly: Documentation, newsletters, low-priority sources
- Manually: Content stores, one-time imports
Why: Higher polling frequency increases costs and processing overhead.
Use Retention Policies
- Set retention periods for time-sensitive content
- Keep 30-90 days for news feeds
- Keep 365+ days for documentation
- Null (infinite) for permanent content
Why: Reduces storage costs and keeps results relevant.
Optimize Seeding
- Start with "latest_n" (5-10 items) for testing
- Expand to "full_history" once verified
- Use "selected_items" for specific content
Why: Prevents expensive initial indexing of unwanted content.
Monitor Processing Costs
- Check embedding costs before full import
- Use smaller embedding dimensions for large sources
- Limit polling_max_items for high-volume feeds
Why: Large sources can quickly accumulate costs.
Content Filtering
- Use "ai_main_content_skip_ads" for news sites
- Use "header_footer_navigation" for most websites
- Use "none" only when you need full page context
Why: Improves search quality and reduces noise.
Crawl Configuration (Websites)
- Start with crawl_depth: 2 or 3
- Use URL filters to include/exclude paths
- Enable sitemap crawling when available
- Set reasonable crawl_limit (100-1000 pages)
Why: Prevents over-crawling and focuses on relevant content.
Limits and Quotas
- Maximum sources per account: Plan-dependent
- Polling frequency: Hourly is the most frequent schedule (no real-time polling)
- Crawl depth: Maximum 5 levels
- Content size: Maximum 10 MB per item
- Retention: Minimum 1 day
Check your plan limits in your account under Settings → Usage.
Source Connection Sharing
Public sources (RSS feeds, websites) are shared across accounts to optimize resources:
- Source: Shared base entity with URL and type
- Source Connection: Your personal connection with unique settings
- Benefits: Reduced redundant polling, shared storage costs, faster initial setup
Private sources (content stores) are never shared between accounts.
Agent Integration
Sources can trigger agents automatically when content is added or updated:
// Create agent with content trigger
const agent = await seclai.agents.create({
  name: "Summarize New Posts",
  triggers: [
    {
      type: "content_added",
      knowledge_base_id: kbId,
    },
  ],
  steps: [
    /* ... */
  ],
});
Trigger Types:
- content_added: Runs when new content is indexed
- content_updated: Runs when content is reprocessed
- content_added_or_updated: Runs on either event
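The relationship between the three trigger types and the two underlying events can be summarised in a few lines. The type names and the triggerFires helper are hypothetical; only the trigger strings come from this page.

```typescript
type ContentEvent = "content_added" | "content_updated";
type TriggerType = ContentEvent | "content_added_or_updated";

// A trigger fires when it names the event directly, or when it is the
// combined trigger that covers both events.
function triggerFires(trigger: TriggerType, event: ContentEvent): boolean {
  return trigger === event || trigger === "content_added_or_updated";
}

console.log(triggerFires("content_added", "content_added"));              // true
console.log(triggerFires("content_added", "content_updated"));            // false
console.log(triggerFires("content_added_or_updated", "content_updated")); // true
```

Note that content_updated events only occur for sources whose polling action reprocesses updated items.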
Use Cases:
- Generate summaries of new articles
- Send notifications for important updates
- Extract structured data from content
- Categorize and tag content
Troubleshooting
Source Not Polling
Check:
- Polling schedule is not set to "manually"
- Source hasn't failed recently (check error status)
- Account has active subscription
- No maintenance mode active
Content Not Appearing
Possible Causes:
- Still processing (check progress endpoint)
- Content filtered out by seeding rules
- Duplicate content (already indexed)
- Content failed processing (check errors)
High Costs
Solutions:
- Reduce polling frequency
- Lower embedding dimensions
- Set retention policies
- Use polling_max_items limits
- Switch to smaller embedding models
Crawl Not Finding Pages
Solutions:
- Increase crawl_depth
- Check URL filters aren't too restrictive
- Enable sitemap crawling
- Verify robots.txt isn't blocking
- Check crawl_limit isn't too low
API Reference
For complete API documentation, see API Reference.