Content Sources

Content sources are the data inputs that populate your knowledge bases with up-to-date information. Seclai automatically fetches, processes, and indexes content from various sources, keeping your knowledge bases current without manual intervention.

What are Sources?

A source is a connection to external content that you want to make searchable. When you add a source to a knowledge base:

Initial Polling: Seclai fetches content based on your seeding preferences
Content Processing: Text is extracted, chunked, and embedded for vector search
Automatic Updates: Content is refreshed based on your polling schedule
Multi-Phase Processing: Audio/video is transcribed, and all content is indexed

Sources can be shared across multiple knowledge bases or organizations, with each connection maintaining independent settings for polling, retention, and indexing.

Source Types

RSS Feeds

RSS feeds automatically pull content from blogs, podcasts, news sites, and other syndicated content.

Best For:

Blog posts and articles
Podcast episodes
News feeds
YouTube channels (via RSS)

Features:

Automatic detection of RSS/Atom feeds
Content metadata extraction (title, author, date)
Support for full content or summary feeds
Historical data seeding options

Example Use Cases:

Monitor industry news and trends
Track competitor blog posts
Index podcast transcripts for searchability
Aggregate content from multiple sources

Websites

Website sources crawl and index web pages automatically, following links and respecting crawl limits.

Best For:

Documentation sites
Help centers
Product pages
Knowledge bases

Features:

Configurable crawl depth (1-5 levels)
Sitemap support for efficient crawling
URL filtering with regex patterns
Main content extraction (removes headers, footers, navigation)
AI-powered content filtering to skip ads

Crawl Configuration:

Crawl Depth: How many levels deep to follow links
Crawl Limit: Maximum number of pages to crawl
Path Regular Expressions: Include/exclude URL patterns
Sitemap: Automatically detect and use sitemap.xml
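
As a rough illustration of how these settings interact, the sketch below decides whether a discovered page would be crawled. The CrawlConfig shape, its field names, and shouldCrawl are hypothetical and not part of the Seclai SDK, and the real crawler may apply the rules in a different order. Depth is assumed to count the links followed from the starting URL.

```typescript
// Hypothetical model of the crawl settings; not the Seclai SDK's types.
interface CrawlConfig {
  crawlDepth: number; // how many link levels to follow from the start URL
  crawlLimit: number; // maximum number of pages to crawl
  includePaths: RegExp[]; // empty means "include every path"
  excludePaths: RegExp[];
}

// Decide whether a discovered page should be fetched.
// `depth` is the number of links followed to reach it (assumption);
// `crawledSoFar` is the number of pages already crawled.
function shouldCrawl(
  path: string,
  depth: number,
  crawledSoFar: number,
  config: CrawlConfig,
): boolean {
  if (depth > config.crawlDepth) return false;
  if (crawledSoFar >= config.crawlLimit) return false;
  if (config.excludePaths.some((re) => re.test(path))) return false;
  if (config.includePaths.length === 0) return true;
  return config.includePaths.some((re) => re.test(path));
}

const docsOnly: CrawlConfig = {
  crawlDepth: 2,
  crawlLimit: 100,
  includePaths: [/^\/docs\//],
  excludePaths: [/\/changelog\//],
};

console.log(shouldCrawl("/docs/setup", 1, 10, docsOnly)); // true
```

In this sketch exclusions win over inclusions, so an excluded path is skipped even when it also matches an include pattern.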

Example Use Cases:

Index your product documentation
Monitor competitor websites
Create a searchable help center
Track regulatory or compliance documentation

Content Store

Content stores (API type: custom_index) let you upload documents, push content via API, or build your own index programmatically. Each content store uses an index mode that controls the trade-off between embedding quality, search speed, and cost.

Index Modes:

Mode	Dimensions	Chunk Size	Chunk Overlap	Best For
Fast & Cheap (default)	256	3,000 chars	500 chars	High volume, cost-sensitive workloads
Balanced	512	1,500 chars	300 chars	General-purpose use
Slow & Thorough	1,024	1,000 chars	200 chars	Maximum retrieval quality
Custom	You choose	You choose	You choose	Full control over embedding model and chunking

All preset modes use the default embedding model. The Custom mode lets you pick any available embedding model, dimension count, chunk size, overlap, and language-specific or custom separators.
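
The preset values in the table can be modeled as a simple lookup. The sketch below is illustrative only: the mode identifiers match the index_mode values used elsewhere on this page, but IndexSettings, PRESETS, and resolveIndexSettings are hypothetical names, and the overlap check is an assumed sanity rule rather than documented validation.

```typescript
// Preset values copied from the Index Modes table above.
type IndexMode = "fast_and_cheap" | "balanced" | "slow_and_thorough" | "custom";

interface IndexSettings {
  dimensions: number;
  chunkSize: number; // characters
  chunkOverlap: number; // characters
}

const PRESETS: Record<Exclude<IndexMode, "custom">, IndexSettings> = {
  fast_and_cheap: { dimensions: 256, chunkSize: 3000, chunkOverlap: 500 },
  balanced: { dimensions: 512, chunkSize: 1500, chunkOverlap: 300 },
  slow_and_thorough: { dimensions: 1024, chunkSize: 1000, chunkOverlap: 200 },
};

// Hypothetical helper: presets ignore `custom`; the custom mode requires it.
function resolveIndexSettings(mode: IndexMode, custom?: IndexSettings): IndexSettings {
  if (mode !== "custom") return PRESETS[mode];
  if (!custom) throw new Error("custom mode needs explicit settings");
  if (custom.chunkOverlap >= custom.chunkSize) {
    // Assumed sanity rule: an overlap this large would never advance.
    throw new Error("chunk overlap must be smaller than chunk size");
  }
  return custom;
}

console.log(resolveIndexSettings("balanced"));
```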

Best For:

Internal documents and custom data
API-driven content from external systems
Programmatic content management
Optimizing retrieval quality with custom embeddings

Supported File Formats:

Text: .txt, .md, .html, .csv, .json, .xml
Documents: .pdf, .docx, .ppt/.pptx, .xls/.xlsx, .epub, .msg
Images: .png, .jpg, .gif, .bmp, .tiff, .webp
Audio (with transcription): .mp3, .wav, .m4a, .flac, .ogg
Video (with transcription): .mp4, .mov, .avi
Archives: .zip

Legacy Word .doc files: Microsoft Word 97–2003 binary documents (.doc) are not supported. Open the file in Microsoft Word, Google Docs, LibreOffice, or Apple Pages, save it as .docx, and upload the new file.

Features:

Content filtering options
Custom or preset embedding configuration
API-driven content creation with full control over metadata
Embedding migration to change models or dimensions after creation
Direct file storage integration

Note: The legacy file_uploads source type is accepted as an alias for custom_index with index_mode=fast_and_cheap.

Cloud Drives

Cloud drive sources (type: cloud_drive) ingest a folder from a connected cloud drive into a knowledge base and keep it in sync. New and changed files are pulled in automatically, in near real time via the provider's change webhook, with a periodic poll as the backstop. Connect a drive once under Integrations → Cloud Drives (Dropbox is supported today), then point a source at a folder. See the Cloud Drives guide for the connection walkthrough.

Best For:

Keeping a knowledge base in sync with a shared team folder
Ingesting reports, meeting notes, and proposals as they land
Research papers and technical documentation kept on a drive

Features:

Automatic add/update ingestion — each file flows through the normal extract → index pipeline, versioned by the provider's file id
Near-real-time sync via the provider webhook, with a scheduled poll as the backstop
Optional filters: a folder path, a path glob (e.g. **/*.pdf), a MIME allowlist (e.g. application/pdf, image/*), and a per-file size cap
The same file support as uploads, including audio/video transcription and document/image text extraction

Configuration:

Connection (required) — the cloud-drive connection the files come from
Folder path — the folder to ingest (blank = the connection root)
Path pattern — optional case-insensitive glob to keep only matching files
Content types — optional comma-separated MIME allowlist
Max file size (MB) — optional per-file cap (bounded by the platform ceiling)
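
To make the optional filters concrete, here is a minimal sketch of how a file could be checked against a path pattern, a MIME allowlist, and a size cap. Every name in it (DriveFile, DriveFilters, globToRegExp, mimeAllowed, shouldIngest) is hypothetical. The glob semantics are also an assumption, since this page only says the pattern is a case-insensitive glob: here `**/` matches any number of folders, `*` matches within one path segment, and paths are relative to the configured folder.

```typescript
interface DriveFile {
  path: string; // relative to the configured folder (assumption)
  mimeType: string;
  sizeBytes: number;
}

interface DriveFilters {
  pathPattern?: string; // e.g. "**/*.pdf"
  contentTypes?: string; // e.g. "application/pdf, image/*"
  maxFileSizeMb?: number;
}

// Translate a simple glob into a case-insensitive regular expression.
function globToRegExp(glob: string): RegExp {
  let pattern = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      pattern += "(?:.*/)?"; // any number of folders, including none
      i += 3;
    } else if (glob.startsWith("**", i)) {
      pattern += ".*";
      i += 2;
    } else if (glob.charAt(i) === "*") {
      pattern += "[^/]*"; // stays inside one path segment
      i += 1;
    } else if (glob.charAt(i) === "?") {
      pattern += "[^/]";
      i += 1;
    } else {
      // Escape regex metacharacters in literal text.
      pattern += glob.charAt(i).replace(/[.+^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp(`^${pattern}$`, "i");
}

// Check a MIME type against a comma-separated allowlist; "image/*" matches any image subtype.
function mimeAllowed(mimeType: string, allowlist: string): boolean {
  const entries = allowlist
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0);
  if (entries.length === 0) return true;
  const value = mimeType.toLowerCase();
  return entries.some((entry) =>
    entry.endsWith("/*") ? value.startsWith(entry.slice(0, -1)) : value === entry,
  );
}

function shouldIngest(file: DriveFile, filters: DriveFilters): boolean {
  if (filters.pathPattern && !globToRegExp(filters.pathPattern).test(file.path)) return false;
  if (filters.contentTypes && !mimeAllowed(file.mimeType, filters.contentTypes)) return false;
  if (filters.maxFileSizeMb !== undefined && file.sizeBytes > filters.maxFileSizeMb * 1024 * 1024) {
    return false;
  }
  return true;
}
```

A file has to pass every filter that is set; filters left blank are treated as "allow everything".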

Supported providers: Dropbox (more as they ship). File types match the Content Store formats above.

Note: v1 ingests adds and updates only — deleting a file on the drive does not remove it from the knowledge base yet. Governance source-content screening applies to cloud-drive files just like any other source.

Adding a Source

Basic Setup

// Using the TypeScript SDK
const source = await seclai.sources.create({
  name: "Company Blog",
  source_type: "rss_feed",
  url: "https://example.com/blog/rss.xml",
  polling: "daily",
  retention_days: 90, // Keep content for 90 days
});

// Create a Content Store with a preset index mode
const contentStore = await seclai.sources.create({
  name: "Internal Docs",
  source_type: "custom_index",
  index_mode: "balanced", // or "fast_and_cheap", "slow_and_thorough", "custom"
});

URL Detection

Seclai automatically detects whether a URL is an RSS feed or a website:

// Detect source type
const detection = await seclai.sources.detect({
  url: "https://example.com",
});

console.log(detection.source_type); // "website" or "rss_feed"
console.log(detection.final_url); // Resolved URL after redirects

The detection process:

Fetches the URL and checks content type
Analyzes response to identify RSS/Atom feeds
Returns metadata about the source
Checks for existing sources at the same URL

Seeding Options

When you first add a source, you can control which content is initially indexed:

Full History

Index all available content from the source.

{
  seed: "full_history",
}

Best For:

Small content sources
When you need complete historical context
Documentation sites with limited pages

Note: Large sources may incur significant embedding costs.

Latest N Items

Index only the most recent items.

{
  seed: "latest_n",
  latest_n: 10  // Last 10 items (default: 5)
}

Best For:

Large RSS feeds with frequent updates
When historical content isn't relevant
Testing new sources before full import

Selected Items

Choose specific items to index from RSS feeds.

{
  seed: "selected_items",
  selected_items: ["item-guid-1", "item-guid-2"]
}

Best For:

Cherry-picking specific episodes or articles
Manual content curation
Importing only relevant historical content
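
The three seeding modes can be pictured as a selection over the items a feed currently exposes. The following is a sketch, not Seclai's implementation: FeedItem, Seed, and selectSeedItems are hypothetical names, and it assumes "latest" means most recent by publication date.

```typescript
interface FeedItem {
  guid: string;
  publishedAt: string; // ISO 8601 timestamp
}

// Mirrors the seeding options shown above.
type Seed =
  | { seed: "full_history" }
  | { seed: "latest_n"; latest_n?: number }
  | { seed: "selected_items"; selected_items: string[] };

// Return the items that would be indexed on the initial poll.
function selectSeedItems(items: FeedItem[], config: Seed): FeedItem[] {
  switch (config.seed) {
    case "full_history":
      return items;
    case "latest_n": {
      const n = config.latest_n ?? 5; // documented default is 5
      return items
        .slice() // copy so the caller's array is not reordered
        .sort((a, b) => Date.parse(b.publishedAt) - Date.parse(a.publishedAt))
        .slice(0, n);
    }
    case "selected_items": {
      const wanted = new Set(config.selected_items);
      return items.filter((item) => wanted.has(item.guid));
    }
  }
}
```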

Polling Schedules

Control how often Seclai checks for new content:

Manually

No automatic polling. Trigger updates via API or UI.

{
  polling: "manually",
}

Use When:

Content rarely changes
You want manual control
Content stores

Once

Poll immediately when added, then never again.

{
  polling: "once",
}

Use When:

One-time content import
Archived content sources
Testing source configuration

Hourly

Check for new content every hour.

{
  polling: "hourly",
}

Use When:

Breaking news feeds
Real-time monitoring
Frequently updated sources

Note: Higher polling frequency increases costs.

Daily

Check for new content once per day.

{
  polling: "daily",
}

Use When:

Blog posts and articles
Most RSS feeds
Documentation sites
Recommended default for most sources

Weekly

Check for new content once per week.

{
  polling: "weekly",
}

Use When:

Infrequently updated content
Newsletters or weekly publications
Cost optimization for low-priority sources
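
One way to reason about the five schedules is to ask when the next automatic poll would be due. The sketch below is a simplified model with hypothetical names (Polling, nextPollAt). It assumes every non-manual schedule polls once as soon as the source is added and that later polls are spaced by a fixed interval from the previous poll; the actual scheduler may behave differently.

```typescript
type Polling = "manually" | "once" | "hourly" | "daily" | "weekly";

const HOUR_MS = 60 * 60 * 1000;

const INTERVAL_MS: Record<"hourly" | "daily" | "weekly", number> = {
  hourly: HOUR_MS,
  daily: 24 * HOUR_MS,
  weekly: 7 * 24 * HOUR_MS,
};

// Returns when the next automatic poll is due, or null if none will happen.
function nextPollAt(polling: Polling, addedAt: Date, lastPolledAt: Date | null): Date | null {
  if (polling === "manually") return null; // only API or UI triggers
  if (polling === "once") return lastPolledAt === null ? addedAt : null;
  if (lastPolledAt === null) return addedAt; // assumed: first poll happens on add
  return new Date(lastPolledAt.getTime() + INTERVAL_MS[polling]);
}

const addedAt = new Date("2024-01-01T00:00:00Z");
console.log(nextPollAt("daily", addedAt, addedAt)); // one day after the last poll
```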

Polling Actions

Control what happens when Seclai finds existing content during polling:

New Only

Only index new content items.

{
  polling_action: "new",
}

Behavior:

Skips items that were previously indexed
Lower costs (no reprocessing)
Doesn't catch content updates

Default for most source types.

New and Updated

Index new content and reprocess updated items.

{
  polling_action: "new_and_updated",
}

Behavior:

Detects content changes via ETag and Last-Modified headers
Reprocesses updated content
Higher accuracy for dynamic content
Increased costs for reprocessing

Use When:

Content frequently changes (edited posts, updated documentation)
Accuracy is more important than cost
You need the latest version of content
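
The difference between the two polling actions comes down to what happens when an already-indexed item is seen again. Below is a minimal sketch of that decision using ETag and Last-Modified values. Validators, Decision, and decidePollAction are hypothetical names, and the fallback when neither header is available (reprocess) is an assumption rather than documented behavior.

```typescript
interface Validators {
  etag?: string;
  lastModified?: string; // HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
}

type PollingAction = "new" | "new_and_updated";
type Decision = "index" | "reprocess" | "skip";

// `stored` is undefined when the item has never been indexed.
function decidePollAction(
  action: PollingAction,
  stored: Validators | undefined,
  fetched: Validators,
): Decision {
  if (stored === undefined) return "index"; // new item under either action
  if (action === "new") return "skip"; // existing items are never reprocessed

  // Prefer ETag when both sides have one.
  if (stored.etag !== undefined && fetched.etag !== undefined) {
    return stored.etag === fetched.etag ? "skip" : "reprocess";
  }
  // Fall back to comparing Last-Modified timestamps.
  if (stored.lastModified !== undefined && fetched.lastModified !== undefined) {
    return Date.parse(fetched.lastModified) > Date.parse(stored.lastModified)
      ? "reprocess"
      : "skip";
  }
  // Assumption: with nothing to compare, treat the item as changed.
  return "reprocess";
}
```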

Polling Limits

Control how many items to fetch per poll:

{
  polling_max_items: 50, // Limit to 50 new items per poll
}

Benefits:

Prevents cost spikes from large batches
Gradual processing of backlog
Predictable resource usage
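
As a quick worked example of gradual backlog processing: a feed with 230 unprocessed items and a limit of 50 items per poll needs ceil(230 / 50) = 5 polls to catch up, which is about five days on a daily schedule. The helper below is a hypothetical illustration of that arithmetic, assuming items left over from one poll are picked up by the next.

```typescript
// Number of polls needed to work through a backlog at a fixed per-poll limit.
function pollsToDrain(backlog: number, pollingMaxItems: number): number {
  if (pollingMaxItems <= 0) throw new Error("pollingMaxItems must be positive");
  return Math.ceil(backlog / pollingMaxItems);
}

console.log(pollsToDrain(230, 50)); // 5
```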

Pull Status & Errors

Every time Seclai polls a source it creates a pull record describing what happened. The Pulls tab on the source detail page (and the list_source_pulls MCP tool / GET /authenticated/sources/{id}/pulls endpoint) shows the most recent pulls along with a Status column.

Pull Statuses

Status	Meaning
`pending`	The pull has been scheduled but has not finished yet.
`completed`	At least one URL was scraped successfully. Some individual URLs may still have failed — see `errors_count`.
`failed`	Every URL attempted in this pull failed. No new content was indexed.

Each pull also carries:

items_attempted — how many URLs the pull tried to fetch.
source_connection_content_count — how many of those produced indexed content for this source connection.
errors_count — how many URLs failed during this pull.
error — a short aggregated summary of up to ten failing URLs (each line is url: message), with a … marker when more errors were truncated. The per-pull errors endpoint returns a per-URL breakdown parsed from this same stored summary, so it reflects the same truncation and is not guaranteed to include every failed URL when the pull had more than ten failures.
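
If you read the error field directly, it can be split back into per-URL entries. The parser below is only a sketch: PullError, ParsedSummary, and parseErrorSummary are hypothetical names, and it assumes entries are newline-separated, the URL and message are separated by a colon followed by a space, and the … marker sits on its own line. Check a real pull record before relying on any of these assumptions.

```typescript
interface PullError {
  url: string;
  message: string;
}

interface ParsedSummary {
  errors: PullError[];
  truncated: boolean; // true when the "…" marker was present
}

function parseErrorSummary(summary: string): ParsedSummary {
  const errors: PullError[] = [];
  let truncated = false;

  for (const rawLine of summary.split("\n")) {
    const line = rawLine.trim();
    if (line.length === 0) continue;
    if (line === "…" || line === "...") {
      truncated = true; // more failures existed than the summary holds
      continue;
    }
    // Split on the first ": " so the "https://" colon is left alone.
    const sep = line.indexOf(": ");
    if (sep === -1) {
      errors.push({ url: "", message: line });
    } else {
      errors.push({ url: line.slice(0, sep), message: line.slice(sep + 2) });
    }
  }
  return { errors, truncated };
}
```

When truncated is true, errors_count remains the authoritative total, since the summary holds at most ten entries.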

When the most recent pull failed, the source detail page also surfaces a red banner with a link to the Pulls tab so failures are not missed.

Reading the Status Column

Green badge — completed with errors_count = 0: All URLs in the pull succeeded.
Yellow badge — completed with errors_count > 0: Partial success. Hover the badge for an error summary, or click the row to see per-URL details.
Red badge — failed: No URLs succeeded. Inspect the error summary and the per-URL details to decide whether to retry, change the source URL, or remove the source.
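
The badge rules reduce to a small function of the pull status and errors_count. This sketch uses hypothetical names (PullStatus, Badge, badgeFor), and the "none" result for pending pulls is an assumption, since no badge color is described for that state here.

```typescript
type PullStatus = "pending" | "completed" | "failed";
type Badge = "green" | "yellow" | "red" | "none";

function badgeFor(status: PullStatus, errorsCount: number): Badge {
  if (status === "failed") return "red"; // no URLs succeeded
  if (status === "completed") {
    return errorsCount > 0 ? "yellow" : "green"; // partial vs full success
  }
  return "none"; // pending: assumed to have no color yet
}

console.log(badgeFor("completed", 3)); // "yellow"
```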

Common Pull Errors

Error message contains	Cause	Recommended action
`not supported by the scraper` / `403`	The site explicitly blocks our scraper, or the URL is on a blocklist.	Use a different URL (an RSS/Atom feed, sitemap, or alternate page). Some sites can be added on request — contact support.
`unauthorized` / `401`	The page requires authentication that we cannot provide.	Either expose a public version of the page, or remove the URL from the source.
`payment required` / `402`	The page is behind a hard paywall.	Use a public summary feed instead.
`bad request` / `400`	The URL is malformed or returns an unexpected response.	Double-check the URL works in a browser. Fix or remove it from the source.
`request timeout` / `408`	The remote server took too long to respond.	Usually transient — wait for the next scheduled pull, or trigger a manual pull. If it keeps failing, the site may be down.
`HTTP 4xx` / `HTTP 5xx`	The remote server returned an error.	4xx usually indicates a permanent issue (page moved or removed). 5xx is usually transient — retry on the next pull.
`rate limited` / `429`	We were throttled by the scraper or the upstream site.	We retry rate-limited requests automatically. If you see this repeatedly, reduce the polling frequency.

If a pull keeps failing for the same reason, consider:

Editing the source URL or filters.
Lowering the polling frequency.
Removing the source if the underlying site no longer exists.
Contacting support so the site can be reviewed for compatibility.

File Upload Processing Errors

When you upload a file to a Content Store, the file is accepted immediately and then processed asynchronously. If processing fails, the content item shows a failed status on the Content tab. Hover over the red status badge to see the specific error message.

Error message	Cause	Recommended action
"This file could not be converted…"	The file format is not supported by the text extraction engine, or the file is corrupted.	Re-save the file in a supported format (PDF, DOCX, PPTX, XLSX, EPUB, MSG, HTML, CSV, JSON, XML, TXT, MD). For scanned PDFs, ensure text is selectable.
"No text could be extracted…"	The file was processed but contained no extractable text (e.g. an image-only PDF without OCR-readable text, or an empty document).	Open the file locally to verify it has readable text. For image-based PDFs, re-export with OCR enabled.
"An unexpected error occurred…"	An internal processing error.	Try uploading again. If the error persists, contact support.

Tips for successful uploads:

Ensure PDFs contain selectable text (not just scanned images).
Use standard document formats — exotic or proprietary formats may not be supported.
Keep individual files under 200 MB for reliable processing.
For image files (.png, .jpg, etc.), OCR is attempted automatically but works best with clear, high-contrast text.

Content Retention

Automatically remove old content from your knowledge base:

{
  retention_days: 90, // Delete content older than 90 days
}

Settings:

Null (default): Keep content indefinitely
Number: Delete content after N days

Use Cases:

GDPR compliance (automatic data deletion)
Keep only recent, relevant content
Control storage and search costs
Maintain fresh results

Example: Set to 30 days for news feeds, 365 days for documentation.
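
The retention rule amounts to a simple age check. The sketch below uses a hypothetical isExpired helper and assumes age is measured from when the content item was created; the page does not say which timestamp Seclai uses.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// null retention means "keep indefinitely".
function isExpired(createdAt: Date, retentionDays: number | null, now: Date): boolean {
  if (retentionDays === null) return false;
  return now.getTime() - createdAt.getTime() > retentionDays * DAY_MS;
}

const created = new Date("2024-01-01T00:00:00Z");
console.log(isExpired(created, 30, new Date("2024-02-01T00:00:00Z"))); // true (31 days old)
```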

Content Processing Pipeline

Phase 1: Fetching

Content is fetched from the source URL.

For RSS Feeds:

Parses feed XML
Extracts item metadata (title, author, date, GUID)
Fetches full article content from item URLs

For Websites:

Crawls pages following configured rules
Extracts main content
Removes navigation, ads, headers/footers
Stores HTML for processing

Phase 2: Transcription (Audio/Video)

Audio and video content is transcribed to text.

Supported Formats:

MP3, WAV, M4A, FLAC, OGG (audio)
MP4, MOV, AVI (video)

Features:

Automatic language detection
Speaker diarization
Timestamp preservation
Word-level confidence scores

Costs: Based on audio duration (minutes).

Phase 3: Indexing

Text content is chunked and embedded for vector search.

Process:

Text Extraction: Clean and prepare text
Chunking: Split into optimal segments (default: 1000 characters, 200 overlap)
Embedding: Generate vector embeddings using configured model
Storage: Store in PostgreSQL with pgvector

Costs: Text extraction is billed at 1 credit per second of processing time.

Configuration:

Chunk Size: Length of text segments (default: 1000 characters)
Chunk Overlap: Overlap between chunks (default: 200 characters)
Embedding Model: Model used for vector generation
Dimensions: Vector size (256-3584 based on model)
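
To show how chunk size and overlap relate, here is a minimal fixed-width chunker using the defaults above. It is a simplification: chunkText is a hypothetical helper that splits purely by character count, whereas the real pipeline may also split on separators. Each chunk starts chunkSize - chunkOverlap characters after the previous one, so consecutive chunks share chunkOverlap characters.

```typescript
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be between 0 and chunkSize - 1");
  }

  const stride = chunkSize - chunkOverlap; // distance between chunk starts
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stride) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

// A 2,500-character document yields chunks starting at 0, 800, and 1600.
const sample = "x".repeat(2500);
console.log(chunkText(sample).length); // 3
```

Smaller chunks with the same overlap ratio produce more embeddings per document, which is why the thorough index modes cost more than the fast ones.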

Phase 4: Linking

Content is linked to knowledge bases and triggers agents.

Actions:

Links content to all knowledge bases using the source
Triggers agents with "content_added" or "content_updated" triggers
Records usage costs for embeddings and processing
Updates content counts and metadata

Content Filtering

Control what content is extracted from websites and documents:

None

Extract all content without filtering.

{
  content_filter: "none",
}

Header, Footer, Navigation

Remove common page elements.

{
  content_filter: "header_footer_navigation",
}

Removes:

Navigation menus
Headers and footers
Sidebars
Copyright notices

Best For: Most website sources.

AI Main Content

Use AI to extract only the main content.

{
  content_filter: "ai_main_content",
}

Features:

Removes all non-content elements
Preserves article structure
Keeps images and media references

Best For: News sites, blogs, documentation.

AI Main Content (Skip Ads)

AI extraction with aggressive ad removal.

{
  content_filter: "ai_main_content_skip_ads",
}

Features:

Everything from AI Main Content
Removes advertisements
Removes promotional content
Removes sponsored sections

Best For: Sites with heavy advertising, news sites.

Note: Content filtering is only available for content stores.

Embedding Configuration

Each source can use different embedding models:

{
  embedding_model: "text-embedding-3-small",
  dimensions: 1536,
  chunk_size: 1000,
  chunk_overlap: 200
}

Available Models:

text-embedding-3-small: 512 or 1536 dimensions (recommended)
text-embedding-3-large: 256 or 3072 dimensions (highest quality)
amazon-nova-2-multimodal: 256 or 1024 dimensions
voyage-3: 512 or 1024 dimensions
And 200+ other models from OpenAI, Cohere, Voyage, Jina, Mistral

Considerations:

Higher dimensions: Better accuracy, higher costs
Lower dimensions: Faster search, lower costs
Model consistency: Use the same model across sources in a knowledge base for best results

Managing Sources

Update Source Settings

await seclai.sources.update(sourceId, {
  name: "Updated Name",
  polling: "weekly",
  retention_days: 180,
  polling_action: "new_and_updated",
});

Manual Polling

Trigger an immediate content update:

await seclai.sources.pull(sourceId, {
  seed: "latest_n",
  latest_n: 5,
});

View Source Content

Each source produces indexed content items (articles, files, documents). You can list, inspect, replace, and delete individual items via the API. See Contents for the full content management API.

const content = await seclai.sources.listContent(sourceId, {
  page: 1,
  limit: 20,
  sort: "created_at",
  order: "desc",
});

Monitor Processing Progress

const progress = await seclai.sources.getProgress(sourceId);

console.log(progress.total); // Total items
console.log(progress.completed); // Completed items
console.log(progress.phases); // Phase-by-phase breakdown

Phases:

fetching: Content retrieval from source
transcribing: Audio/video transcription
indexing: Embedding generation

Delete a Source

await seclai.sources.delete(sourceId);

Note: Deleting a source removes it from all knowledge bases and deletes all associated content.

Best Practices

Choose the Right Polling Frequency

Hourly: Only for critical, frequently updated sources
Daily: Default for most RSS feeds and blogs
Weekly: Documentation, newsletters, low-priority sources
Manually: Content stores, one-time imports

Why: Higher polling frequency increases costs and processing overhead.

Use Retention Policies

Set retention periods for time-sensitive content
Keep 30-90 days for news feeds
Keep 365+ days for documentation
Null (infinite) for permanent content

Why: Reduces storage costs and keeps results relevant.

Optimize Seeding

Start with "latest_n" (5-10 items) for testing
Expand to "full_history" once verified
Use "selected_items" for specific content

Why: Prevents expensive initial indexing of unwanted content.

Monitor Processing Costs

Check embedding costs before full import
Use smaller embedding dimensions for large sources
Limit polling_max_items for high-volume feeds

Why: Large sources can quickly accumulate costs.

Content Filtering

Use "ai_main_content_skip_ads" for news sites
Use "header_footer_navigation" for most websites
Use "none" only when you need full page context

Why: Improves search quality and reduces noise.

Crawl Configuration (Websites)

Start with crawl_depth: 2 or 3
Use URL filters to include/exclude paths
Enable sitemap crawling when available
Set reasonable crawl_limit (100-1000 pages)

Why: Prevents over-crawling and focuses on relevant content.

Limits and Quotas

Maximum sources per account: Plan-dependent
Polling frequency: At most hourly (no real-time polling)
Crawl depth: Maximum 5 levels
Content size: Maximum 10MB per item
Retention: Minimum 1 day

Check your plan limits in your account under Settings → Usage.

Source Connection Sharing

Public sources (RSS feeds, websites) are shared across accounts to optimize resources:

Source: Shared base entity with URL and type
Source Connection: Your personal connection with unique settings
Benefits: Reduced redundant polling, shared storage costs, faster initial setup

Private sources (content stores) are never shared between accounts.

Agent Integration

Sources can trigger agents automatically when content is added or updated:

// Create agent with content trigger
const agent = await seclai.agents.create({
  name: "Summarize New Posts",
  triggers: [
    {
      type: "content_added",
      knowledge_base_id: kbId,
    },
  ],
  steps: [
    /* ... */
  ],
});

Trigger Types:

content_added: Runs when new content is indexed
content_updated: Runs when content is reprocessed
content_added_or_updated: Runs on either event
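
The relationship between the three trigger types can be sketched as a small matching function. Trigger, ContentEventInfo, and shouldRun are hypothetical names used for illustration, and the sketch assumes a trigger only fires for events in its own knowledge base.

```typescript
type ContentEvent = "content_added" | "content_updated";
type TriggerType = ContentEvent | "content_added_or_updated";

interface Trigger {
  type: TriggerType;
  knowledge_base_id: string;
}

interface ContentEventInfo {
  event: ContentEvent;
  knowledge_base_id: string;
}

// True when the trigger should start an agent run for this event.
function shouldRun(trigger: Trigger, info: ContentEventInfo): boolean {
  if (trigger.knowledge_base_id !== info.knowledge_base_id) return false;
  return trigger.type === info.event || trigger.type === "content_added_or_updated";
}

const trigger: Trigger = { type: "content_added_or_updated", knowledge_base_id: "kb_1" };
console.log(shouldRun(trigger, { event: "content_updated", knowledge_base_id: "kb_1" })); // true
```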

Use Cases:

Generate summaries of new articles
Send notifications for important updates
Extract structured data from content
Categorize and tag content

Troubleshooting

Source Not Polling

Check:

Polling schedule is not set to "manually"
Source hasn't failed recently (check error status)
Account has active subscription
No maintenance mode active

Content Not Appearing

Possible Causes:

Still processing (check progress endpoint)
Content filtered out by seeding rules
Duplicate content (already indexed)
Content failed processing (check errors)

High Costs

Solutions:

Reduce polling frequency
Lower embedding dimensions
Set retention policies
Use polling_max_items limits
Switch to smaller embedding models

Crawl Not Finding Pages

Solutions:

Increase crawl_depth
Check URL filters aren't too restrictive
Enable sitemap crawling
Verify robots.txt isn't blocking
Check crawl_limit isn't too low

API Reference

For complete API documentation, see API Reference.
