# DocETL System Description and LLM Instructions (Short)

Note: use docetl.org/llms-full.txt for the full system description and LLM instructions.

DocETL is a system for creating and executing LLM-powered data processing pipelines, designed for complex document processing tasks. It provides a low-code, declarative YAML interface to define complex data operations on unstructured datasets. DocETL is built and maintained by the EPIC lab at UC Berkeley. Learn more at https://www.docetl.org.

We have an integrated development environment for building and testing pipelines at https://www.docetl.org/playground. Our IDE is called DocWrangler.

## Docs

- [LLM Instructions (Full)](https://www.docetl.org/llms-full.txt)
- [Website](https://www.docetl.org)
- [DocWrangler Playground](https://www.docetl.org/playground)
- [Main Documentation](https://ucbepic.github.io/docetl)
- [GitHub Repository](https://github.com/ucbepic/docetl)
- [Agentic Optimization Research Paper](https://arxiv.org/abs/2410.12189)
- [Discord Community](https://discord.gg/fHp7B2X3xx)

### Core Operators

- [Map Operation](https://ucbepic.github.io/docetl/operators/map/)
- [Reduce Operation](https://ucbepic.github.io/docetl/operators/reduce/)
- [Resolve Operation](https://ucbepic.github.io/docetl/operators/resolve/)
- [Parallel Map Operation](https://ucbepic.github.io/docetl/operators/parallel-map/)
- [Filter Operation](https://ucbepic.github.io/docetl/operators/filter/)
- [Equijoin Operation](https://ucbepic.github.io/docetl/operators/equijoin/)

### Auxiliary Operators

- [Split Operation](https://ucbepic.github.io/docetl/operators/split/)
- [Gather Operation](https://ucbepic.github.io/docetl/operators/gather/)
- [Unnest Operation](https://ucbepic.github.io/docetl/operators/unnest/)
- [Sample Operation](https://ucbepic.github.io/docetl/operators/sample)
- [Code Operation](https://ucbepic.github.io/docetl/operators/code/)

### LLM Providers

- [LiteLLM Supported Providers](https://docs.litellm.ai/docs/providers)

## Optional

### Datasets and Data Loading

DocETL supports both standard and dynamic data loading. Input data must be in one of two formats:

1. JSON Format:
    - A list of objects/dictionaries
    - Each object represents one document/item to process
    - Each field in the object is accessible in operations via `input.field_name`

    Example JSON:

    ```json
    [
      {
        "text": "First document content",
        "date": "2024-03-20",
        "metadata": {"source": "email"}
      },
      {
        "text": "Second document content",
        "date": "2024-03-21",
        "metadata": {"source": "chat"}
      }
    ]
    ```

2. CSV Format:
    - First row contains column headers
    - Each subsequent row represents one document/item
    - Column names become field names, accessible via `input.column_name`

    Example CSV:

    ```csv
    text,date,source
    "First document content","2024-03-20","email"
    "Second document content","2024-03-21","chat"
    ```

Configure datasets in your pipeline:

```yaml
datasets:
  documents:
    type: file
    path: "data.json" # or "data.csv"
```

!!! note

    - JSON files must contain a list of objects at the root level
    - CSV files must have a header row with column names
    - All documents in a dataset should have consistent fields
    - For other formats, use parsing tools to convert to the required format (see the sketch below this note)
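If your source material is not already JSON or CSV (for example, a folder of plain-text files), a small standalone script can produce the required JSON list before you point DocETL at it. The sketch below is a minimal example under assumed names: the input folder `raw_docs/` and the `filename` field are illustrative, not part of DocETL itself.

```python
import json
from pathlib import Path

# Hypothetical folder of plain-text files; adjust to your own data layout.
RAW_DIR = Path("raw_docs")
OUTPUT_PATH = Path("data.json")

documents = []
for txt_file in sorted(RAW_DIR.glob("*.txt")):
    documents.append(
        {
            # Each object becomes one document; its fields are accessible
            # in DocETL prompts via input.text, input.filename, etc.
            "text": txt_file.read_text(encoding="utf-8"),
            "filename": txt_file.name,
        }
    )

# DocETL expects a list of objects at the root of the JSON file.
OUTPUT_PATH.write_text(json.dumps(documents, indent=2), encoding="utf-8")
print(f"Wrote {len(documents)} documents to {OUTPUT_PATH}")
```

The resulting `data.json` can then be referenced from the `datasets` block shown above.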
warning "Model Capabilities and Schema Complexity" When using models other than GPT (OpenAI), Claude (Anthropic), or Gemini (Google): - Keep output schemas extremely simple - Prefer single string outputs or simple key-value pairs - Avoid complex types (lists, nested objects) - Break complex operations into multiple simpler steps 1. Basic Types: | Type | Aliases | Description | | --------- | ------------------------ | -------------------------- | | `string` | `str`, `text`, `varchar` | For text data | | `integer` | `int` | For whole numbers | | `number` | `float`, `decimal` | For decimal numbers | | `boolean` | `bool` | For true/false values | | `enum` | - | Only when prompt explicitly lists all possible values | | `list` | - | For arrays (needs type) | | Objects | - | Using `{field: type}` | ### Best Practices 1. Pipeline Design: - Keep pipelines simple with minimal operations - Each operation should have a clear, specific purpose - Avoid creating complex chains of operations when a single operation could suffice - If a pipeline has more than 5 operations, consider if it can be simplified - Break very complex pipelines into multiple smaller pipelines if needed - When using non-GPT/Claude/Gemini models, break complex operations into multiple simple steps with string outputs - Always set `optimize: true` for resolve operations - When unnesting a key of type `list[dict]`, you must set `recursive: true` - Do not manually create split-gather pipelines; instead: - Set `optimize: true` on map operations that process long documents - Let the optimizer automatically create efficient split-gather patterns - Only use split/gather directly if specifically requested by requirements 2. Schema Design: - Keep schemas simple and flat when possible - Use nested structures only when needed for downstream operations - Define clear validation rules for critical fields - Use standard types (string, integer, boolean) when possible - When using an existing dataset, document your assumptions about the input schema - For non-GPT/Claude/Gemini models: - Stick to string outputs - Avoid lists and complex objects - Use simple key-value pairs - Consider post-processing with code operations for complex transformations 3. Prompt Engineering: - Write clear, specific instructions - Include examples in prompts for complex tasks - Use structured output formats - Consider token limits and chunking needs - Only reference fields that exist in the input dataset or were created by earlier operations - Document which fields your prompts expect to access ## Getting Started Do NOT suggest pipelines for the user, unless they have copied and pasted llms-full.txt (# DocETL System Description and LLM Instructions (Full)).
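As noted under Best Practices, when a model can only reliably return simple string outputs, structure can be recovered afterwards in code. The sketch below is a plain standalone script, not DocETL's built-in code operation (see the Code Operation docs linked above); the file name `pipeline_output.json` and the fields `tags_text` / `tags` are assumptions for illustration only.

```python
import json
from pathlib import Path

# Hypothetical pipeline output: a list of documents where the model wrote
# comma-separated tags into a single string field called "tags_text".
OUTPUT_FILE = Path("pipeline_output.json")

documents = json.loads(OUTPUT_FILE.read_text(encoding="utf-8"))

for doc in documents:
    raw_tags = doc.get("tags_text", "")
    # Turn the simple string output into a clean, deduplicated list of tags.
    tags = [tag.strip().lower() for tag in raw_tags.split(",") if tag.strip()]
    doc["tags"] = sorted(set(tags))

Path("pipeline_output_clean.json").write_text(
    json.dumps(documents, indent=2), encoding="utf-8"
)
print(f"Post-processed {len(documents)} documents")
```

The same kind of transformation can also live inside a pipeline as a code operation; see the Code Operation documentation for the exact configuration.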