# Repomix File Format Reference
This document provides comprehensive documentation of repomix output formats for accurate file extraction.
## Overview
Repomix can generate output in three formats:
1. **XML** (default) - Most common, uses XML tags
2. **Markdown** - Human-readable, uses markdown code blocks
3. **JSON** - Structured data format
## XML Format
### Structure
The XML format is the default and most common repomix output:
```xml
[Summary and metadata about the packed repository]
[Text-based directory tree visualization]
content of file1
content of file2
```
### File Block Pattern
Each file is enclosed in a `` tag with a `path` attribute:
```xml
#!/usr/bin/env python3
def main():
print("Hello, world!")
if __name__ == "__main__":
main()
```
### Key Characteristics
- File path is in the `path` attribute (relative path)
- Content starts on the line after the opening tag
- Content ends on the line before the closing tag
- No leading/trailing blank lines in content (content is trimmed)
### Extraction Pattern
The unmixing script uses this regex pattern:
```python
r'\n(.*?)\n'
```
**Pattern breakdown:**
- `` - Captures the file path from the path attribute
- `\n` - Expects a newline after opening tag
- `(.*?)` - Captures file content (non-greedy, allows multiline)
- `\n` - Expects newline before closing tag
## Markdown Format
### Structure
The Markdown format uses code blocks to delimit file content:
````markdown
# Repository Summary
[Summary content]
## Directory Structure
```
directory/
file1.txt
file2.txt
```
## Files
### File: relative/path/to/file1.ext
```python
# File content here
def example():
pass
```
### File: relative/path/to/file2.ext
```javascript
// Another file
console.log("Hello");
```
````
### File Block Pattern
Each file uses a level-3 heading with "File:" prefix and code block:
````markdown
### File: src/main.py
```python
#!/usr/bin/env python3
def main():
print("Hello, world!")
```
````
### Key Characteristics
- File path follows "### File: " heading
- Content is within a code block (triple backticks)
- Language hint may be included after opening backticks
- Content preserves original formatting
### Extraction Pattern
```python
r'## File: ([^\n]+)\n```[^\n]*\n(.*?)\n```'
```
**Pattern breakdown:**
- `## File: ([^\n]+)` - Captures file path from heading
- `\n` - Newline after heading
- ``` `[^\n]*` ``` - Opening code block with optional language
- `\n(.*?)\n` - Captures content between backticks
- ``` ` ``` ``` - Closing backticks
## JSON Format
### Structure
The JSON format provides structured data:
```json
{
"metadata": {
"repository": "owner/repo",
"timestamp": "2025-10-22T19:00:00Z"
},
"directoryStructure": "directory/\n file1.txt\n file2.txt\n",
"files": [
{
"path": "relative/path/to/file1.ext",
"content": "content of file1\n"
},
{
"path": "relative/path/to/file2.ext",
"content": "content of file2\n"
}
]
}
```
### File Entry Structure
Each file is an object in the `files` array:
```json
{
"path": "src/main.py",
"content": "#!/usr/bin/env python3\n\ndef main():\n print(\"Hello, world!\")\n\nif __name__ == \"__main__\":\n main()\n"
}
```
### Key Characteristics
- Files are in a `files` array
- Each file has `path` and `content` keys
- Content includes literal `\n` for newlines
- Content is JSON-escaped (quotes, backslashes)
### Extraction Approach
```python
data = json.loads(content)
files = data.get('files', [])
for file_entry in files:
file_path = file_entry.get('path')
file_content = file_entry.get('content', '')
```
## Format Detection
### Detection Logic
The unmixing script auto-detects format using these checks:
1. **XML**: Contains ``
2. **JSON**: Starts with `{` and contains `"files"`
3. **Markdown**: Contains `## File:`
### Detection Priority
1. Check XML markers first (most common)
2. Check JSON structure second
3. Check Markdown markers last
4. Return `None` if no format matches
### Example Detection Code
```python
def detect_format(content):
if '' in content:
return 'xml'
if content.strip().startswith('{') and '"files"' in content:
return 'json'
if '## File:' in content:
return 'markdown'
return None
```
## File Path Encoding
### Relative Paths
All file paths in repomix output are relative to the repository root:
```
src/components/Header.tsx
docs/README.md
package.json
```
### Special Characters
File paths may contain:
- Spaces: `"My Documents/file.txt"`
- Hyphens: `"some-file.md"`
- Underscores: `"my_script.py"`
- Dots: `"config.local.json"`
Paths are preserved exactly as they appear in the original repository.
### Directory Separators
- Always forward slashes (`/`) regardless of platform
- No leading slash (relative paths)
- No trailing slash for files
## Content Encoding
### Character Encoding
All formats use **UTF-8** encoding for both the container file and extracted content.
### Special Characters
- **XML**: Content may contain XML-escaped characters (`<`, `>`, `&`)
- **Markdown**: Content is plain text within code blocks
- **JSON**: Content uses JSON string escaping (`\"`, `\\`, `\n`)
### Line Endings
- Original line endings are preserved
- May be `\n` (Unix), `\r\n` (Windows), or `\r` (old Mac)
- Extraction preserves original endings
## Edge Cases
### Empty Files
**XML:**
```xml
```
**Markdown:**
````markdown
### File: empty.txt
```
```
````
**JSON:**
```json
{"path": "empty.txt", "content": ""}
```
### Binary Files
Binary files are typically **not included** in repomix output. The directory structure may list them, but they won't have content blocks.
### Large Files
Some repomix configurations may truncate or exclude large files. Check the file summary section for exclusion notes.
## Version Differences
### Repomix v1.x
- Uses XML format by default
- File blocks have consistent structure
- No automatic format version marker
### Repomix v2.x
- Adds JSON and Markdown format support
- May include version metadata in output
- Maintains backward compatibility with v1 XML
## Validation
### Successful Extraction Indicators
After extraction, verify:
1. **File count** matches expected number
2. **Directory structure** matches the `` section
3. **Content integrity** - spot-check a few files
4. **No empty directories** unless explicitly included
### Common Format Issues
**Issue**: Files not extracted
- **Cause**: Format pattern mismatch
- **Solution**: Check format manually, verify repomix version
**Issue**: Partial content extraction
- **Cause**: Incorrect regex pattern (too greedy or not greedy enough)
- **Solution**: Check for nested tags or malformed blocks
**Issue**: Encoding errors
- **Cause**: Non-UTF-8 content in repomix file
- **Solution**: Verify source file encoding
## Examples
### Complete XML Example
```xml
This is a packed repository.
my-skill/
SKILL.md
scripts/
helper.py
---
name: my-skill
description: Example skill
---
# My Skill
This is an example.
#!/usr/bin/env python3
def help():
print("Helping!")
```
### Complete Markdown Example
````markdown
# Repository: my-skill
## Directory Structure
```
my-skill/
SKILL.md
scripts/
helper.py
```
## Files
### File: my-skill/SKILL.md
```markdown
---
name: my-skill
description: Example skill
---
# My Skill
This is an example.
```
### File: my-skill/scripts/helper.py
```python
#!/usr/bin/env python3
def help():
print("Helping!")
```
````
### Complete JSON Example
```json
{
"metadata": {
"repository": "my-skill"
},
"directoryStructure": "my-skill/\n SKILL.md\n scripts/\n helper.py\n",
"files": [
{
"path": "my-skill/SKILL.md",
"content": "---\nname: my-skill\ndescription: Example skill\n---\n\n# My Skill\n\nThis is an example.\n"
},
{
"path": "my-skill/scripts/helper.py",
"content": "#!/usr/bin/env python3\n\ndef help():\n print(\"Helping!\")\n"
}
]
}
```
## References
- Repomix documentation: https://github.com/yamadashy/repomix
- Repomix output examples: Check the repomix repository for sample outputs
- XML specification: https://www.w3.org/XML/
- JSON specification: https://www.json.org/