daymade 2833aaec42 Fix PII: replace /Users/username/ with ~ and <username> placeholder
Replace hardcoded user paths that triggered gitleaks PII detection:
- /Users/username/ → ~/
- /Users/user/ → ~/
- -Users-username- → -Users-<username>- (normalized paths)

Also fix the sed example to use <home> placeholder instead of
regex pattern that would match actual usernames.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 13:36:50 +08:00


# Claude Code Session File Format
## Overview
Claude Code stores conversation history in JSONL (JSON Lines) format, where each line is a complete JSON object representing a message or event in the conversation.
## File Locations
### Session Files
```
~/.claude/projects/<normalized-project-path>/<session-id>.jsonl
```
**Path normalization**: The absolute project path is converted by replacing each `/` with `-`.
Example (with `~` standing for `/Users/<username>`):
- Project: `~/Workspace/js/myproject`
- Directory: `~/.claude/projects/-Users-<username>-Workspace-js-myproject/`
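
The normalization rule above can be sketched in a few lines. This is a minimal illustration that assumes `/` is the only character rewritten, as stated above; `normalize_project_path` and `session_dir` are hypothetical helper names, not part of Claude Code.

```python
from pathlib import Path


def normalize_project_path(project_path: str) -> str:
    """Replace every '/' with '-' (the only rule described in this document)."""
    return project_path.replace("/", "-")


def session_dir(project_path: str, claude_home: Path = Path.home() / ".claude") -> Path:
    """Build the directory that would hold the session files for a project.

    claude_home is a hypothetical parameter so the base directory can be overridden.
    """
    return claude_home / "projects" / normalize_project_path(project_path)
```

Pass an absolute path (expand `~` first, for example with `os.path.expanduser`), since the normalized name is derived from the full path.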
### File Types
| Pattern | Type | Description |
|---------|------|-------------|
| `<uuid>.jsonl` | Main session | User conversation sessions |
| `agent-<id>.jsonl` | Agent session | Sub-agent execution logs |
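
The two filename patterns in the table can be told apart with a small helper. This is a sketch under the assumption that main sessions are named with a standard UUID and agent sessions start with `agent-`; `classify_session_file` is a hypothetical name.

```python
import uuid
from pathlib import Path


def classify_session_file(path) -> str:
    """Return "main", "agent", or "unknown" based on the filename alone."""
    p = Path(path)
    if p.suffix != ".jsonl":
        return "unknown"
    if p.stem.startswith("agent-"):
        return "agent"
    try:
        uuid.UUID(p.stem)  # Raises ValueError if the stem is not a UUID
    except ValueError:
        return "unknown"
    return "main"
```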
## JSON Structure
### Message Object
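Each line in a JSONL file typically follows this structure (`"user" | "assistant"` denotes alternative values, and `[...]` stands for the content blocks described below):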
```json
{
  "role": "user" | "assistant",
  "message": {
    "role": "user" | "assistant",
    "content": [...]
  },
  "timestamp": "2025-11-26T00:00:00.000Z",
  "uuid": "message-uuid",
  "parentUuid": "parent-message-uuid",
  "sessionId": "session-uuid"
}
```
### Content Types
The `content` array contains different types of content blocks:
#### Text Content
```json
{
  "type": "text",
  "text": "Message text content"
}
```
#### Tool Use (Write)
```json
{
  "type": "tool_use",
  "name": "Write",
  "input": {
    "file_path": "/absolute/path/to/file.js",
    "content": "File content here..."
  }
}
```
#### Tool Use (Edit)
```json
{
  "type": "tool_use",
  "name": "Edit",
  "input": {
    "file_path": "/absolute/path/to/file.js",
    "old_string": "Original text",
    "new_string": "Replacement text",
    "replace_all": false
  }
}
```
#### Tool Use (Read)
```json
{
  "type": "tool_use",
  "name": "Read",
  "input": {
    "file_path": "/absolute/path/to/file.js",
    "offset": 0,
    "limit": 100
  }
}
}
```
#### Tool Use (Bash)
```json
{
  "type": "tool_use",
  "name": "Bash",
  "input": {
    "command": "ls -la",
    "description": "List files"
  }
}
```
### Tool Result
```json
{
  "type": "tool_result",
  "tool_use_id": "tool-use-uuid",
  "content": "Result content",
  "is_error": false
}
```
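
A tool result can be matched back to the call that produced it through `tool_use_id`. The sketch below assumes each `tool_use` block carries an `id` field holding that identifier; the examples above do not show one, so treat it as an assumption to verify against your own session files. `pair_tool_results` is a hypothetical helper name.

```python
def pair_tool_results(blocks):
    """Pair tool_use blocks with their tool_result blocks.

    blocks: a flat iterable of content blocks in conversation order.
    Assumes tool_use blocks have an "id" field (not shown in this document).
    """
    uses = {}
    pairs = []
    for block in blocks:
        kind = block.get("type")
        if kind == "tool_use" and "id" in block:
            uses[block["id"]] = block
        elif kind == "tool_result":
            use = uses.get(block.get("tool_use_id"))
            if use is not None:  # Ignore results whose call was never seen
                pairs.append((use, block))
    return pairs
```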
## Common Extraction Patterns
### Finding Write Operations
Look for assistant messages with `tool_use` type and `name: "Write"`:
```python
if item.get("type") == "tool_use" and item.get("name") == "Write":
    file_path = item["input"]["file_path"]
    content = item["input"]["content"]
```
### Finding Edit Operations
```python
if item.get("type") == "tool_use" and item.get("name") == "Edit":
    file_path = item["input"]["file_path"]
    old_string = item["input"]["old_string"]
    new_string = item["input"]["new_string"]
```
### Extracting Text Content
```python
texts = []  # Collect every text block instead of keeping only the last one
for item in message_content:
    if item.get("type") == "text":
        texts.append(item.get("text", ""))
```
## Field Locations
Due to schema variations, some fields may appear in different locations:
### Role Field
```python
role = data.get("role") or data.get("message", {}).get("role")
```
### Content Field
```python
content = data.get("content") or data.get("message", {}).get("content", [])
```
### Timestamp Field
```python
timestamp = data.get("timestamp", "")
```
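
The three lookups above can be folded into one helper so the rest of a script never has to care where a field lives. This is a minimal sketch; `Message` and `normalize` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Message(NamedTuple):
    role: str
    content: list
    timestamp: str


def normalize(data: dict) -> Message:
    """Read role, content, and timestamp from either known location."""
    message = data.get("message")
    nested = message if isinstance(message, dict) else {}
    role = data.get("role") or nested.get("role") or ""
    content = data.get("content") or nested.get("content") or []
    return Message(role, content, data.get("timestamp", ""))
```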
## Common Use Cases
### Recover Deleted Files
1. Search for `Write` tool calls with matching file path
2. Extract `input.content` from latest occurrence
3. Save to disk with original filename
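
The first two steps above might look like the following. This is a sketch that assumes lines appear in chronological order, so the last matching `Write` is the latest; `recover_file` is a hypothetical helper name.

```python
import json


def recover_file(lines, target_path):
    """Return the content of the latest Write to target_path, or None."""
    latest = None
    for line in lines:
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip malformed lines
        if not isinstance(data, dict):
            continue
        message = data.get("message")
        nested = message if isinstance(message, dict) else {}
        content = data.get("content") or nested.get("content") or []
        if not isinstance(content, list):
            continue
        for item in content:
            if not isinstance(item, dict):
                continue
            if item.get("type") == "tool_use" and item.get("name") == "Write":
                tool_input = item.get("input", {})
                if tool_input.get("file_path") == target_path:
                    latest = tool_input.get("content", "")  # Later writes win
    return latest
```

For step 3, the returned string can be written back with `pathlib.Path(...).write_text(...)`. An open file object can be passed as `lines`, which keeps the read streaming.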
### Track File Changes
1. Find all `Edit` and `Write` operations for a file
2. Build chronological list of changes
3. Reconstruct file history
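
One way to sketch these steps is to replay `Write` and `Edit` calls in order and record a snapshot after each. The replay semantics here are an assumption: an `Edit` replaces the first occurrence of `old_string`, or every occurrence when `replace_all` is true, and edits seen before any `Write` are skipped because the starting content is unknown. `file_history` is a hypothetical helper name.

```python
def file_history(records, target_path):
    """Return [(timestamp, tool_name, content_after_change), ...] for one file.

    records: parsed JSON objects in chronological order.
    """
    history = []
    current = None  # Unknown until the first Write is seen
    for data in records:
        message = data.get("message")
        nested = message if isinstance(message, dict) else {}
        content = data.get("content") or nested.get("content") or []
        if not isinstance(content, list):
            continue
        for item in content:
            if not isinstance(item, dict) or item.get("type") != "tool_use":
                continue
            tool_input = item.get("input", {})
            if tool_input.get("file_path") != target_path:
                continue
            name = item.get("name")
            if name == "Write":
                current = tool_input.get("content", "")
            elif name == "Edit" and current is not None:
                count = -1 if tool_input.get("replace_all") else 1
                current = current.replace(
                    tool_input.get("old_string", ""),
                    tool_input.get("new_string", ""),
                    count,
                )
            else:
                continue  # Read, Bash, or an Edit with no known base content
            history.append((data.get("timestamp", ""), name, current))
    return history
```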
### Search Conversations
1. Extract all `text` content from messages
2. Search for keywords or patterns
3. Return matching sessions
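
A minimal sketch of these steps follows. It takes a mapping of session IDs to iterables of lines, a shape chosen here for illustration so that open file objects can be passed directly; `text_blocks` and `search_sessions` are hypothetical names.

```python
import json


def text_blocks(line):
    """Yield the text of each text block in one JSONL line."""
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        return  # Malformed lines yield nothing
    if not isinstance(data, dict):
        return
    message = data.get("message")
    nested = message if isinstance(message, dict) else {}
    content = data.get("content") or nested.get("content") or []
    if not isinstance(content, list):
        return
    for item in content:
        if isinstance(item, dict) and item.get("type") == "text":
            yield item.get("text", "")


def search_sessions(sessions, keyword):
    """Return the IDs of sessions whose text mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    matches = []
    for session_id, lines in sessions.items():
        # any() stops at the first hit, so the rest of the session is not scanned
        if any(needle in text.lower() for line in lines for text in text_blocks(line)):
            matches.append(session_id)
    return matches
```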
### Analyze Tool Usage
1. Count occurrences of each tool type
2. Track which files were accessed
3. Generate usage statistics
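
These counts map naturally onto `collections.Counter`. The sketch below assumes already-parsed records and treats any `input.file_path` as a file access; `tool_usage` is a hypothetical helper name.

```python
from collections import Counter


def tool_usage(records):
    """Return (calls per tool name, accesses per file path) as two Counters."""
    tools = Counter()
    files = Counter()
    for data in records:
        message = data.get("message")
        nested = message if isinstance(message, dict) else {}
        content = data.get("content") or nested.get("content") or []
        if not isinstance(content, list):
            continue
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_use":
                tools[item.get("name", "unknown")] += 1
                file_path = item.get("input", {}).get("file_path")
                if file_path:  # Tools such as Bash have no file_path
                    files[file_path] += 1
    return tools, files
```

`tools.most_common(5)` then gives the five most frequently used tools.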
## Edge Cases
### Empty Content
Some messages may have empty content arrays:
```python
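for data in records:  # records: parsed JSON objects, one per line (placeholder name)
    content = data.get("content") or data.get("message", {}).get("content", [])
    if not content:
        continue  # Nothing to extract from this record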
```
### Missing Fields
Always use `.get()` with defaults:
```python
file_path = item.get("input", {}).get("file_path", "")
```
### JSON Decode Errors
Session files may contain malformed lines:
```python
import json

for line in f:  # f: an open session file
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        continue  # Skip malformed lines
```
### Large Files
Session files can be very large (over 100 MB), so process them line by line rather than reading the whole file at once:
```python
with open(session_file, "r", encoding="utf-8") as f:
    for line in f:  # Streams one line at a time, unlike f.read()
        process_line(line)  # process_line: placeholder for your per-line handler
```
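
The edge cases in this section can be handled together in one generator, so callers only ever see well-formed records. This is a sketch; `iter_records` is a hypothetical name, and it accepts any iterable of lines, including an open file object.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily yield one dict per valid JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Blank line
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            continue  # Malformed line
        if isinstance(data, dict):  # Ignore valid JSON that is not an object
            yield data
```

Because it is a generator, wrapping an open file (`iter_records(f)`) keeps memory use roughly constant regardless of file size.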
## Performance Tips
### Memory Efficiency
- Process files line-by-line (streaming)
- Don't load entire file into memory
- Use generators for large result sets
### Search Optimization
- Exit early once the keyword-match threshold is reached
- For case-insensitive search, lowercase the keywords once up front rather than on every comparison
- Use the `in` operator for plain substring matching
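
The three tips above combine into a short counting loop. This is an illustrative sketch; `count_keyword_hits` and its `threshold` parameter are hypothetical names.

```python
def count_keyword_hits(texts, keywords, threshold):
    """Count keyword matches across texts, stopping once threshold is reached."""
    needles = [k.lower() for k in keywords]  # Normalize keywords once
    hits = 0
    for text in texts:
        haystack = text.lower()  # Normalize each text once, not per keyword
        hits += sum(1 for needle in needles if needle in haystack)
        if hits >= threshold:
            return hits  # Early exit: remaining texts are never read
    return hits
```

Each keyword counts at most once per text, which is usually enough to decide whether a session is relevant.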
### Deduplication
When recovering files, keep only the latest version of each path:
```python
files_by_path = {}
for call in write_calls:  # write_calls: Write tool_use blocks in chronological order
    file_path = call["input"]["file_path"]
    files_by_path[file_path] = call  # Overwrites earlier versions
```
## Security Considerations
### Personal Information
Session files may contain:
- Absolute file paths with usernames
- API keys or credentials in code
- Company-specific information
- Private conversations
### Safe Sharing
Before sharing extracted content:
1. Remove absolute paths
2. Redact sensitive information
3. Use placeholders for usernames
4. Verify no credentials present
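
Steps 1 and 3 can be partly automated with a regular expression. The sketch below is a heuristic: it assumes home directories live under `/Users/` or `/home/`, and it also rewrites the normalized `-Users-<name>-` form, which can misfire on usernames containing hyphens or on unrelated hyphenated text. It does not detect credentials, so manual review is still needed. `redact_home_paths` is a hypothetical helper name.

```python
import re

# /Users/<name> or /home/<name> at the start of an absolute path
HOME_PATH = re.compile(r"/(?:Users|home)/[^/\s\"']+")
# The same prefix after '/' has been replaced with '-'
NORMALIZED_HOME = re.compile(r"-(Users|home)-[^-\s/\"']+")


def redact_home_paths(text: str) -> str:
    """Replace home-directory prefixes with '~' or a <username> placeholder."""
    text = HOME_PATH.sub("~", text)
    return NORMALIZED_HOME.sub(r"-\1-<username>", text)
```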