Add comprehensive architecture design for thicket CLI

+332

1 changed file


ARCH.md


# Thicket Architecture Design

## Overview
Thicket is a modern CLI tool for persisting Atom/RSS feeds in a Git repository, designed to enable distributed weblog comment structures.

## Technology Stack

### Core Libraries

#### CLI Framework
- **Typer** (0.15.x) - Modern CLI framework with type hints
- **Rich** (13.x) - Beautiful terminal output, progress bars, and tables
- **prompt-toolkit** - Interactive prompts when needed

#### Feed Processing
- **feedparser** (6.0.11) - Universal feed parser supporting RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0
  - Alternative: **atoma** for stricter Atom/RSS parsing with JSON feed support
  - Alternative: **fastfeedparser** for high-performance parsing (10x faster)

#### Git Integration
- **GitPython** (3.1.44) - High-level git operations, requires git CLI
  - Alternative: **pygit2** (1.18.0) - Direct libgit2 bindings, better for authentication

#### HTTP Client
- **httpx** (0.28.x) - Modern async/sync HTTP client with connection pooling
- **aiohttp** (3.11.x) - For async-only operations if needed

#### Configuration & Data Models
- **pydantic** (2.11.x) - Data validation and settings management
- **pydantic-settings** (2.10.x) - Configuration file handling with env var support

#### Utilities
- **pendulum** (3.x) - Better datetime handling
- **bleach** (6.x) - HTML sanitization for feed content
- **platformdirs** (4.x) - Cross-platform directory paths

## Project Structure

```
thicket/
├── pyproject.toml              # Modern Python packaging
├── README.md                   # Project documentation
├── ARCH.md                     # This file
├── CLAUDE.md                   # Project instructions
├── .gitignore
├── src/
│   └── thicket/
│       ├── __init__.py
│       ├── __main__.py         # Entry point for `python -m thicket`
│       ├── cli/                # CLI commands and interface
│       │   ├── __init__.py
│       │   ├── main.py         # Main CLI app with Typer
│       │   ├── commands/       # Subcommands
│       │   │   ├── __init__.py
│       │   │   ├── init.py     # Initialize git store
│       │   │   ├── add.py      # Add feed to config
│       │   │   ├── sync.py     # Sync feeds
│       │   │   ├── list.py     # List users/feeds
│       │   │   └── search.py   # Search entries
│       │   └── utils.py        # CLI utilities (progress, formatting)
│       ├── core/               # Core business logic
│       │   ├── __init__.py
│       │   ├── feed_parser.py  # Feed parsing and normalization
│       │   ├── git_store.py    # Git repository operations
│       │   ├── cache.py        # Cache management
│       │   └── sanitizer.py    # Filename and HTML sanitization
│       ├── models/             # Pydantic data models
│       │   ├── __init__.py
│       │   ├── config.py       # Configuration models
│       │   ├── feed.py         # Feed/Entry models
│       │   └── user.py         # User metadata models
│       └── utils/              # Shared utilities
│           ├── __init__.py
│           ├── paths.py        # Path handling
│           └── network.py      # HTTP client wrapper
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # pytest configuration
│   ├── test_feed_parser.py
│   ├── test_git_store.py
│   └── fixtures/               # Test data
│       └── feeds/
└── docs/
    └── examples/               # Example configurations
```

## Data Models

### Configuration File (YAML/TOML)
```python
from pathlib import Path
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl
from pydantic_settings import BaseSettings, SettingsConfigDict


# Defined before ThicketConfig so the `users` annotation resolves
class UserConfig(BaseModel):
    username: str
    feeds: list[HttpUrl]
    email: Optional[EmailStr] = None
    homepage: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    display_name: Optional[str] = None


class ThicketConfig(BaseSettings):
    git_store: Path  # Git repository location
    cache_dir: Path  # Cache directory
    users: list[UserConfig]

    # Note: yaml_file only takes effect once a YamlConfigSettingsSource
    # is registered via settings_customise_sources
    model_config = SettingsConfigDict(
        env_prefix="THICKET_",
        env_file=".env",
        yaml_file="thicket.yaml"
    )
```

### Feed Storage Format
```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel, ConfigDict, HttpUrl


class AtomEntry(BaseModel):
    id: str  # Original Atom ID
    title: str
    link: HttpUrl
    updated: datetime
    # In Pydantic v2, Optional fields need an explicit default to be omittable
    published: Optional[datetime] = None
    summary: Optional[str] = None
    content: Optional[str] = None  # Full body content from Atom entry
    content_type: Optional[str] = "html"  # text, html, xhtml
    author: Optional[dict] = None
    categories: list[str] = []
    rights: Optional[str] = None  # Copyright info
    source: Optional[str] = None  # Source feed URL
    # Additional Atom fields preserved during RSS->Atom conversion

    # Pydantic v2 already serializes datetime as ISO 8601 and treats
    # json_encoders as deprecated; kept here as in the original design
    model_config = ConfigDict(
        json_encoders={
            datetime: lambda v: v.isoformat()
        }
    )


class DuplicateMap(BaseModel):
    """Maps duplicate entry IDs to canonical entry IDs"""
    duplicates: dict[str, str] = {}  # duplicate_id -> canonical_id
    comment: str = "Entry IDs that map to the same canonical content"

    def add_duplicate(self, duplicate_id: str, canonical_id: str) -> None:
        """Add a duplicate mapping"""
        self.duplicates[duplicate_id] = canonical_id

    def remove_duplicate(self, duplicate_id: str) -> bool:
        """Remove a duplicate mapping. Returns True if existed."""
        return self.duplicates.pop(duplicate_id, None) is not None

    def get_canonical(self, entry_id: str) -> str:
        """Get canonical ID for an entry (returns original if not duplicate)"""
        return self.duplicates.get(entry_id, entry_id)

    def is_duplicate(self, entry_id: str) -> bool:
        """Check if entry ID is marked as duplicate"""
        return entry_id in self.duplicates
```

## Git Repository Structure
```
git-store/
├── index.json           # User directory index
├── duplicates.json      # Manual curation of duplicate entries
├── user1/
│   ├── metadata.json    # User metadata
│   ├── entry_id_1.json  # Sanitized entry files
│   ├── entry_id_2.json
│   └── ...
└── user2/
    └── ...
```

## Key Design Decisions

### 1. Feed Normalization & Auto-Discovery
- All RSS feeds converted to Atom format before storage
- Preserves maximum metadata during conversion
- Sanitizes HTML content to prevent XSS
- **Auto-discovery**: Extracts user metadata from feed during `add user` command

### 2. ID Sanitization
- Consistent algorithm to convert Atom IDs to safe filenames
- Handles edge cases (very long IDs, special characters)
- Maintains reversibility where possible

### 3. Git Operations
- Uses GitPython for simplicity (no authentication required)
- Single main branch for all users and entries
- Atomic commits per sync operation
- Meaningful commit messages with feed update summaries
- Preserves complete history - never delete entries even if they disappear from feeds

### 4. Caching Strategy
- HTTP caching with Last-Modified/ETag support
- Local cache of parsed feeds with TTL
- Cache invalidation on configuration changes
- Git store serves as permanent historical archive beyond feed depth limits

### 5. Error Handling
- Graceful handling of feed parsing errors
- Retry logic for network failures
- Clear error messages with recovery suggestions

## CLI Command Structure

```bash
# Initialize a new git store
thicket init /path/to/store

# Add a user with feeds (auto-discovers metadata from feed)
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom"
# Auto-populates: email, homepage, icon, display_name from feed metadata

# Add a user with manual overrides
thicket add user "alyssa" \
  --feed "https://example.com/feed.atom" \
  --email "alyssa@example.com" \
  --homepage "https://alyssa.example.com" \
  --icon "https://example.com/avatar.png" \
  --display-name "Alyssa P. Hacker"

# Add additional feed to existing user
thicket add feed "alyssa" "https://example.com/other-feed.rss"

# Sync all feeds (designed for cron usage)
thicket sync --all

# Sync specific user
thicket sync --user alyssa

# List users and their feeds
thicket list users
thicket list feeds --user alyssa

# Search entries
thicket search "keyword" --user alyssa --since 2025-01-01

# Manage duplicate entries
thicket duplicates list
thicket duplicates add <entry_id_1> <entry_id_2>     # Mark as duplicates
thicket duplicates remove <entry_id_1> <entry_id_2>  # Unmark duplicates
```

## Performance Considerations

1. **Concurrent Feed Fetching**: Use httpx with asyncio for parallel downloads
2. **Incremental Updates**: Only fetch/parse feeds that have changed
3. **Efficient Git Operations**: Batch commits, use shallow clones where appropriate
4. **Progress Feedback**: Rich progress bars for long operations

## Security Considerations

1. **HTML Sanitization**: Use bleach to clean feed content
2. **URL Validation**: Strict validation of feed URLs
3. **Git Security**: No credentials stored in repository
4. **Path Traversal**: Careful sanitization of filenames

## Future Enhancements

1. **Web Interface**: Optional web UI for browsing the git store
2. **Webhooks**: Notify external services on feed updates
3. **Feed Discovery**: Auto-discover feeds from HTML pages
4. **Export Formats**: Generate static sites, OPML exports
5. **Federation**: P2P sync between thicket instances

## Requirements Clarification

**✓ Resolved Requirements:**
1. **Feed Update Frequency**: Designed for cron usage - no built-in scheduling needed
2. **Duplicate Handling**: Manual curation via `duplicates.json` file with CLI commands
3. **Git Branching**: Single main branch for all users and entries
4. **Authentication**: No feeds require authentication currently
5. **Content Storage**: Store complete Atom entry body content as provided
6. **Deleted Entries**: Preserve all entries in Git store permanently (historical archive)
7. **History Depth**: Git store maintains full history beyond feed depth limits
8. **Feed Auto-Discovery**: Extract user metadata from feed during `add user` command

## Duplicate Entry Management

### Duplicate Detection Strategy
- **Manual Curation**: Duplicates identified and managed manually via CLI
- **Storage**: `duplicates.json` file in Git root maps entry IDs to canonical entries
- **Structure**: `{"duplicate_id": "canonical_id", ...}`
- **CLI Commands**: Add/remove duplicate mappings with validation
- **Query Resolution**: Search/list commands resolve duplicates to canonical entries

### Duplicate File Format
```json
{
  "https://example.com/feed/entry/123": "https://canonical.com/posts/same-post",
  "https://mirror.com/articles/456": "https://canonical.com/posts/same-post",
  "comment": "Entry IDs that map to the same canonical content"
}
```

## Feed Metadata Auto-Discovery

### Extraction Strategy
When adding a new user with `thicket add user`, the system fetches and parses the feed to extract:

- **Display Name**: From `feed.author.name`, falling back to `feed.title`
- **Email**: From `feed.author.email` or `feed.managingEditor`
- **Homepage**: From `feed.author.uri`, falling back to `feed.link`
- **Icon**: From `feed.logo`, `feed.icon`, or `feed.image.url`

### Discovery Priority Order
1. **Author Information**: Prefer `feed.author.*` fields (more specific to person)
2. **Feed-Level**: Fall back to feed-level metadata
3. **Manual Override**: CLI flags always take precedence over discovered values
4. **Update Behavior**: Auto-discovery only runs during initial `add user`, not on sync

### Extracted Metadata Format
```python
from typing import Optional

from pydantic import BaseModel, EmailStr, HttpUrl

# UserConfig is the model defined under "Configuration File" above


class FeedMetadata(BaseModel):
    title: Optional[str] = None
    author_name: Optional[str] = None
    author_email: Optional[EmailStr] = None
    author_uri: Optional[HttpUrl] = None
    link: Optional[HttpUrl] = None
    logo: Optional[HttpUrl] = None
    icon: Optional[HttpUrl] = None
    image_url: Optional[HttpUrl] = None

    def to_user_config(self, username: str, feed_url: HttpUrl) -> UserConfig:
        """Convert discovered metadata to UserConfig with fallbacks"""
        return UserConfig(
            username=username,
            feeds=[feed_url],
            display_name=self.author_name or self.title,
            email=self.author_email,
            homepage=self.author_uri or self.link,
            icon=self.logo or self.icon or self.image_url
        )
```
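
The discovery priority order (CLI override first, then author-level fields, then feed-level fields) could be sketched as a small resolver. This is a minimal illustration only, not the project's implementation: the `CANDIDATES` table, the `resolve_user_fields` helper, and the use of plain dicts instead of the Pydantic models are all assumptions made to keep the example dependency-free.

```python
from typing import Optional

# Hypothetical lookup table: for each user field, the discovered keys to try,
# author-level first and feed-level after. Key names mirror FeedMetadata.
CANDIDATES: dict[str, tuple[str, ...]] = {
    "display_name": ("author_name", "title"),
    "email": ("author_email",),
    "homepage": ("author_uri", "link"),
    "icon": ("logo", "icon", "image_url"),
}


def resolve_user_fields(discovered: dict, overrides: dict) -> dict:
    """Merge CLI overrides with discovered feed metadata.

    An override that is not None always wins; otherwise the first non-empty
    discovered candidate is used; if nothing is found the field stays None.
    """
    resolved: dict[str, Optional[str]] = {}
    for field, sources in CANDIDATES.items():
        value = overrides.get(field)
        if value is None:
            # First truthy discovered value in priority order, else None
            value = next((discovered[s] for s in sources if discovered.get(s)), None)
        resolved[field] = value
    return resolved


if __name__ == "__main__":
    discovered = {
        "title": "Alyssa's Blog",
        "author_name": "Alyssa P. Hacker",
        "link": "https://example.com",
    }
    print(resolve_user_fields(discovered, {"homepage": "https://alyssa.example.com"}))
```

Keeping the priority order in one table means a change to the fallback rules touches a single place, and the CLI layer only has to pass through whichever flags the user actually supplied.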
