Add initial project structure with README, Cargo configuration, and main logic

- Create .gitignore to exclude target and output directories
- Initialize Cargo.toml with project metadata and dependencies
- Add README.md with project description, features, installation, and usage instructions
- Implement main.rs for extracting blog posts from HTML files and exporting to JSON and text formats
2025-12-11 21:22:31 +01:00
commit 56306a4852
5 changed files with 744 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,90 @@
+# Blog Extractor
+
+A Rust tool that extracts blog posts from HTML files and exports them in multiple formats (JSON and plain text).
+
+I just needed a simple script that would extract data from blogger.com.
+
+## Features
+
+- **HTML Parsing**: Extracts blog post data (title, date, content) from HTML files using CSS selectors
+- **JSON Export**: Saves each blog post as an individual JSON file for easy programmatic access
+- **Combined Text Export**: Generates a single text file containing all posts in a human-readable format
+- **Batch Processing**: Processes all HTML files in a directory in one run
+- **Error Handling**: Gracefully handles processing errors without stopping the entire operation
+
+## How It Works
+
+The tool scans the specified directory for HTML files and records the following for each file:
+
+- **Title**: From the `h3.post-title` element
+- **Date**: From the `time.published` element
+- **Content**: From the `.post-body.entry-content` element
+- **Source File**: The original filename for reference
+
+Each post is saved in two formats:
+
+1. Individual JSON files in `output/json/` with the same filename as the source HTML
+2. A combined text file at `output/all_posts_combined.txt` containing all posts
+
+## Installation
+
+Make sure you have Rust installed. If not, visit [rustup.rs](https://rustup.rs/).
+
+Clone the repository:
+
+```bash
+git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
+cd blog-extractor
+```
+
+## Usage
+
+Run the tool with a directory path containing HTML files:
+
+```bash
+cargo run -- <directory_path>
+```
+
+Example:
+
+```bash
+cargo run -- ./blog_html_files
+```
+
+### Output
+
+After running, you'll find:
+
+- `output/json/` - Individual JSON files for each blog post
+- `output/all_posts_combined.txt` - All posts combined in text format
+
+## Dependencies
+
+- **scraper** - HTML parsing and CSS selector support
+- **serde** - Serialization framework
+- **serde_json** - JSON serialization
+
+## Requirements
+
+The HTML files should contain:
+
+- A title in an `<h3 class="post-title">` element
+- A publish date in a `<time class="published">` element
+- Post content in a `<div class="post-body entry-content">` element
+
+## Project Structure
+
+```
+blog-extractor/
+├── src/
+│   └── main.rs              # Main application logic
+├── output/
+│   ├── json/                # Individual JSON files
+│   └── all_posts_combined.txt # Combined text file
+├── Cargo.toml               # Project configuration
+└── README.md                # This file
+```
+
+## License
+
+This project is open source and available under the MIT License.
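
Note on the "Batch Processing" and "Error Handling" features described in the README: `main.rs` is not shown in this part of the diff, so the following is only a minimal, std-only sketch of how such a loop might look, not the code in this commit. The function names `find_html_files`, `json_output_path`, and `process_all` are hypothetical, and the `.json` naming rule is an assumption about what "the same filename as the source HTML" means. The per-file extraction step (which the README attributes to `scraper` and `serde_json`) is passed in as a closure and left out.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect every `.html` file directly inside `dir`, sorted by path.
/// (Hypothetical helper; subdirectories are not searched.)
fn find_html_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_html = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("html"));
        if path.is_file() && is_html {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}

/// Map a source HTML path to `<output_dir>/json/<stem>.json`.
/// (Assumed naming rule; returns `None` if the path has no file name.)
fn json_output_path(output_dir: &Path, source: &Path) -> Option<PathBuf> {
    let mut name = source.file_stem()?.to_os_string();
    name.push(".json");
    Some(output_dir.join("json").join(name))
}

/// Run `process` on every file. A failure is logged and counted,
/// but it does not stop the remaining files from being processed.
/// Returns `(succeeded, failed)`.
fn process_all<F>(files: &[PathBuf], mut process: F) -> (usize, usize)
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut succeeded = 0;
    let mut failed = 0;
    for file in files {
        match process(file) {
            Ok(()) => succeeded += 1,
            Err(err) => {
                eprintln!("Skipping {}: {}", file.display(), err);
                failed += 1;
            }
        }
    }
    (succeeded, failed)
}
```

In a real binary, the directory passed via `cargo run -- <directory_path>` would be handed to `find_html_files`, and the closure given to `process_all` would parse one file and write its JSON output.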