Blog Extractor
A Rust tool that extracts blog posts from HTML files and exports them in multiple formats (JSON and plain text).
I just needed a simple script to extract data from blogger.com.
Features
- HTML Parsing: Extracts blog post data (title, date, content) from HTML files using CSS selectors
- JSON Export: Saves each blog post as an individual JSON file for easy programmatic access
- Combined Text Export: Generates a single text file containing all posts in a human-readable format
- Batch Processing: Processes all HTML files in a directory in one run
- Error Handling: Gracefully handles processing errors without stopping the entire operation
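The batch-processing and error-handling behaviour listed above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in main.rs: `collect_html_files` and `process_all` are hypothetical names, and matching files by a case-insensitive `.html` extension is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect every `.html` file directly inside `dir` (non-recursive), sorted by path.
/// Hypothetical helper; the extension check is case-insensitive by assumption.
fn collect_html_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_html = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("html"));
        if path.is_file() && is_html {
            files.push(path);
        }
    }
    files.sort();
    Ok(files)
}

/// Run `process` on each file. A failure is reported on stderr and counted,
/// but it does not stop the remaining files from being processed.
/// Returns `(succeeded, failed)`.
fn process_all<F>(files: &[PathBuf], mut process: F) -> (usize, usize)
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut succeeded = 0;
    let mut failed = 0;
    for file in files {
        match process(file.as_path()) {
            Ok(()) => succeeded += 1,
            Err(err) => {
                eprintln!("Skipping {}: {}", file.display(), err);
                failed += 1;
            }
        }
    }
    (succeeded, failed)
}
```

A caller would pass the directory to `collect_html_files` and then hand the resulting list to `process_all` together with a closure that extracts and saves a single post.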
How It Works
The tool scans a specified directory for HTML files and extracts:
- Title: From the `h3.post-title` element
- Date: From the `time.published` element
- Content: From the `.post-body.entry-content` element
- Source File: The original filename for reference
Each post is saved in two formats:
- Individual JSON files in `output/json/` with the same filename as the source HTML
- A combined text file at `output/all_posts_combined.txt` containing all posts
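The output layout described above might be derived along these lines. This sketch uses only the standard library and rests on two assumptions: that "same filename" means the source file's stem with a `.json` extension, and that each entry in the combined text file is a few labelled lines followed by the content. `json_output_path` and `render_post` are hypothetical names.

```rust
use std::path::{Path, PathBuf};

/// Map a source HTML path to its JSON output path,
/// e.g. `posts/first-post.html` -> `output/json/first-post.json`.
/// Falls back to the stem "post" if the path has no file name.
fn json_output_path(source: &Path) -> PathBuf {
    let stem = source
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
        .unwrap_or_else(|| "post".to_string());
    Path::new("output").join("json").join(format!("{stem}.json"))
}

/// Render one post as an entry of the combined text file.
/// The exact layout here is an assumption, not the tool's real format.
fn render_post(title: &str, date: &str, content: &str, source_file: &str) -> String {
    format!("Title: {title}\nDate: {date}\nSource: {source_file}\n\n{content}\n")
}
```

Building the file name from the stem, rather than swapping the extension on the full path, keeps names that contain extra dots (such as `my.post.html`) intact.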
Installation
Make sure you have Rust installed. If not, visit rustup.rs.
Clone the repository:
git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
cd blog-extractor
Usage
Run the tool with a directory path containing HTML files:
cargo run -- <directory_path>
Example:
cargo run -- ./blog_html_files
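For reference, reading the `<directory_path>` argument could look something like the sketch below. It is a standard-library-only illustration; `parse_directory_arg` and the wording of the usage message are assumptions, not taken from main.rs.

```rust
use std::path::PathBuf;

/// Pull the directory path out of the command-line arguments.
/// `args` is expected to start with the program name, as `std::env::args()` does.
fn parse_directory_arg<I>(mut args: I) -> Result<PathBuf, String>
where
    I: Iterator<Item = String>,
{
    // Use the real program name in the usage message when it is available.
    let program = args.next().unwrap_or_else(|| "blog-extractor".to_string());
    match args.next() {
        Some(dir) => Ok(PathBuf::from(dir)),
        None => Err(format!("Usage: {program} <directory_path>")),
    }
}
```

In `main`, this would be called as `parse_directory_arg(std::env::args())`, printing the error and exiting when no directory is given.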
Output
After running, you'll find:
- `output/json/` - Individual JSON files for each blog post
- `output/all_posts_combined.txt` - All posts combined in text format
Dependencies
- scraper - HTML parsing and CSS selector support
- serde - Serialization framework
- serde_json - JSON serialization
Requirements
The HTML files should contain:
- A title in an `<h3 class="post-title">` element
- A publish date in a `<time class="published">` element
- Post content in a `<div class="post-body entry-content">` element
Project Structure
blog-extractor/
├── src/
│ └── main.rs # Main application logic
├── output/
│ ├── json/ # Individual JSON files
│ └── all_posts_combined.txt # Combined text file
├── Cargo.toml # Project configuration
└── README.md # This file
License
This project is open source and available under the MIT License.