
Gabriel Kaszewski 56306a4852 Add initial project structure with README, Cargo configuration, and main logic

- Create .gitignore to exclude target and output directories
- Initialize Cargo.toml with project metadata and dependencies
- Add README.md with project description, features, installation, and usage instructions
- Implement main.rs for extracting blog posts from HTML files and exporting to JSON and text formats

2025-12-11 21:22:31 +01:00

2.5 KiB


Blog Extractor

A Rust tool that extracts blog posts from HTML files and exports them in multiple formats (JSON and plain text).

I just needed a simple script to extract data from blogger.com.

Features

HTML Parsing: Extracts blog post data (title, date, content) from HTML files using CSS selectors
JSON Export: Saves each blog post as an individual JSON file for easy programmatic access
Combined Text Export: Generates a single text file containing all posts in a human-readable format
Batch Processing: Processes all HTML files in a directory in one run
Error Handling: A failure while processing one file does not stop the rest of the batch
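
The batch processing and error handling described above could look roughly like the sketch below. It is not taken from main.rs: the names `is_html` and `process_batch`, the case-insensitive `.html` check, and the use of `String` as the error type are all assumptions made for illustration. The idea is that each file's result is inspected on its own, so one failure is recorded and the loop carries on.

```rust
use std::path::{Path, PathBuf};

/// Returns true for paths with an `.html` extension.
/// Assumption: the match is case-insensitive; the real tool may differ.
pub fn is_html(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.eq_ignore_ascii_case("html"))
        .unwrap_or(false)
}

/// Runs `process` on every HTML path in `paths`.
/// A failure is recorded together with its path instead of being propagated,
/// so the remaining files are still processed.
/// Returns the number of successes and the list of failures.
pub fn process_batch<F>(paths: &[PathBuf], mut process: F) -> (usize, Vec<(PathBuf, String)>)
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut succeeded = 0;
    let mut failed = Vec::new();
    for path in paths.iter().filter(|p| is_html(p.as_path())) {
        match process(path.as_path()) {
            Ok(()) => succeeded += 1,
            Err(message) => failed.push((path.clone(), message)),
        }
    }
    (succeeded, failed)
}
```

A caller would pass the per-file extraction step as the closure and print the collected failures at the end of the run.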

How It Works

The tool scans a specified directory for HTML files and extracts:

Title: From the h3.post-title element
Date: From the time.published element
Content: From the .post-body.entry-content element
Source File: The original filename for reference
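
As a rough illustration of the four fields listed above, a post could be modelled with a struct like the one below. This is a sketch under assumptions, not the definition used in main.rs: the struct name `BlogPost`, the field names, the constant names, and the whitespace trimming in `new` are all hypothetical. Only the selector strings come from this README. The actual tool also serializes posts with serde, which is left out here to keep the example dependency-free.

```rust
use std::path::Path;

/// CSS selectors named in this README for each extracted field.
pub const TITLE_SELECTOR: &str = "h3.post-title";
pub const DATE_SELECTOR: &str = "time.published";
pub const CONTENT_SELECTOR: &str = ".post-body.entry-content";

/// One extracted post. Names are illustrative, not taken from main.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct BlogPost {
    pub title: String,
    pub date: String,
    pub content: String,
    pub source_file: String,
}

impl BlogPost {
    /// Builds a post from raw extracted text.
    /// Trims surrounding whitespace (an assumption) and keeps only the
    /// file name of the source path for reference.
    pub fn new(title: &str, date: &str, content: &str, source: &Path) -> Self {
        let source_file = source
            .file_name()
            .map(|name| name.to_string_lossy().into_owned())
            .unwrap_or_default();
        BlogPost {
            title: title.trim().to_string(),
            date: date.trim().to_string(),
            content: content.trim().to_string(),
            source_file,
        }
    }
}
```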

Each post is saved in two formats:

Individual JSON files in output/json/ with the same filename as the source HTML
A combined text file at output/all_posts_combined.txt containing all posts
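
The two output formats can be sketched as follows. Both helpers are hypothetical: `json_output_path` assumes that "the same filename" means the source file's stem with a `.json` extension, and `combined_entry` invents a plain-text layout purely for illustration, since the exact layout of all_posts_combined.txt is not described here.

```rust
use std::path::{Path, PathBuf};

/// Maps a source HTML path to its JSON output path under `output/json/`.
/// Assumption: the output keeps the source file stem and swaps the
/// extension for `.json`.
pub fn json_output_path(source: &Path) -> PathBuf {
    let stem = source
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
        .unwrap_or_default();
    PathBuf::from("output/json").join(format!("{}.json", stem))
}

/// Renders one post as a block for the combined text file.
/// The layout (labels, blank line, separator) is invented for this sketch.
pub fn combined_entry(title: &str, date: &str, content: &str, source_file: &str) -> String {
    format!(
        "Title: {}\nDate: {}\nSource: {}\n\n{}\n\n{}\n",
        title,
        date,
        source_file,
        content,
        "=".repeat(40)
    )
}
```

Appending the result of `combined_entry` for every post to one `String` and writing it once at the end would produce the single combined file.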

Installation

Make sure you have Rust installed. If not, visit rustup.rs.

Clone the repository:

git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
cd blog-extractor

Usage

Run the tool with a directory path containing HTML files:

cargo run -- <directory_path>

Example:

cargo run -- ./blog_html_files

Output

After running, you'll find:

output/json/ - Individual JSON files for each blog post
output/all_posts_combined.txt - All posts combined in text format

Dependencies

scraper - HTML parsing and CSS selector support
serde - Serialization framework
serde_json - JSON serialization

Requirements

The HTML files should contain:

A title in an <h3 class="post-title"> element
A publish date in a <time class="published"> element
Post content in a <div class="post-body entry-content"> element

Project Structure

blog-extractor/
├── src/
│   └── main.rs              # Main application logic
├── output/
│   ├── json/                # Individual JSON files
│   └── all_posts_combined.txt # Combined text file
├── Cargo.toml               # Project configuration
└── README.md                # This file

License

This project is open source and available under the MIT License.
