# Blog Extractor

A Rust tool that extracts blog posts from HTML files and exports them in multiple formats (JSON and plain text).

I just needed a simple script that would extract data from blogger.com.

## Features

- **HTML Parsing**: Extracts blog post data (title, date, content) from HTML files using CSS selectors
- **JSON Export**: Saves each blog post as an individual JSON file for easy programmatic access
- **Combined Text Export**: Generates a single text file containing all posts in a human-readable format
- **Batch Processing**: Processes all HTML files in a directory in one run
- **Error Handling**: Gracefully handles processing errors without stopping the entire operation

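
As a rough illustration of the batch-processing and error-handling behaviour listed above, the sketch below keeps going when a single file fails and records the failure instead of aborting. It is not taken from `main.rs`; `BatchReport`, `process_batch`, and the `String` error type are hypothetical names chosen for this example.

```rust
use std::path::{Path, PathBuf};

/// Outcome of a batch run: how many files succeeded and which ones failed.
/// (Hypothetical type, used only for this sketch.)
#[derive(Debug, Default)]
pub struct BatchReport {
    pub processed: usize,
    pub failed: Vec<(PathBuf, String)>,
}

/// Runs `process_one` on every path. A failure is logged to stderr and
/// recorded in the report, and the loop then continues with the next file.
pub fn process_batch<F>(paths: &[PathBuf], mut process_one: F) -> BatchReport
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut report = BatchReport::default();
    for path in paths {
        match process_one(path) {
            Ok(()) => report.processed += 1,
            Err(reason) => {
                eprintln!("Skipping {}: {}", path.display(), reason);
                report.failed.push((path.clone(), reason));
            }
        }
    }
    report
}
```

The closure passed as `process_one` would wrap whatever per-file work is needed (parse, extract, write), so one malformed file only adds an entry to `failed`.
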
## How It Works

The tool scans a specified directory for HTML files and extracts:

- **Title**: From the `h3.post-title` element
- **Date**: From the `time.published` element
- **Content**: From the `.post-body.entry-content` element
- **Source File**: The original filename for reference

Each post is saved in two formats:

1. Individual JSON files in `output/json/` with the same filename as the source HTML
2. A combined text file at `output/all_posts_combined.txt` containing all posts

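
The first step described in this section, scanning a directory for HTML files, could be sketched with the standard library alone. This is an illustration rather than the code in `main.rs`: `find_html_files` is a hypothetical helper, and it assumes a non-recursive scan that matches the `.html` extension case-insensitively.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns the HTML files directly inside `dir`, sorted by path.
/// Assumptions for this sketch: subdirectories are not visited, and any
/// file whose extension is "html" (in any letter case) counts as HTML.
pub fn find_html_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_html = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("html"));
        if path.is_file() && is_html {
            files.push(path);
        }
    }
    // Sort so that the processing order is stable between runs.
    files.sort();
    Ok(files)
}
```

Each returned path would then be read and parsed with the selectors listed above.
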
## Installation

Make sure you have Rust installed. If not, visit [rustup.rs](https://rustup.rs/).

Clone the repository:

```bash
git clone https://git.gabrielkaszewski.dev/GKaszewski/blog-extractor.git
cd blog-extractor
```

## Usage

Run the tool with a directory path containing HTML files:

```bash
cargo run -- <directory_path>
```

Example:

```bash
cargo run -- ./blog_html_files
```

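
For readers curious how the single `<directory_path>` argument might be read, here is a minimal sketch using only the standard library. `parse_input_dir` is a hypothetical helper, the fallback program name `blog-extractor` is assumed from the repository name, and rejecting extra arguments is a choice made for this example, not documented behaviour of the tool.

```rust
use std::path::PathBuf;

/// Reads the input directory from a command-line argument list.
/// Expects the program name followed by exactly one argument; anything
/// else produces a usage message as the error.
pub fn parse_input_dir<I>(mut args: I) -> Result<PathBuf, String>
where
    I: Iterator<Item = String>,
{
    // The first item is conventionally the program name.
    let program = args
        .next()
        .unwrap_or_else(|| "blog-extractor".to_string());
    match (args.next(), args.next()) {
        (Some(dir), None) => Ok(PathBuf::from(dir)),
        _ => Err(format!("Usage: {} <directory_path>", program)),
    }
}
```

In a real `main`, this would be called as `parse_input_dir(std::env::args())`, printing the error and exiting when it returns `Err`.
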
### Output

After running, you'll find:

- `output/json/` - Individual JSON files for each blog post
- `output/all_posts_combined.txt` - All posts combined in text format

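
The exact naming scheme for the JSON files is not spelled out here. The sketch below assumes each source file keeps its name with the extension swapped to `.json` and is placed under `output/json/`. `json_output_path` is a hypothetical helper, not a function from `main.rs`.

```rust
use std::path::{Path, PathBuf};

/// Maps a source HTML path to its JSON output path under `output/json/`.
/// Assumption for this sketch: the file name is kept and only the
/// extension changes. Returns `None` if `source` has no file name.
pub fn json_output_path(source: &Path) -> Option<PathBuf> {
    let file_name = source.file_name()?;
    Some(
        Path::new("output")
            .join("json")
            .join(file_name)
            .with_extension("json"),
    )
}
```

Under that assumption, `blog_html_files/my-post.html` would map to `output/json/my-post.json`.
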
## Dependencies

- **scraper** - HTML parsing and CSS selector support
- **serde** - Serialization framework
- **serde_json** - JSON serialization

## Requirements

The HTML files should contain:

- A title in an `<h3 class="post-title">` element
- A publish date in a `<time class="published">` element
- Post content in a `<div class="post-body entry-content">` element

## Project Structure

```
blog-extractor/
├── src/
│   └── main.rs                  # Main application logic
├── output/
│   ├── json/                    # Individual JSON files
│   └── all_posts_combined.txt   # Combined text file
├── Cargo.toml                   # Project configuration
└── README.md                    # This file
```

## License

This project is open source and available under the MIT License.