Hoto - extract tags and metadata from HTML and MAFF

Would you like to have nice names for your stored web pages?

I just published hoto.

Install Hoto

Extract HTML tags and metadata, optionally rename files. Supports MAFF as used by WebScrapbook.

Features

  • Extract HTML tags
    • h1
    • title
    • .authorname
  • Rename HTML and MAFF files using HTML tags and metadata
  • Use CSS selectors, just like jQuery
  • Extract MAFF files
    • Supports MAFF (Mozilla Archive Format) archives created by the WebScrapbook Firefox addon
    • Supports index.rdf metadata
      • archive date - automatically converted to calendar.txt format
      • original host
    • Transparently extracts HTML and RDF from compressed MAFF
  • Run almost any Python code when renaming (using f-string format)
  • Replace HTML tag content with regular expressions
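
The find and replace feature can be pictured as a plain regular expression substitution on the extracted tag text. This is only a sketch: the parameter names find and replace come from the sel('h1',find='Tero',replace='Someone') suggestion shown in the Quickstart, but the helper name find_replace is made up for illustration and hoto's real implementation may differ.

```python
import re


def find_replace(text, find, replace):
    """Replace every match of the regular expression `find` in `text`.

    Hypothetical helper: it mirrors the find=/replace= parameters shown
    in hoto's suggestions, not hoto's actual code.
    """
    return re.sub(find, replace, text)


# Literal match, as in sel('h1',find='Tero',replace='Someone')
print(find_replace("Tero Karvinen", "Tero", "Someone"))

# Regex match: cut everything from the first " - " to the end of the title
print(find_replace("Tero Karvinen - Learn Free software with me", r"\s+-\s+.*$", ""))
```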

Quickstart

$ wget terokarvinen.com
...'index.html' saved

$ ./hoto.py index.html
Tero Karvinen.html

$ ./hoto.py index.html --suggest|grep -v "not found"
##  index.html
Tero Karvinen - sel.h1
Tero Karvinen - sel('h1:first')
Tero Karvinen - Learn Free software with me - sel.title
Python weppipalvelu - ideasta tuotantoon - sel('h2:first')
Someone Karvinen - sel('h1',find='Tero',replace='Someone')
index.html - path
.html - path.suffix
index.html - path.name
Build your own robots, hack computers (legally) and admin Linux boxes - hundreds of them! - sel.__description
Tero Karvinen - Learn Free software with me - title
html - ext
Tero Karvinen - h1
index.html - filename
index - stem

$ ./hoto.py index.html --rename
$ ls
hoto.py  'Tero Karvinen.html'
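
What --rename does in the session above can be sketched with pathlib. The function name rename_file, the no_action parameter and the replacement of "/" in the new name are assumptions made for this illustration; hoto.py itself may handle these details differently.

```python
from pathlib import Path


def rename_file(path, new_name, no_action=False):
    """Rename `path` to `new_name` in the same directory.

    Hypothetical sketch. With no_action=True nothing is modified and the
    planned target is returned, similar in spirit to a dry run.
    """
    path = Path(path)
    # Assumption: a heading may contain "/", which is not valid in a file name.
    safe_name = new_name.replace("/", "-").strip()
    target = path.with_name(safe_name)
    if no_action:
        return target
    if target.exists():
        # Refuse to overwrite an existing file.
        raise FileExistsError(target)
    path.rename(target)
    return target
```

Calling rename_file("index.html", "Tero Karvinen.html", no_action=True) only returns the planned path, so you can print it before committing to the rename.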

Install

$ sudo apt-get update
$ sudo apt-get install wget python3-pyquery python3-rdflib
$ wget https://raw.githubusercontent.com/terokarvinen/hoto/main/hoto.py
$ chmod ugo+x hoto.py
$ ./hoto.py
Usage: 'hoto foo.html'. Try --help.

Feel free to star hoto on GitHub.

hoto --help

usage: hoto.py [-h] [--format FORMAT] [-v] [-d]
               [--suggest | --no-suggest | -s] [--rename | --no-rename]
               [--no-action | --no-no-action | -n]
               [files ...]

hoto - rename HTML and MAFF files from HTML tags and metadata

Print new filenames built from the HTML h1 text, keeping the existing suffix. This uses the default --format, which is '{h1}.{ext}'.

$ hoto foo.html bar.maff

Print top heading (h1) of each file.

$ hoto -f '{h1}' foo.html bar.maff

Print example variables you can use.

$ hoto -s foo.html

Rename the files to HTML title, keeping existing suffix.

$ hoto -f '{title}.{ext}' foo.html bar.maff --rename

Advanced Usage

Hoto can extract HTML tags using CSS selectors, similar to jQuery and pyQuery. Hoto uses the pyQuery library for tag extraction.

$ hoto.py tero.html --format="{sel.h2}"
Python weppipalvelu - ideasta tuotantoon Palvelinten Hallinta Tunkeutumistestaus Information Security WebGoat with Podman Making Zero Days New Course: Network A
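
To see why {sel.h2} prints all headings on one line, here is a rough stdlib-only illustration of tag text extraction. It swaps in Python's html.parser instead of pyQuery, supports only plain tag names rather than CSS selectors, and the names TagText and tag_text are invented for this sketch.

```python
from html.parser import HTMLParser


class TagText(HTMLParser):
    """Collect the text inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while inside a matching tag
        self.texts = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def tag_text(html, tag):
    """Return the text of all `tag` elements, joined with single spaces."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each element.
    return " ".join(" ".join(t.split()) for t in parser.texts)
```

With two h2 elements in the page, tag_text(html, "h2") returns both texts separated by a space, which matches the shape of the output above.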

All HTML tag extractions are also supported with MAFF archives.

$ hoto.py tero.maff --format="{title}"
Tero Karvinen - Learn Free software with me

If you leave out curly brackets, they are added automatically.

$ hoto.py tero.html -f sel.title
Tero Karvinen - Learn Free software with me
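
One simple way to get this behavior is to wrap the whole format in curly brackets when it contains none. This is a guess at the idea, not hoto's actual code, and add_brackets is a hypothetical name.

```python
def add_brackets(fmt):
    """Wrap a bare expression such as 'sel.title' in curly brackets.

    Hypothetical helper: formats that already contain a bracket are
    returned unchanged.
    """
    if "{" in fmt or "}" in fmt:
        return fmt
    return "{" + fmt + "}"


print(add_brackets("sel.title"))
print(add_brackets("{h1}.{ext}"))
```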

All CSS selectors supported by pyQuery are available. For more complex selectors, use function syntax. Function syntax requires single quotes ''.

$ ./hoto.py tero.html -f "sel('h2:first')" # single quotes required with sel('')
Python weppipalvelu - ideasta tuotantoon

You can combine multiple variables and fixed text.

$ hoto.py tero.html -f "{stem} - {h1} - 2024.{ext}"
tero - Tero Karvinen - 2024.html
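
Since the format uses Python f-string syntax, a format like the one above can be evaluated by building an f-string literal from it. The sketch below assumes a plain dict of variables and a made-up function name render. It uses eval, so it runs any code in the format and should only be given formats you typed yourself.

```python
def render(fmt, variables):
    """Evaluate `fmt` as an f-string, with `variables` available as names.

    Hypothetical sketch. Triple double quotes let the format contain
    single quotes, as in sel('h2:first'). A format containing three
    double quotes in a row would break this simple approach.
    """
    source = 'f"""' + fmt + '"""'
    # eval() executes arbitrary Python: trusted input only.
    return eval(source, {}, dict(variables))


variables = {"stem": "tero", "h1": "Tero Karvinen", "ext": "html"}
print(render("{stem} - {h1} - 2024.{ext}", variables))
print(render("{h1.upper()}.{ext}", variables))
```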

Variable Types: HTML Tags with CSS Selectors

$ hoto.py tero.html --format="{sel.h2}"
$ hoto.py tero.html -f sel.title
$ ./hoto.py tero.html -f "sel('h2:first')" # single quotes required with sel('')

Variable Types: Shorthand

$ ./hoto.py tero.html -f stem
$ ./hoto.py tero.html -f h1
$ ./hoto.py tero.html --format="{h1}"
$ ./hoto.py tero.html -f ext

Variable Types: RDF for MAFF Archives

MAFF is the Mozilla Archive Format. MAFF stores a whole page, including style sheets and images, into a single ZIP file.

You can create MAFF files with the WebScrapbook addon for Firefox. The current hoto implementation of MAFF index.rdf parsing has only been developed and tested with WebScrapbook.
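
Because a MAFF file is a ZIP archive, its HTML and RDF can be read without unpacking anything to disk. The sketch below uses Python's zipfile module and assumes the usual MAFF layout where index.html and index.rdf sit inside a single top-level folder. The function name read_maff_member is invented for this example.

```python
import zipfile


def read_maff_member(maff, name="index.html"):
    """Return the text of the first archive member called `name`.

    Hypothetical sketch. `maff` is a path or a binary file object. The
    member may sit at the archive root or inside a folder.
    """
    with zipfile.ZipFile(maff) as archive:
        for member in archive.namelist():
            if member == name or member.endswith("/" + name):
                data = archive.read(member)
                # Assumption: pages are UTF-8; undecodable bytes are replaced.
                return data.decode("utf-8", errors="replace")
    raise FileNotFoundError(name)
```

read_maff_member("tero.maff", "index.rdf") would return the RDF metadata in the same way.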

$ hoto.py tero.maff -f '{rdf.archived} {rdf.originalurl} {rdf.host}'
2024-06-15 w24 Sat https://terokarvinen.com/ terokarvinen.com
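
The date part of that output (2024-06-15 w24 Sat) is the calendar.txt style: ISO date, ISO week number and short weekday. A minimal sketch of such a conversion follows. It assumes the archive time is stored as an RFC 822 style string and that the week number is zero-padded to two digits. The function name and the sample time of day are made up.

```python
from email.utils import parsedate_to_datetime

# Fixed English names, so the result does not depend on the system locale.
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def calendar_txt_date(rfc822_date):
    """Convert e.g. 'Sat, 15 Jun 2024 12:00:00 +0300' to '2024-06-15 w24 Sat'."""
    moment = parsedate_to_datetime(rfc822_date)
    week = moment.isocalendar()[1]  # ISO 8601 week number
    return f"{moment:%Y-%m-%d} w{week:02d} {DAYS[moment.weekday()]}"


print(calendar_txt_date("Sat, 15 Jun 2024 12:00:00 +0300"))
```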

See you at https://TeroKarvinen.com

positional arguments:
  files                 HTML and MAFF files (default: None)

options:
  -h, --help            show this help message and exit
  --format FORMAT, -f FORMAT
                        Output format, Python f-string syntax. Can run almost
                        any Python code. See --help for using selectors
                        (sel.h1) and specials. (default: {h1}.{ext})
  -v, --verbose         Set logging level to verbose (INFO) (default: 30)
  -d, --debug
  --suggest, --no-suggest, -s
                        Suggest tags and metadata for files, showing both
                        selectors "{sel.h1}" and matches "Tero's homepage".
                        (default: False)
  --rename, --no-rename
                        Rename files to output format. (default: False)
  --no-action, --no-no-action, -n
                        Does not actually modify any files, but shows what
                        would happen. (default: False)

Copyright 2024 Tero Karvinen https://TeroKarvinen.com
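
The paired flags in the help text (--suggest and --no-suggest, --rename and --no-rename) look like what argparse's BooleanOptionalAction generates. The sketch below rebuilds a similar parser under that assumption. It needs Python 3.9 or newer, covers only some of the options, and is not taken from hoto.py.

```python
import argparse


def build_parser():
    """Hypothetical parser that resembles the options listed above."""
    parser = argparse.ArgumentParser(
        prog="hoto.py",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("files", nargs="*", help="HTML and MAFF files")
    parser.add_argument("--format", "-f", default="{h1}.{ext}",
                        help="Output format, Python f-string syntax.")
    # BooleanOptionalAction adds a --no-... twin for every long option.
    parser.add_argument("--suggest", "-s", default=False,
                        action=argparse.BooleanOptionalAction,
                        help="Suggest tags and metadata for files.")
    parser.add_argument("--rename", default=False,
                        action=argparse.BooleanOptionalAction,
                        help="Rename files to output format.")
    parser.add_argument("--no-action", "-n", default=False,
                        action=argparse.BooleanOptionalAction,
                        help="Show what would happen, modify nothing.")
    return parser


args = build_parser().parse_args(["-n", "--rename", "foo.html"])
print(args.no_action, args.rename, args.format, args.files)
```

One detail of this sketch: BooleanOptionalAction treats every flag that starts with --no- as the negative form, so here the long --no-action spelling sets the value to False and only -n turns the dry run on.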