Hoto - extract tags and metadata from HTML and MAFF

Would you like to have nice names for your stored web pages?

I just published hoto.

Install Hoto

Extract HTML tags and metadata, optionally rename files. Supports MAFF as used by WebScrapbook.


  • Extract HTML tags
    • h1
    • title
    • .authorname
  • Rename HTML and MAFF files using HTML tags and metadata
  • Use CSS selectors, just like jQuery
  • Extract MAFF files
    • Supports MAFF (Mozilla Archive Format) archives created by the WebScrapbook Firefox add-on
    • Supports index.rdf metadata
      • archive date - automatically converted to calendar.txt format
      • original host
    • Transparently extracts HTML and RDF from compressed MAFF
  • Run almost any Python code when renaming (using f-string format)
  • Replace HTML tag content with regular expressions
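The regular expression replacement in the last item can be pictured as a substitution applied to the extracted tag text. This is a minimal sketch, assuming the search pattern is treated as a Python regular expression. The helper name `replace_in_text` is hypothetical and not part of hoto.

```python
import re


def replace_in_text(text: str, find: str, replace: str) -> str:
    """Replace every regex match of `find` in `text` with `replace`."""
    return re.sub(find, replace, text)


# Plain words work as patterns too, since they contain no regex metacharacters.
print(replace_in_text("Tero Karvinen", "Tero", "Someone"))
```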


$ wget
...'index.html' saved

$ ./ index.html
Tero Karvinen.html

$ ./ index.html --suggest|grep -v "not found"
##  index.html
Tero Karvinen - sel.h1
Tero Karvinen - sel('h1:first')
Tero Karvinen - Learn Free software with me - sel.title
Python weppipalvelu - ideasta tuotantoon - sel('h2:first')
Someone Karvinen - sel('h1',find='Tero',replace='Someone')
index.html - path
.html - path.suffix
index.html -
Build your own robots, hack computers (legally) and admin Linux boxes - hundreds of them! - sel.__description
Tero Karvinen - Learn Free software with me - title
html - ext
Tero Karvinen - h1
index.html - filename
index - stem

$ ./ index.html --rename
$ ls  'Tero Karvinen.html'


$ sudo apt-get update
$ sudo apt-get install wget python3-pyquery python3-rdflib
$ wget
$ chmod ugo+x
$ ./
Usage: 'hoto foo.html'. Try --help.

Feel free to star hoto on GitHub.

hoto --help

usage: [-h] [--format FORMAT] [-v] [-d]
               [--suggest | --no-suggest | -s] [--rename | --no-rename]
               [--no-action | --no-no-action | -n]
               [files ...]

hoto - rename HTML and MAFF files from HTML tags and metadata

Print new filenames built from the HTML h1 text, keeping the existing suffix. This uses the default --format, which is '{h1}.{ext}'.

$ hoto foo.html bar.maff

Print top heading (h1) of each file.

$ hoto -f '{h1}' foo.html bar.maff

Print example variables you can use.

$ hoto -s foo.html

Rename the files to HTML title, keeping existing suffix.

$ hoto -f '{title}.{ext}' foo.html bar.maff --rename
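The difference between printing a new name and actually renaming can be sketched as a dry-run flag around a single rename call. This is an illustration only, not hoto's code: the function name `rename_file`, the `dry_run` parameter and the refusal to overwrite an existing file are all assumptions of this sketch.

```python
from pathlib import Path


def rename_file(path: Path, new_name: str, dry_run: bool = True) -> Path:
    """Return the target path; touch the file system only if dry_run is False."""
    # with_name() keeps the directory and raises ValueError for names with "/".
    target = path.with_name(new_name)
    if target.exists():
        # Assumption of this sketch: never overwrite an existing file.
        raise FileExistsError(f"refusing to overwrite {target}")
    if not dry_run:
        path.rename(target)
    return target
```

Calling `rename_file(Path("foo.html"), "Tero Karvinen.html")` only reports the target, while passing `dry_run=False` performs the rename.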

Advanced Usage

Hoto can extract HTML tags using CSS selectors, similar to jQuery. Hoto uses the PyQuery library for tag extraction.

$ tero.html --format="{sel.h2}"
Python weppipalvelu - ideasta tuotantoon Palvelinten Hallinta Tunkeutumistestaus Information Security WebGoat with Podman Making Zero Days New Course: Network A	

All HTML tag extractions also work with MAFF archives.

$ tero.maff --format="{title}"
Tero Karvinen - Learn Free software with me

If you leave out curly brackets, they are added automatically.

$ tero.html -f sel.title
Tero Karvinen - Learn Free software with me
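Adding the brackets automatically could look like the following. This is a minimal sketch with a hypothetical helper name, `add_braces`, and it assumes that any format already containing a curly bracket is left untouched.

```python
def add_braces(fmt: str) -> str:
    """Wrap a bare expression such as 'sel.title' in curly brackets."""
    if "{" in fmt or "}" in fmt:
        # The user already wrote a full format string.
        return fmt
    return "{" + fmt + "}"


print(add_braces("sel.title"))
print(add_braces("{h1}.{ext}"))
```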

All CSS selectors supported by PyQuery are available. For more complex selectors, use function syntax. Function syntax requires single quotes around the selector.

$ ./ tero.html -f "sel('h2:first')" # single quotes required with sel('')
Python weppipalvelu - ideasta tuotantoon
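The two spellings, attribute syntax (`sel.title`) and function syntax (`sel('h2:first')`), can be served by one object. The sketch below is not hoto's implementation: hoto uses PyQuery, while this example swaps in the standard library `html.parser` and understands only bare tag names plus `:first`. The class names `Sel` and `_TagText` are hypothetical.

```python
from html.parser import HTMLParser


class _TagText(HTMLParser):
    """Collect the text content of every element with the given tag name."""

    def __init__(self, tag: str):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.matches.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


class Sel:
    """Support both sel.h1 and sel('h2:first') on the same object."""

    def __init__(self, html: str):
        self.html = html

    def __call__(self, selector: str) -> str:
        tag, _, pseudo = selector.partition(":")
        parser = _TagText(tag)
        parser.feed(self.html)
        parser.close()
        texts = [" ".join(match.split()) for match in parser.matches]
        if pseudo == "first":
            texts = texts[:1]
        # Several matches are joined with spaces.
        return " ".join(texts)

    def __getattr__(self, name: str) -> str:
        if name.startswith("_"):
            raise AttributeError(name)
        # Attribute syntax is shorthand for calling with the tag name.
        return self(name)


sel = Sel("<title>Home</title><h1>Tero Karvinen</h1><h2>One</h2><h2>Two</h2>")
print(sel.h2)
print(sel("h2:first"))
```

Routing attribute access through `__call__` keeps a single code path for extraction, so both spellings always agree.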

You can combine multiple variables and fixed text.

$ tero.html -f "{stem} - {h1} - 2024.{ext}"
tero - Tero Karvinen - 2024.html
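Since the format uses Python f-string syntax, one way to picture the evaluation is to turn the format text into an f-string literal and evaluate it with the variables in scope. This is a sketch under that assumption, not hoto's actual code. The function name `render` and the sample values are made up for illustration.

```python
def render(fmt: str, variables: dict) -> str:
    """Evaluate `fmt` as an f-string with `variables` in scope."""
    # repr() quotes the template so it can be prefixed with "f".
    # eval() runs arbitrary code: only use format strings you trust.
    return eval("f" + repr(fmt), dict(variables))


values = {"stem": "tero", "h1": "Tero Karvinen", "ext": "html"}
print(render("{stem} - {h1} - 2024.{ext}", values))
# Expressions work too, which is what "almost any Python code" implies.
print(render("{h1.upper()}.{ext}", values))
```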

Variable Types: HTML Tags with CSS Selectors

$ tero.html --format="{sel.h2}"
$ tero.html -f sel.title
$ ./ tero.html -f "sel('h2:first')" # single quotes required with sel('')

Variable Types: Shorthand

$ ./ tero.html -f stem
$ ./ tero.html -f h1
$ ./ tero.html --format="{h1}"
$ ./ tero.html -f ext
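The file name shorthands line up with what `pathlib` offers. The sketch below is an assumption about how such variables could be derived, chosen to be consistent with the --suggest output shown earlier (`ext` without the dot, `path.suffix` with it). The function name `path_variables` is hypothetical.

```python
from pathlib import Path


def path_variables(filename: str) -> dict:
    """Derive file name variables from a path."""
    path = Path(filename)
    return {
        "path": path,                      # full Path object, e.g. path.suffix
        "filename": path.name,             # "index.html"
        "stem": path.stem,                 # "index"
        "ext": path.suffix.lstrip("."),    # "html", without the leading dot
    }


variables = path_variables("index.html")
print(variables["stem"], variables["ext"], variables["path"].suffix)
```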

Variable Types: RDF for MAFF Archives

MAFF is the Mozilla Archive Format. MAFF stores a whole page, including style sheets and images, in a single ZIP file.
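Because a MAFF file is a ZIP archive, its members can be read without unpacking to disk. The following sketch uses the standard library `zipfile` module and assumes that `index.html` and `index.rdf` sit inside a folder in the archive. The function name `read_maff_member` and the folder name in the demo archive are made up.

```python
import io
import zipfile


def read_maff_member(maff, basename: str) -> bytes:
    """Return the first archive member whose file name is `basename`."""
    with zipfile.ZipFile(maff) as archive:
        for name in archive.namelist():
            if name.rsplit("/", 1)[-1] == basename:
                return archive.read(name)
    raise KeyError(f"{basename} not found in archive")


# Build a tiny in-memory archive for the demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("20240615-example/index.html", "<h1>Tero Karvinen</h1>")
    archive.writestr("20240615-example/index.rdf", "<RDF/>")

print(read_maff_member(buffer, "index.html").decode("utf-8"))
```

`maff` can be a file name such as `tero.maff` or any file-like object, as accepted by `zipfile.ZipFile`.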

You can create MAFF files with the WebScrapbook add-on for Firefox. Hoto's current MAFF index.rdf parsing has only been developed and tested with WebScrapbook.

$ tero.maff -f '{rdf.archived} {rdf.originalurl} {}'
2024-06-15 w24 Sat
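The date in the output above follows the calendar.txt layout: ISO date, week number, weekday. A minimal sketch of producing that layout from a date, assuming ISO week numbering and a zero-padded week; the function name `calendar_txt` is hypothetical.

```python
from datetime import date

# Fixed English names avoid depending on the system locale.
DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def calendar_txt(day: date) -> str:
    """Format a date as 'YYYY-MM-DD wWW Ddd'."""
    _year, week, weekday = day.isocalendar()
    return f"{day.isoformat()} w{week:02d} {DAYS[weekday - 1]}"


print(calendar_txt(date(2024, 6, 15)))  # 2024-06-15 w24 Sat
```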

See you at

positional arguments:
  files                 HTML and MAFF files (default: None)

  -h, --help            show this help message and exit
  --format FORMAT, -f FORMAT
                        Output format, Python f-string syntax. Can run almost
                        any Python code. See --help for using selectors
                        (sel.h1) and specials. (default: {h1}.{ext})
  -v, --verbose         Set logging level to verbose (INFO) (default: 30)
  -d, --debug
  --suggest, --no-suggest, -s
                        Suggest tags and metadata for files, showing both
                        selectors "{sel.h1}" and matches "Tero's homepage".
                        (default: False)
  --rename, --no-rename
                        Rename files to output format. (default: False)
  --no-action, --no-no-action, -n
                        Does not actually modify any files, but shows what
                        would happen. (default: False)

Copyright 2024 Tero Karvinen