Skip to main content

Crawler FAQ

Heritrix vs ArchiveBox

  • ArchiveBox - MIT,Python
    • Web UI is primary interface
    • Command Line is secondar interface
    • More like the WebArchive
    • Support more extract
    • Support Chrome integration
  • Heritrix - Apache-2.0,Java
    • Configuration is primary interface
    • Web UI used for start, stop and monitor job
    • More like the engine of WebArchive
    • Support crawl full site

获取页面所有 URL

copy(Array.from(new Set($$('a[href]').map((v) => decodeURI(v.href)))).sort());