Category

Web archiving

GNU Wget (or just Wget, formerly Geturl, also written as its package name, wget) is a computer program that retrieves content from web servers. It is part of the GNU Project. Its name derives from "World Wide Web" and "get", an HTTP request method. It supports downloading via HTTP, HTTPS, and FTP.
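
As a rough illustration of the kind of retrieval described above, the following Python sketch downloads a single URL to a local file using only the standard library. It is not how Wget itself is implemented; the function name `fetch_to_file` and the example URL in the note below are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_to_file(url: str, dest: Path) -> int:
    """Download `url` and write the response body to `dest`.

    Returns the number of bytes written.
    Hypothetical helper for illustration; not part of Wget.
    """
    # urlopen picks a handler based on the URL scheme (http, https, ftp, ...).
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Stream the body to disk in chunks instead of loading it all in memory.
        shutil.copyfileobj(response, out)
    return dest.stat().st_size
```

For example, `fetch_to_file("https://example.org/", Path("index.html"))` would save that page to `index.html` in the current directory. Unlike Wget, this sketch does not retry, resume, or follow links recursively.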

process of data preservation done by collecting and saving web content

HTTrack is a free and open-source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.

nonprofit organization and eponym of a large, periodic, open web crawl

Wikiwix is a web-based search engine that indexes and searches Wikipedia articles. It also provides a related archiving service that preserves snapshots of referenced web pages.

file format that specifies a method for combining multiple digital resources into an aggregate archival file together with related information

Software Heritage

infrastructure supported by INRIA and UNESCO; its three core missions are to collect, preserve, and share the source code of publicly available software

Heritrix is a web crawler designed for web archiving. It was originally written jointly by the Internet Archive, the National Library of Norway, and the National Library of Iceland. Heritrix is written in Java and is available under a free software license. Its main interface is accessed through a web browser, and a command-line tool can optionally be used to initiate crawls.