Category

Web archiving

Wget
GNU Wget (or just Wget, formerly Geturl, also written as its package name, wget) is a computer program that retrieves content from web servers. It is part of the GNU Project. Its name derives from "World Wide Web" and "get", an HTTP request method. It supports downloading via HTTP, HTTPS, and FTP.
web archiving
process of data preservation done by collecting and saving web content
HTTrack
HTTrack is a free and open-source Web crawler and offline browser, developed by Xavier Roche and licensed under the GNU General Public License Version 3.
Common Crawl
nonprofit organization that produces, and gives its name to, a large, periodically updated, open crawl of the web
Wikiwix
Wikiwix is a web-based search engine that indexes and searches Wikipedia articles. It also provides a related archiving service that preserves snapshots of referenced web pages.
Web ARChive
file format that specifies a method for combining multiple digital resources into an aggregate archival file together with related information
Software Heritage
infrastructure supported by INRIA and UNESCO whose three core missions are to collect, preserve, and share the source code of publicly available software
Heritrix
Heritrix is a web crawler designed for web archiving. It was originally written in collaboration between the Internet Archive, National Library of Norway and National Library of Iceland. Heritrix is available under a free software license and written in Java. The main interface is accessible using a web browser, and there is a command-line tool that can optionally be used to initiate crawls.