Command-line HTML processing tool that uses CSS selectors to filter and extract data from HTML documents
pup is a command-line HTML processor that reads HTML from stdin and filters it with CSS selectors. Inspired by jq, it provides a terminal-based approach to parsing and extracting data from HTML documents. It supports standard CSS selectors, including class, ID, element, attribute, and pseudo-class selectors, along with the combinators + (adjacent sibling), > (child), and , (match either selector) for more complex queries.
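The selector types described above can be sketched as follows. This is a minimal illustration that assumes pup is installed and on PATH; the sample HTML and the variable names are made up for the example.

```shell
# Hypothetical sample document, kept on one line so that no
# whitespace-only text nodes end up in the output.
html='<div id="nav"><a class="ext" href="https://example.com">Example</a><a href="/about">About</a></div><p class="note">Footer</p>'

# ID selector, child combinator (>), and class selector:
# anchors with class "ext" that are direct children of #nav.
ext=$(printf '%s' "$html" | pup '#nav > a.ext text{}')

# Attribute selector: anchors whose href starts with "/".
local_links=$(printf '%s' "$html" | pup 'a[href^="/"] text{}')

# Comma: elements matching either selector.
either=$(printf '%s' "$html" | pup 'a.ext, p.note text{}')

# Adjacent sibling combinator (+): a <p> immediately following the <div>.
after_nav=$(printf '%s' "$html" | pup 'div + p text{}')

printf '%s\n' "$ext" "$local_links" "$either" "$after_nav"
```

Quoting the whole selector in single quotes keeps the shell from interpreting characters such as >, [, and { before pup sees them.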
The tool offers several output formats through display functions, which are appended to the selector. text{} prints plain text content, attr{key} prints the value of the named attribute, and json{} renders the selected elements as JSON with configurable indentation. With no display function, pup prints the matching HTML; because it cleans up and indents malformed markup on output, it is useful for HTML formatting as well as data extraction.
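A short sketch of the three display functions applied to the same element may help. It assumes pup is on PATH; the HTML snippet and variable names are illustrative only, and the --indent flag is used here on the assumption that it also controls the JSON indentation mentioned above.

```shell
# Hypothetical one-element document.
html='<a id="home" href="/index.html" title="Home page">Home</a>'

# text{}: the element's text content.
text=$(printf '%s' "$html" | pup 'a#home text{}')

# attr{key}: the value of one attribute, here href.
href=$(printf '%s' "$html" | pup 'a#home attr{href}')

# json{}: the element as JSON (tag name, attributes, text).
json=$(printf '%s' "$html" | pup 'a#home json{}')

# Same JSON with a wider indent, via pup's --indent flag.
json_wide=$(printf '%s' "$html" | pup --indent 4 'a#home json{}')

printf 'text: %s\nhref: %s\n%s\n%s\n' "$text" "$href" "$json" "$json_wide"
```

Because json{} emits ordinary JSON, its output can be piped straight into jq or any other JSON consumer.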
pup fits into standard Unix pipelines and web scraping workflows. Common use cases include extracting links and titles from web pages, parsing structured data from HTML documents, and converting HTML to JSON for further processing. It supports most CSS3 selectors, including :nth-child() and :empty, as well as non-standard extensions such as :contains() and the pup-specific :parent-of(). It is aimed at developers and data analysts who handle HTML from APIs, web scraping, or document processing on the command line.
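The pipeline below is a rough sketch of how pup might be combined with other Unix tools, again assuming pup is on PATH. The table is invented sample data; in a real workflow the HTML would more likely come from something like curl -s piped into pup.

```shell
# Hypothetical table with a duplicated row, kept on one line.
html='<table><tr><td>Alpha</td><td><a href="/a">open</a></td></tr><tr><td>Beta</td><td><a href="/b">open</a></td></tr><tr><td>Alpha</td><td><a href="/a">open</a></td></tr></table>'

# Extract every link target and de-duplicate with sort.
links=$(printf '%s' "$html" | pup 'a attr{href}' | sort -u)

# :nth-child(1): the first cell of each row, de-duplicated.
names=$(printf '%s' "$html" | pup 'td:nth-child(1) text{}' | sort -u)

# :contains(): cells whose text includes "Beta".
beta=$(printf '%s' "$html" | pup 'td:contains("Beta") text{}')

# :parent-of(): cells that are the parent of a link to /b.
cell=$(printf '%s' "$html" | pup 'td:parent-of(a[href="/b"]) text{}')

printf '%s\n' "$links" "$names" "$beta" "$cell"
```

Keeping selection in pup and de-duplication in sort follows the usual Unix split of one job per tool.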
# via Go (1.16 or later; on older toolchains use: go get github.com/ericchiang/pup)
go install github.com/ericchiang/pup@latest
# via Homebrew
brew install pup