Command line tool for processing HTML using CSS selectors to filter and extract content from web pages
pup is a command line HTML processor that reads from stdin and filters content using CSS selectors. It parses HTML documents and applies CSS selector syntax to extract specific elements, attributes, or text content, similar to how jq processes JSON data.
The tool supports a comprehensive set of CSS selectors including class, ID, element, attribute, and pseudo-class selectors. Selectors can be chained together and joined with the +, >, and comma (,) combinators for more complex queries. For example, pup 'table table tr:nth-last-of-type(n+2) td.title a' selects the links inside td.title cells in every row except the last of a table nested inside another table.
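As a minimal sketch of the selector kinds described above, the following runs a few queries against a small inline document. The sample HTML and its class and ID names are made up for illustration, and the commands assume pup is installed and on the PATH.

```shell
# Sample document, invented for this illustration only.
html='<ul id="menu"><li class="item"><a href="/a">First</a></li><li class="item"><a href="/b">Second</a></li></ul><p><a href="/c">Outside</a></p>'

# Element and class selectors chained as a descendant query:
# every <a> somewhere inside an <li class="item">.
printf '%s' "$html" | pup 'li.item a'

# Child combinator: only <li> elements that are direct children of #menu.
printf '%s' "$html" | pup '#menu > li'

# Pseudo-class: the link inside an <li> that is the first child of its parent.
printf '%s' "$html" | pup 'li:first-child a'

# Attribute selector: the <a> whose href is exactly "/c".
printf '%s' "$html" | pup 'a[href="/c"]'
```

Each command prints the matching elements as indented HTML, so the effect of each selector can be compared side by side.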
pup includes display functions that transform the output format. The text{} function extracts plain text content, attr{attrkey} retrieves attribute values, and json{} converts HTML elements to JSON objects with configurable indentation. When no display function is given, pup prints the selected HTML cleaned up and indented, so that even malformed markup is readable, and it can optionally highlight the output in color.
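The display functions can be tried on a single element. This is a small sketch rather than a full reference: the sample HTML and the variable names are assumptions made for the example, and pup is assumed to be installed.

```shell
# Sample element, invented for this illustration only.
html='<div id="post"><a class="ref" href="https://example.com/docs" title="Docs">Read the docs</a></div>'

# text{} keeps only the text content of the selected element.
link_text=$(printf '%s' "$html" | pup 'a.ref text{}')
echo "$link_text"   # prints: Read the docs

# attr{href} prints the value of the href attribute.
link_href=$(printf '%s' "$html" | pup 'a.ref attr{href}')
echo "$link_href"   # prints: https://example.com/docs

# json{} serializes the selected element as JSON;
# --indent sets the number of spaces used for indentation.
printf '%s' "$html" | pup --indent 2 'a.ref json{}'
```

Capturing the output in shell variables, as done here, is one convenient way to feed extracted values into the rest of a script.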
The tool is particularly useful for web scraping, HTML parsing in shell scripts, and extracting structured data from web pages. It integrates well with curl, wget, and other command line tools through Unix pipes, enabling developers and system administrators to process web content efficiently from the terminal.
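Because pup reads from stdin, it can be wrapped in a small shell function and placed at the end of any pipeline. The sketch below defines a hypothetical helper, extract_links, which is not part of pup; it assumes pup and the standard sort utility are available.

```shell
# extract_links: hypothetical helper (not part of pup).
# Reads HTML on stdin and prints each distinct link target once, sorted.
extract_links() {
  pup 'a attr{href}' | sort -u
}

# Local demonstration with an inline document, so no network access is needed.
printf '%s' '<a href="/b">b</a><a href="/a">a</a><a href="/b">again</a>' | extract_links
```

The same function works on a downloaded page, for instance `curl -s https://example.com/ | extract_links` or `wget -qO- https://example.com/ | extract_links`, where the URL is only a placeholder.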
# via Go
go get github.com/ericchiang/pup
# via Homebrew
brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb