AutoScraper

The autoscraper_proxy module provides proxy header support for AutoScraper.

Installation

First, install AutoScraper:

pip install autoscraper

The proxy header extension itself is imported from python_proxy_headers.autoscraper_proxy, so that package must be installed as well.

Usage

Basic Usage

The ProxyAutoScraper class is a drop-in replacement for AutoScraper that adds proxy header capabilities:

from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper

# Create a scraper with proxy headers
scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'})

# Build rules from a sample page
result = scraper.build(
    url='https://finance.yahoo.com/quote/AAPL/',
    wanted_list=['Apple Inc.'],
    request_args={'proxies': {'https': 'http://proxy.example.com:8080'}}
)

print(result)
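
The proxy URL in the example above travels inside request_args. The following is a minimal sketch of a helper that builds that dict, assuming request_args is forwarded to the HTTP client in the same way upstream AutoScraper forwards it to requests. The helper name proxy_request_args and the credentials are hypothetical and not part of the library:

```python
from urllib.parse import quote


def proxy_request_args(host, port, username=None, password=None):
    """Hypothetical helper: build a request_args dict that routes
    both HTTP and HTTPS traffic through a single proxy."""
    auth = ""
    if username is not None:
        # Percent-encode credentials so characters such as '@' or ':'
        # cannot be mistaken for URL delimiters.
        auth = quote(username, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    proxy_url = f"http://{auth}{host}:{port}"
    return {"proxies": {"http": proxy_url, "https": proxy_url}}


# Placeholder host and credentials, for illustration only
args = proxy_request_args("proxy.example.com", 8080, "user", "p@ss")
print(args["proxies"]["https"])  # http://user:p%40ss@proxy.example.com:8080
```

The returned dict can be passed as request_args to build() or to any of the get_result methods.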

Using Learned Rules

Once you’ve built rules, you can use them on other pages:

from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper

scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'})

# Build rules
scraper.build(
    url='https://finance.yahoo.com/quote/AAPL/',
    wanted_list=['Apple Inc.'],
    request_args={'proxies': {'https': 'http://proxy:8080'}}
)

# Use rules on another page
result = scraper.get_result_similar(
    url='https://finance.yahoo.com/quote/GOOG/',
    request_args={'proxies': {'https': 'http://proxy:8080'}}
)

print(result)  # ['Alphabet Inc.']

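Saving and Loading Rules

You can save learned rules to a file and load them into another scraper later:

from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper

scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'})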

# Build and save rules
scraper.build(url='...', wanted_list=['...'])
scraper.save('my_rules.json')

# Later, load rules
scraper2 = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'UK'})
scraper2.load('my_rules.json')
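
Before loading a rules file into a new scraper, it can be useful to check that the file actually contains rules. The sketch below assumes the file layout used by upstream AutoScraper (a JSON object with a stack_list key, or a bare list in older files). Whether ProxyAutoScraper keeps that layout is an assumption, and count_rules and the demo rule are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def count_rules(path):
    """Hypothetical helper: return how many rules a saved rules file holds."""
    data = json.loads(Path(path).read_text())
    # Assumed layout: {"stack_list": [...]}; a bare list is accepted too.
    rules = data if isinstance(data, list) else data.get("stack_list", [])
    return len(rules)


# Demo against a stand-in file; the rule content is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "my_rules.json"
    demo_path.write_text(json.dumps({"stack_list": [{"stack_id": "rule_demo"}]}))
    demo_count = count_rules(demo_path)

print(demo_count)  # 1
```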

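Context Manager

Use ProxyAutoScraper as a context manager so that the underlying session is closed when the block exits:

from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper

with ProxyAutoScraper(proxy_headers={'X-Custom': 'value'}) as scraper: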
    result = scraper.build(
        url='https://example.com',
        wanted_list=['Example Domain'],
        request_args={'proxies': {'https': 'http://proxy:8080'}}
    )
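
The cleanup guarantee comes from the context manager protocol: close() runs when the block exits, even if a request raises. The following sketch shows that lifecycle with a hypothetical stand-in class, ClosingScraper. It mimics only open and close, and is not the library's implementation:

```python
class ClosingScraper:
    """Hypothetical stand-in that models only the open/close lifecycle."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # let any exception propagate to the caller


# Normal exit: close() runs once the block ends.
with ClosingScraper() as scraper:
    inside = scraper.closed
after = scraper.closed

# Exit through an exception: close() still runs.
try:
    with ClosingScraper() as failing:
        raise RuntimeError("request failed")
except RuntimeError:
    pass

print(inside, after, failing.closed)  # False True True
```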

Updating Proxy Headers

You can change the proxy headers on an existing scraper at runtime:

from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper

scraper = ProxyAutoScraper(proxy_headers={'X-Country': 'US'})

# Make some requests...

# Change proxy headers
scraper.set_proxy_headers({'X-Country': 'UK'})

# Subsequent requests use new headers
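
The API reference notes that set_proxy_headers causes a new session to be created on the next request. The sketch below shows one way such lazy re-creation can work, using a hypothetical stand-in class, LazySessionHolder, in which a plain dict plays the role of the session. The library's internals may differ:

```python
class LazySessionHolder:
    """Hypothetical stand-in: rebuilds its session lazily after a header change."""

    def __init__(self, proxy_headers=None):
        self._proxy_headers = dict(proxy_headers or {})
        self._session = None
        self.sessions_created = 0

    def set_proxy_headers(self, proxy_headers):
        self._proxy_headers = dict(proxy_headers)
        self._session = None  # discard now, rebuild on the next request

    def _get_session(self):
        if self._session is None:
            # A dict stands in for a real HTTP session object.
            self._session = {"proxy_headers": dict(self._proxy_headers)}
            self.sessions_created += 1
        return self._session

    def request(self):
        # Return the headers the current session would send to the proxy.
        return self._get_session()["proxy_headers"]


holder = LazySessionHolder({'X-Country': 'US'})
first = holder.request()
holder.request()  # reuses the existing session
holder.set_proxy_headers({'X-Country': 'UK'})
created_before = holder.sessions_created  # still 1: nothing rebuilt yet
second = holder.request()

print(first, second, holder.sessions_created)
```

Deferring the rebuild until the next request means that several header changes in a row cost only one new session.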

API Reference

ProxyAutoScraper Class

class ProxyAutoScraper(proxy_headers=None, stack_list=None)

AutoScraper subclass with proxy header support.

Inherits all methods from autoscraper.AutoScraper.

Parameters:

proxy_headers – Dict of headers to send to proxy servers
stack_list – Initial stack list (rules) for the scraper

set_proxy_headers(proxy_headers)

Update the proxy headers. Creates a new session on the next request.

Parameters:

proxy_headers – New proxy headers to use

close()

Close the underlying session.

build(url=None, wanted_list=None, wanted_dict=None, html=None, request_args=None, update=False, text_fuzz_ratio=1.0)

Build scraping rules with proxy header support.

Parameters:

url – URL of the target web page
wanted_list – List of needed contents to be scraped
wanted_dict – Dict of needed contents (keys are aliases)
html – HTML string (alternative to URL)
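request_args – Extra arguments for the underlying HTTP request, such as proxies
update – If True, add the new rules to the existing ones instead of replacing them
text_fuzz_ratio – Fuzziness ratio for text matching, from 0 to 1; the default of 1.0 requires an exact match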

Returns:

List of similar results
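
To make the effect of text_fuzz_ratio concrete, here is a small sketch of ratio-based text matching. It assumes the comparison behaves like upstream AutoScraper's, which uses difflib's SequenceMatcher. The function name fuzzy_text_match is hypothetical:

```python
from difflib import SequenceMatcher


def fuzzy_text_match(wanted, candidate, ratio_limit=1.0):
    """Hypothetical helper: compare page text against a wanted string."""
    if ratio_limit >= 1:
        # A ratio of 1.0 means the strings must be identical.
        return wanted == candidate
    return SequenceMatcher(None, wanted, candidate).ratio() >= ratio_limit


# The candidate is missing the trailing period.
exact = fuzzy_text_match("Apple Inc.", "Apple Inc", 1.0)
fuzzy = fuzzy_text_match("Apple Inc.", "Apple Inc", 0.9)

print(exact, fuzzy)  # False True
```

Lowering the ratio lets build() tolerate small differences between wanted_list entries and the page text, at the cost of possible false matches.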

get_result_similar(url=None, html=None, soup=None, request_args=None, ...)

Get similar results with proxy header support.

get_result_exact(url=None, html=None, soup=None, request_args=None, ...)

Get exact results with proxy header support.

get_result(url=None, html=None, request_args=None, ...)

Get both similar and exact results with proxy header support.