AutoScraper
The autoscraper_proxy module provides proxy header support for AutoScraper.
Installation
First, install AutoScraper:
pip install autoscraper
Then you can use the proxy header extension.
Usage
Basic Usage
The ProxyAutoScraper class is a drop-in replacement for AutoScraper
that adds proxy header capabilities:
from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper
# Create a scraper with proxy headers
scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'})
# Build rules from a sample page
result = scraper.build(
    url='https://finance.yahoo.com/quote/AAPL/',
    wanted_list=['Apple Inc.'],
    request_args={'proxies': {'https': 'http://proxy.example.com:8080'}}
)
print(result)
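The `proxies` mapping passed in `request_args` appears to follow the scheme-keyed convention of the `requests` library, so an entry for `https` alone would not cover plain `http://` targets. A minimal sketch of a helper that builds the mapping for both schemes; `make_request_args` is a hypothetical name, not part of python_proxy_headers, and the proxy URL is a placeholder:

```python
def make_request_args(proxy_url, schemes=("http", "https")):
    """Return a request_args dict that routes the given schemes through one proxy.

    Hypothetical helper for illustration; not part of python_proxy_headers.
    """
    return {"proxies": {scheme: proxy_url for scheme in schemes}}


# Placeholder proxy URL, as in the example above.
request_args = make_request_args("http://proxy.example.com:8080")
print(request_args)
```

The resulting dict can be passed as `request_args=` to `build()` in place of the literal shown above.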
Using Learned Rules
Once you’ve built rules, you can use them on other pages:
from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper
scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'})
# Build rules
scraper.build(
    url='https://finance.yahoo.com/quote/AAPL/',
    wanted_list=['Apple Inc.'],
    request_args={'proxies': {'https': 'http://proxy:8080'}}
)
# Use rules on another page
result = scraper.get_result_similar(
    url='https://finance.yahoo.com/quote/GOOG/',
    request_args={'proxies': {'https': 'http://proxy:8080'}}
)
print(result)  # ['Alphabet Inc.']
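Because the learned rules live on the scraper object, the same pattern extends to a batch of pages. The sketch below loops over ticker symbols and calls `get_result_similar` with the keyword arguments shown above. `scrape_symbols` and `EchoScraper` are hypothetical names introduced for illustration, and `MSFT` is just a sample symbol; `EchoScraper` is a stand-in so the block runs without network access, and in practice you would pass a `ProxyAutoScraper` on which `build()` has already been called.

```python
def scrape_symbols(scraper, symbols, request_args=None):
    """Apply already-built rules to one quote page per ticker symbol.

    `scraper` is any object exposing
    get_result_similar(url=..., request_args=...).
    """
    results = {}
    for symbol in symbols:
        url = f"https://finance.yahoo.com/quote/{symbol}/"
        results[symbol] = scraper.get_result_similar(
            url=url, request_args=request_args
        )
    return results


class EchoScraper:
    """Stand-in that returns the URL it was asked for (no network access)."""

    def get_result_similar(self, url=None, request_args=None):
        return [url]


print(scrape_symbols(EchoScraper(), ["GOOG", "MSFT"]))
```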
Saving and Loading Rules
You can save and load learned rules:
from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper
scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'})
# Build and save rules
scraper.build(url='...', wanted_list=['...'])
scraper.save('my_rules.json')
# Later, load rules
scraper2 = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'UK'})
scraper2.load('my_rules.json')
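To check what a saved file contains before loading it, it can be read with the standard `json` module. The sketch below assumes the file follows upstream AutoScraper's layout, a JSON object with a `stack_list` key (or a bare list in files from older versions); that layout is an assumption here, not something this page documents. `count_saved_rules` is a hypothetical helper, and the demo file is hand-written with a placeholder entry rather than a real rule.

```python
import json
import os
import tempfile


def count_saved_rules(path):
    """Return the number of rules stored in a saved rules file.

    Assumes the layout {"stack_list": [...]}, or a bare list.
    """
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    stack_list = data if isinstance(data, list) else data.get("stack_list", [])
    return len(stack_list)


# Demonstration with a hand-written file; a real one comes from scraper.save().
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "my_rules.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"stack_list": [{"placeholder": "not a real rule"}]}, f)
    demo_count = count_saved_rules(demo_path)

print(demo_count)
```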
Context Manager
Use ProxyAutoScraper as a context manager so that the underlying session is closed when the block exits:
with ProxyAutoScraper(proxy_headers={'X-Custom': 'value'}) as scraper:
    result = scraper.build(
        url='https://example.com',
        wanted_list=['Example Domain'],
        request_args={'proxies': {'https': 'http://proxy:8080'}}
    )
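The cleanup guarantee comes from Python's context-manager protocol. The sketch below uses a stand-in class, `DemoScraper` (hypothetical, not the library's implementation), to show that a `with` block is equivalent to wrapping the work in `try`/`finally` and calling `close()`. It assumes that `ProxyAutoScraper` calls `close()` on exit, which this page implies but does not spell out.

```python
class DemoScraper:
    """Stand-in with a close() method and context-manager support."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised inside the block


# The with-statement form ...
with DemoScraper() as s1:
    pass  # scraping work would go here

# ... behaves like an explicit try/finally around close().
s2 = DemoScraper()
try:
    pass  # scraping work would go here
finally:
    s2.close()

print(s1.closed, s2.closed)
```

If you do not use a `with` block, calling `close()` yourself in a `finally` clause gives the same effect.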
Updating Proxy Headers
You can update proxy headers at runtime:
scraper = ProxyAutoScraper(proxy_headers={'X-Country': 'US'})
# Make some requests...
# Change proxy headers
scraper.set_proxy_headers({'X-Country': 'UK'})
# Subsequent requests use new headers
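The API reference notes that `set_proxy_headers` causes a new session to be created on the next request. One common way to get that behaviour is lazy session creation: drop the cached session when the headers change and rebuild it on demand. The sketch below illustrates the pattern with a stand-in class; `LazySessionDemo`, `_get_session`, and the dict used as a "session" are all hypothetical and are not the library's internals.

```python
class LazySessionDemo:
    """Stand-in showing lazy session re-creation after a header change."""

    def __init__(self, proxy_headers=None):
        self._proxy_headers = dict(proxy_headers or {})
        self._session = None
        self.sessions_created = 0

    def set_proxy_headers(self, proxy_headers):
        self._proxy_headers = dict(proxy_headers)
        self._session = None  # discard the old session; rebuild on next use

    def _get_session(self):
        if self._session is None:
            self.sessions_created += 1
            # A plain dict stands in for a real HTTP session object.
            self._session = {"proxy_headers": dict(self._proxy_headers)}
        return self._session


demo = LazySessionDemo({"X-Country": "US"})
first = demo._get_session()
demo.set_proxy_headers({"X-Country": "UK"})
second = demo._get_session()
print(first["proxy_headers"], second["proxy_headers"], demo.sessions_created)
```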
API Reference
ProxyAutoScraper Class
- class ProxyAutoScraper(proxy_headers=None, stack_list=None)
AutoScraper subclass with proxy header support.
Inherits all methods from autoscraper.AutoScraper.
- Parameters:
proxy_headers – Dict of headers to send to proxy servers
stack_list – Initial stack list (rules) for the scraper
- set_proxy_headers(proxy_headers)
Update the proxy headers. Creates a new session on next request.
- Parameters:
proxy_headers – New proxy headers to use
- close()
Close the underlying session.
- build(url=None, wanted_list=None, wanted_dict=None, html=None, request_args=None, update=False, text_fuzz_ratio=1.0)
Build scraping rules with proxy header support.
- Parameters:
url – URL of the target web page
wanted_list – List of needed contents to be scraped
wanted_dict – Dict of needed contents (keys are aliases)
html – HTML string (alternative to URL)
request_args – Request arguments including proxies
update – If True, add to existing rules
text_fuzz_ratio – Fuzziness ratio for matching
- Returns:
List of similar results
- get_result_similar(url=None, html=None, soup=None, request_args=None, ...)
Get similar results with proxy header support.
- get_result_exact(url=None, html=None, soup=None, request_args=None, ...)
Get exact results with proxy header support.
- get_result(url=None, html=None, request_args=None, ...)
Get both similar and exact results with proxy header support.
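The `text_fuzz_ratio` parameter of `build()` is described only as a "fuzziness ratio for matching". One plausible reading is that a wanted string matches page text when their similarity ratio is at least `text_fuzz_ratio`, with the default of 1.0 requiring an exact match; upstream AutoScraper appears to work this way, but that is an assumption here. A minimal sketch under that assumption, using the standard library's `difflib`; `text_matches` is a hypothetical name:

```python
from difflib import SequenceMatcher


def text_matches(wanted, candidate, ratio_limit=1.0):
    """Return True if candidate is close enough to wanted.

    Illustrative only: assumes similarity is measured with
    difflib.SequenceMatcher and that a limit of 1.0 means exact equality.
    """
    if ratio_limit >= 1.0:
        return wanted == candidate
    return SequenceMatcher(None, wanted, candidate).ratio() >= ratio_limit


# "Apple Inc." vs "Apple Inc" differ by one trailing character.
print(text_matches("Apple Inc.", "Apple Inc", 1.0))
print(text_matches("Apple Inc.", "Apple Inc", 0.9))
```

Under this reading, lowering the ratio makes `build()` more tolerant of small differences between the wanted text and what is on the page, at the cost of possible false matches.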