AutoScraper =========== The ``autoscraper_proxy`` module provides proxy header support for AutoScraper. Installation ------------ First, install AutoScraper:: pip install autoscraper Then you can use the proxy header extension. Usage ----- Basic Usage ~~~~~~~~~~~ The ``ProxyAutoScraper`` class is a drop-in replacement for ``AutoScraper`` that adds proxy header capabilities: .. code-block:: python from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper # Create a scraper with proxy headers scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'}) # Build rules from a sample page result = scraper.build( url='https://finance.yahoo.com/quote/AAPL/', wanted_list=['Apple Inc.'], request_args={'proxies': {'https': 'http://proxy.example.com:8080'}} ) print(result) Using Learned Rules ~~~~~~~~~~~~~~~~~~~ Once you've built rules, you can use them on other pages: .. code-block:: python from python_proxy_headers.autoscraper_proxy import ProxyAutoScraper scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'}) # Build rules scraper.build( url='https://finance.yahoo.com/quote/AAPL/', wanted_list=['Apple Inc.'], request_args={'proxies': {'https': 'http://proxy:8080'}} ) # Use rules on another page result = scraper.get_result_similar( url='https://finance.yahoo.com/quote/GOOG/', request_args={'proxies': {'https': 'http://proxy:8080'}} ) print(result) # ['Alphabet Inc.'] Saving and Loading Rules ~~~~~~~~~~~~~~~~~~~~~~~~ You can save and load learned rules: .. code-block:: python scraper = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'US'}) # Build and save rules scraper.build(url='...', wanted_list=['...']) scraper.save('my_rules.json') # Later, load rules scraper2 = ProxyAutoScraper(proxy_headers={'X-ProxyMesh-Country': 'UK'}) scraper2.load('my_rules.json') Context Manager ~~~~~~~~~~~~~~~ Use as a context manager to ensure proper cleanup: .. code-block:: python with ProxyAutoScraper(proxy_headers={'X-Custom': 'value'}) as scraper: result = scraper.build( url='https://example.com', wanted_list=['Example Domain'], request_args={'proxies': {'https': 'http://proxy:8080'}} ) Updating Proxy Headers ~~~~~~~~~~~~~~~~~~~~~~ You can update proxy headers at runtime: .. code-block:: python scraper = ProxyAutoScraper(proxy_headers={'X-Country': 'US'}) # Make some requests... # Change proxy headers scraper.set_proxy_headers({'X-Country': 'UK'}) # Subsequent requests use new headers API Reference ------------- ProxyAutoScraper Class ~~~~~~~~~~~~~~~~~~~~~~ .. py:class:: ProxyAutoScraper(proxy_headers=None, stack_list=None) AutoScraper subclass with proxy header support. Inherits all methods from ``autoscraper.AutoScraper``. :param proxy_headers: Dict of headers to send to proxy servers :param stack_list: Initial stack list (rules) for the scraper .. py:method:: set_proxy_headers(proxy_headers) Update the proxy headers. Creates a new session on next request. :param proxy_headers: New proxy headers to use .. py:method:: close() Close the underlying session. .. py:method:: build(url=None, wanted_list=None, wanted_dict=None, html=None, request_args=None, update=False, text_fuzz_ratio=1.0) Build scraping rules with proxy header support. :param url: URL of the target web page :param wanted_list: List of needed contents to be scraped :param wanted_dict: Dict of needed contents (keys are aliases) :param html: HTML string (alternative to URL) :param request_args: Request arguments including proxies :param update: If True, add to existing rules :param text_fuzz_ratio: Fuzziness ratio for matching :returns: List of similar results .. py:method:: get_result_similar(url=None, html=None, soup=None, request_args=None, ...) Get similar results with proxy header support. .. py:method:: get_result_exact(url=None, html=None, soup=None, request_args=None, ...) Get exact results with proxy header support. .. py:method:: get_result(url=None, html=None, request_args=None, ...) Get both similar and exact results with proxy header support.