Construct a Custom Parser

Basically, constructing a custom Parser should follow three rules:

  1. Inherit the Parser or KeywordParser class, according to your task requirements.

  2. Complete the custom Parser class using the parameters provided by the base class, plus any additional parameters your task needs.

  3. Override the .run() method so that it returns a list of ImageInfo. An error will be raised if you do not override it!
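The three rules can be sketched as below. To keep the snippet runnable without the library installed, Parser and ImageInfo here are simplified stand-ins (assumptions, not the real classes). The real Parser is an abstract base class with an abstract run(), which is why Python refuses to instantiate a subclass that does not override it. MySiteParser, its tag parameter and the URLs are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ImageInfo:
    # Stand-in for image_crawler_utils.ImageInfo; the fields are assumptions.
    url: str
    name: str
    info: dict = field(default_factory=dict)


class Parser(ABC):
    # Stand-in for image_crawler_utils.Parser, reduced to what this sketch needs.
    def __init__(self, station_url, crawler_settings=None, cookies=None):
        self.station_url = station_url
        self.crawler_settings = crawler_settings
        self.cookies = cookies

    @abstractmethod
    def run(self):
        ...


class MySiteParser(Parser):  # Rule 1: inherit the base class
    def __init__(self, station_url, tag, crawler_settings=None, cookies=None):
        # Rule 2: hand the base parameters to super().__init__(), keep your own.
        super().__init__(station_url, crawler_settings, cookies)
        self.tag = tag

    def run(self):  # Rule 3: override run() and return a list of ImageInfo
        # A real parser would fetch and parse pages here.
        name = f"{self.tag}_1.png"
        return [ImageInfo(url=f"{self.station_url}img/{name}", name=name)]


class IncompleteParser(Parser):
    pass  # run() is not overridden


images = MySiteParser("https://example.com/", tag="cat").run()
print(images[0].url)

try:
    IncompleteParser("https://example.com/")
except TypeError as err:
    print("Cannot instantiate:", type(err).__name__)
```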

Inherit a Parser Class

Parser Class

For most tasks, you can just inherit the Parser class.

You can use the attribute functions it provides to simplify your programs, especially the functions for fetching web pages, such as request_page_content() and threading_request_page_content().

All parameters and attribute functions are listed here:

class image_crawler_utils.Parser(station_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''))[source]

Bases: ABC

A Parser includes several basic functions.

Parameters:
classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]

Load the parser from .pkl file.

ATTENTION: You should use the corresponding Parser class when loading. For example, a DanbooruKeywordParser should be loaded with DanbooruKeywordParser.load_from_pkl().

Parameters:
Returns:

A CrawlerSettings class loaded from pkl file, or None if failed.

Return type:

CrawlerSettings

display_all_configs()[source]

Display all config info. Dataclasses will be displayed in a neater way.

get_cloudflare_cookies(url=None, headless=False, timeout=60, save_cookies_file=None, try_clicking=False)[source]

Bypass Cloudflare check and get its cookies.

Parameters:
  • url (str) – The URL used to obtain the Cloudflare cookies. If set to None (default), the station_url of this class is used.

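  • headless (bool) – Whether to run the browser in headless mode, i.e. without displaying a window. The default is False, which displays a window; keeping it visible is recommended in case you need to pass the Cloudflare check manually.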

  • save_cookies_file (str, None) – Path to save the new cookies. Defaults to None, meaning the cookies are not saved.

  • timeout (float) – Time limit in seconds for finishing the Cloudflare test.

  • try_clicking (bool) – Try to repeatedly click the verification box. MAY CAUSE THE WEBSITE TO GET STUCK IN THE VERIFICATION PAGE.

nodriver_request_page_content(url, browser=None, headless=True, is_json=False, thread_delay=None, page_stay_time=None)[source]

Download webpage content with nodriver.

For those sites having strong anti-crawling measures, try using this function to bypass them.

Parameters:
  • url (str) – The URL of the page to download.

  • browser (nodriver.Browser, None) – An existing browser instance to use. If set to None, a new browser is started for this request.

  • headless (bool) – Whether to set the browser in headless mode. Default set to True. Only works when browser is None.

  • is_json (bool) – Whether the result is a JSON text. Default set to False.

  • thread_delay (float, Callable, None) – Delay before the thread starts running. Defaults to None. Useful for websites like Pixiv, which limit the number of requests in a certain period of time.

  • page_stay_time (float, None) – Force the page to stay open for page_stay_time seconds so that it can be fully loaded. Defaults to None, meaning no time restriction.

Returns:

The HTML content of the webpage.

nodriver_threading_request_page_content(url_list, restriction_num=None, is_json=False, thread_delay=None, batch_num=None, batch_delay=0.0, headless=True, deconstruct_browser=False, page_stay_time=None)[source]

Download the content of multiple webpages using asynchronous coroutines (similar to threads) with nodriver.

For those sites having strong anti-crawling measures, try using this function to bypass them.

Parameters:
  • url_list (list[str]) – The list of URLs of the pages to download.

  • restriction_num (int, None) – Only download the first restriction_num pages. Defaults to None, meaning no restriction.

  • is_json (bool or Iterable instance) – Whether the result is a JSON text. Can be a bool or an iterable object with the same length as url_list. Defaults to False.

  • thread_delay (float, Callable, None) – Delay before the thread starts running. Defaults to None. Useful for websites like Pixiv, which limit the number of requests in a certain period of time.

  • batch_num (int) – Number of pages in each batch. Use it together with batch_delay to wait a certain period of time after downloading each batch. Useful for websites like Pixiv, which limit the number of requests in a certain period of time.

  • batch_delay (float, Callable) – Delay time (in seconds) after each batch is downloaded. Useful for websites like Pixiv, which limit the number of requests in a certain period of time.

  • headless (bool) – Whether to run the browser without displaying a window. Defaults to True; setting it to False is helpful for debugging and for bypassing some anti-crawling measures.

  • deconstruct_browser (bool) – Whether to deconstruct all instances and clear caches upon finishing. Can improve performance in restricted environments.

  • page_stay_time (float, None) – Force the page to stay open for page_stay_time seconds so that it can be fully loaded. Defaults to None, meaning no time restriction.

Returns:

A list of the HTML contents of the webpages. Its order is the same as the one of url_list.

Return type:

list[str]
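How restriction_num, batch_num and batch_delay relate to each other can be pictured with the sketch below. It is not the library's implementation: fetch_in_batches, resolve_delay and fake_fetch are hypothetical helpers, pages are fetched sequentially rather than concurrently, a Callable delay is assumed to take no arguments, and the pause is applied only between batches.

```python
import time
from typing import Callable, Optional, Union

Delay = Union[float, Callable[[], float]]


def resolve_delay(delay: Optional[Delay]) -> float:
    """Turn a float, a zero-argument callable, or None into seconds."""
    if delay is None:
        return 0.0
    return float(delay() if callable(delay) else delay)


def fetch_in_batches(url_list, fetch, restriction_num=None, batch_num=None,
                     batch_delay=0.0, sleep=time.sleep):
    urls = url_list[:restriction_num]   # None keeps the whole list
    size = batch_num or len(urls) or 1  # None means one single batch
    results = []
    for start in range(0, len(urls), size):
        results.extend(fetch(url) for url in urls[start:start + size])
        if start + size < len(urls):
            sleep(resolve_delay(batch_delay))  # pause before the next batch
    return results


def fake_fetch(url):
    # Placeholder for a real page download.
    return f"<html>{url}</html>"


pauses = []  # records the pauses instead of actually sleeping
pages = fetch_in_batches(
    [f"https://example.com/p{i}" for i in range(5)],
    fake_fetch,
    restriction_num=4,
    batch_num=2,
    batch_delay=lambda: 1.5,
    sleep=pauses.append,
)
print(len(pages), pauses)
```

Passing sleep=pauses.append only serves to make the pauses visible; with the default time.sleep the function would really wait between batches.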

request_page_content(url, session=<requests.Session object>, headers=<image_crawler_utils.utils.Empty object>, thread_delay=None)[source]

Download webpage content.

Parameters:
  • url (str) – The URL of the page to download.

  • session (requests module or requests.Session) – The object used to send the request. Can be the requests module itself (from import requests) or a requests.Session() instance.

  • headers (dict, Callable, None) – Headers for the current request, if you need to specify them. If set to None (default), the headers from self.crawler_settings.download_config.result_headers are used.

  • thread_delay (None | float | Callable) – Delay before the thread starts running. Defaults to None. Useful for websites like Pixiv, which limit the number of requests in a certain period of time.

Returns:

The HTML content of the webpage.

Return type:

str

abstractmethod run()[source]

MUST BE OVERRIDDEN. Generate a list of ImageInfo, containing image URLs, names and infos.

Return type:

list[ImageInfo]

save_to_pkl(pkl_file)[source]

Save the parser in a .pkl file.

Parameters:
  • path (str) – Path to save the pkl file. Default is saving to the current path.

  • pkl_file (str, None) – Name of the pkl file. (Suffix is optional.)

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None
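A save and load round trip might look like the sketch below. DemoParser is a hypothetical stand-in built on the standard pickle module, not the library's implementation, and the stored state is an assumption. It illustrates one plausible reason for the advice to load with the corresponding class: the class on which load_from_pkl() is called determines the type of the restored object.

```python
import os
import pickle
import tempfile


class DemoParser:
    # Hypothetical stand-in; the real Parser stores more state.
    def __init__(self, station_url):
        self.station_url = station_url

    def save_to_pkl(self, pkl_file):
        # Append the suffix when it is missing ("Suffix is optional").
        if not pkl_file.endswith(".pkl"):
            pkl_file += ".pkl"
        with open(pkl_file, "wb") as f:
            pickle.dump(self.__dict__, f)
        return os.path.basename(pkl_file), os.path.abspath(pkl_file)

    @classmethod
    def load_from_pkl(cls, pkl_file):
        with open(pkl_file, "rb") as f:
            state = pickle.load(f)
        parser = cls.__new__(cls)  # the calling class decides the type
        parser.__dict__.update(state)
        return parser


with tempfile.TemporaryDirectory() as tmp:
    parser = DemoParser("https://example.com/")
    name, path = parser.save_to_pkl(os.path.join(tmp, "demo_parser"))
    loaded = DemoParser.load_from_pkl(path)

print(name, type(loaded).__name__, loaded.station_url)
```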

threading_request_page_content(url_list, restriction_num=None, session=<requests.Session object>, headers=<image_crawler_utils.utils.Empty object>, thread_delay=None, batch_num=None, batch_delay=0.0)[source]

Download the content of multiple webpages using threading.

Parameters:
  • url_list (list[str]) – The list of URLs of the pages to download.

  • restriction_num (int, None) – Only download the first restriction_num pages. Defaults to None, meaning no restriction.

  • session (requests module or requests.Session) – The object used to send the requests. Can be the requests module itself (from import requests) or a requests.Session() instance.

  • headers (dict, list, Callable, None) – Headers for the current threading requests, if you need to specify them. If set to None (default), the headers from self.crawler_settings.download_config.result_headers are used. If it is a list, it should have the same length as url_list, and url_list[i] will use the headers in headers[i]; each element of the list can be a dict or a function.

  • thread_delay (float, Callable, None) – Delay before the thread starts running. Defaults to None. Useful for websites like Pixiv, which limit the number of requests in a certain period of time.

  • batch_num (int | None) – Number of pages in each batch. Use it together with batch_delay to wait a certain period of time after downloading each batch. Useful for websites like Pixiv, which limit the number of requests in a certain period of time.

  • batch_delay (float | Callable) – Delay time (in seconds) after each batch is downloaded. Useful for websites like Pixiv, which limit the number of requests in a certain period of time.

Returns:

A list of the HTML contents of the webpages. Its order is the same as the one of url_list.

Return type:

list[str]
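The different forms accepted by headers (None, a single dict, a function, or a per-URL list) can be illustrated with the following sketch. headers_for is a hypothetical helper, not part of the library, and it assumes that header functions take no arguments.

```python
def headers_for(index, url_list, headers, default_headers):
    """Pick the headers that would be used for url_list[index]."""
    if headers is None:
        chosen = default_headers          # fall back to the configured headers
    elif isinstance(headers, list):
        if len(headers) != len(url_list):
            raise ValueError("headers list must match url_list in length")
        chosen = headers[index]           # url_list[i] uses headers[i]
    else:
        chosen = headers                  # one dict or function for all URLs
    # A list element or the single value may be a function returning a dict.
    return chosen() if callable(chosen) else chosen


urls = ["https://example.com/a", "https://example.com/b"]
per_url = [{"Referer": "https://example.com/"}, lambda: {"User-Agent": "demo"}]

print(headers_for(0, urls, per_url, default_headers={}))
print(headers_for(1, urls, per_url, default_headers={}))
```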

KeywordParser Class

If your task is to download from the result of searching with a query string, it is recommended to inherit KeywordParser, which inherits the base Parser class with several parameters and attribute functions defined specifically for this purpose.

  • For parameters and attribute functions in the original Parser class, please read the documentation above.

class image_crawler_utils.KeywordParser(station_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, standard_keyword_string=None, keyword_string=None, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''), accept_empty=False)[source]

Bases: Parser

A Parser for fetching results from a keyword search.

Parameters:
  • station_url (str) –

    The URL of the main page of a website.

    • This parameter works when several websites use the same structure. For example, https://yande.re/ and https://konachan.com/ both use Moebooru to build their websites, and this parameter must be filled to deal with these sites respectively.

    • For websites like https://www.pixiv.net/, as no other website uses its structure, this parameter has already been initialized and does not need to be filled.

  • crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.

  • standard_keyword_string (str) – Query keyword string using standard syntax. Refer to the documentation for detailed instructions.

  • keyword_string (str, None) –

    If you want to directly specify the keywords used in searching, set keyword_string to a custom non-empty string. It will OVERWRITE standard_keyword_string.

    • For example, setting keyword_string to "kuon_(utawarerumono) rating:safe" in DanbooruKeywordParser means searching Danbooru directly with this string; its standard keyword string equivalent is "kuon_(utawarerumono) AND rating:safe".

  • cookies (image_crawler_utils.Cookies, list, dict, str, None) –

    Cookies used in loading websites.

  • accept_empty (bool) – If set to False (default), a critical error will be thrown when both standard_keyword_string and keyword_string are empty strings (like '' or ' '). If set to True, no error will be thrown and the parameters are accepted.

display_all_configs()[source]

Display all config info. Dataclasses will be displayed in a neater way.

generate_standard_keyword_string(keyword_tree=None)[source]

Generate a standard keyword string.

The generated result may differ from the standard_keyword_string input.

Parameters:

keyword_tree (KeywordLogicTree | None) –

The KeywordLogicTree from which the standard keyword string will be built. If set to None (default), the KeywordLogicTree generated from the standard_keyword_string parameter is used.

  • ATTENTION: When set to None, the resulting standard keyword string may not be exactly the same as standard_keyword_string.

Returns:

A standard keyword string.

abstractmethod run()[source]

Generate a list of ImageInfo, containing image urls, names and infos by crawling the website.

MUST BE OVERRIDDEN if inherited from Parser or KeywordParser class.

Return type:

list[ImageInfo]

Completing the Parser Structure

About the Usage of CrawlerSettings Class

The parameters passed into the CrawlerSettings class are arranged as follows:

  • image_num, capacity and page_num will be stored in CrawlerSettings().capacity_count_config, which is a CapacityCountConfig class.
    • To read such a parameter in a Parser that received the crawler_settings parameter, write code like:

    self.crawler_settings.capacity_count_config.image_num
    
  • headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_time and overwrite_images will be stored in CrawlerSettings().download_config, which is a DownloadConfig class.
    • To read such a parameter in a Parser that received the crawler_settings parameter, write code like:

    self.crawler_settings.download_config.headers
    
  • debug_config and detailed_console_log will be used to set up CrawlerSettings().log, which is a Log class that controls logging information.
    • If you use .set_logging_file() to set the logging file, CrawlerSettings().log will be changed accordingly.

    • To log information in your custom Parser, write code like:

    self.crawler_settings.log.info("LOGGING INFO")
    
    • For detailed information, check out the documentation of Log class.

  • extra_configs will be stored in CrawlerSettings().extra_configs.
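The layout described above can be pictured with the sketch below. All classes are simplified stand-ins holding only a few of the listed parameters, and their defaults are assumptions; only the attribute paths (capacity_count_config.image_num, download_config.headers, extra_configs) follow the description. DemoParser and limit_urls are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CapacityCountConfig:  # stand-in, subset of fields
    image_num: Optional[int] = None
    page_num: Optional[int] = None


@dataclass
class DownloadConfig:  # stand-in, subset of fields
    headers: Optional[dict] = None
    thread_num: Optional[int] = None


class CrawlerSettings:  # stand-in: flat arguments are sorted into sub-configs
    def __init__(self, image_num=None, page_num=None, headers=None,
                 thread_num=None, extra_configs=None):
        self.capacity_count_config = CapacityCountConfig(image_num, page_num)
        self.download_config = DownloadConfig(headers, thread_num)
        self.extra_configs = extra_configs or {}


class DemoParser:  # hypothetical parser that only reads its settings
    def __init__(self, crawler_settings):
        self.crawler_settings = crawler_settings

    def limit_urls(self, urls):
        # Read a capacity parameter through the nested config object.
        image_num = self.crawler_settings.capacity_count_config.image_num
        return urls if image_num is None else urls[:image_num]


settings = CrawlerSettings(image_num=2, headers={"User-Agent": "demo"},
                           extra_configs={"site_token": "abc"})
demo = DemoParser(settings)
print(demo.limit_urls(["u1", "u2", "u3"]))
print(demo.crawler_settings.download_config.headers)
print(demo.crawler_settings.extra_configs["site_token"])
```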

KeywordParser Tips

If you inherit KeywordParser, the first recommended step is to write a function (like .generate_keyword_string()) which converts the Standard Keyword String (stored in .standard_keyword_string) to the query string for your task and stores it in .keyword_string.

  1. Call super().__init__() in the __init__() function of the inherited class to generate the self.keyword_tree attribute, which is an image_crawler_utils.keyword.KeywordLogicTree.
  2. Write the function that generates the .keyword_string attribute from self.keyword_tree.

Also, to be consistent with the preset KeywordParsers, it is suggested to prefer .keyword_string over the string converted from .standard_keyword_string whenever .keyword_string is not empty, and to raise an error only if both parameters are empty.
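This precedence can be sketched as a small helper. choose_keyword_string is hypothetical and not part of the library; raising ValueError stands in for the critical error mentioned for accept_empty, and returning an empty string when accept_empty is True is an assumption.

```python
from typing import Optional


def choose_keyword_string(keyword_string: Optional[str],
                          converted_standard: Optional[str],
                          accept_empty: bool = False) -> str:
    """Prefer keyword_string; fall back to the converted standard string."""

    def is_empty(text: Optional[str]) -> bool:
        return text is None or text.strip() == ""

    if not is_empty(keyword_string):
        return keyword_string       # a custom query string wins
    if not is_empty(converted_standard):
        return converted_standard   # otherwise use the converted string
    if accept_empty:
        return ""                   # both empty, but explicitly allowed
    raise ValueError("Both keyword strings are empty.")


print(choose_keyword_string("kuon_(utawarerumono) rating:safe",
                            "(kuon_(utawarerumono) rating:safe)"))
print(choose_keyword_string(None, "(kuon_(utawarerumono) rating:safe)"))
```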

An example (from DanbooruKeywordParser):

def __init__(
    self,
    station_url: str="https://danbooru.donmai.us/",
    crawler_settings: CrawlerSettings=CrawlerSettings(),
    standard_keyword_string: Optional[str]=None,
    keyword_string: Optional[str]=None,
    cookies: Optional[Union[Cookies, list, dict, str]]=Cookies(),
    replace_url_with_source_level: str="None",
    use_keyword_include: bool=False,
):

    super().__init__(
        station_url=station_url,
        crawler_settings=crawler_settings,
        standard_keyword_string=standard_keyword_string,
        keyword_string=keyword_string,
        cookies=cookies,
    )
    self.replace_url_with_source_level = replace_url_with_source_level.lower()
    self.use_keyword_include = use_keyword_include


# Generate keyword string from keyword tree
def __build_keyword_str(self, tree: KeywordLogicTree) -> str:
    # Recursively convert both children into Danbooru query strings
    if isinstance(tree.lchild, str):
        res1 = tree.lchild
    else:
        res1 = self.__build_keyword_str(tree.lchild)
    if isinstance(tree.rchild, str):
        res2 = tree.rchild
    else:
        res2 = self.__build_keyword_str(tree.rchild)

    if tree.logic_operator == "AND":
        return f'({res1} {res2})'
    elif tree.logic_operator == "OR":
        return f'({res1} or {res2})'
    elif tree.logic_operator == "NOT":
        return f'(-{res2})'
    elif tree.logic_operator == "SINGLE":
        return f'{res2}'


# Basic keyword string
def generate_keyword_string(self) -> str:
    self.keyword_string = self.__build_keyword_str(self.keyword_tree)
    return self.keyword_string
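To see what the recursive conversion in the excerpt produces, the sketch below applies the same logic to a small tree. KeywordLogicTree here is a minimal stand-in with only the three attributes the excerpt reads (lchild, rchild, logic_operator); how the real class is constructed, and the empty-string default for lchild, are assumptions. build_keyword_str is a module-level copy of the method, with an explicit error added for unknown operators.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class KeywordLogicTree:
    # Stand-in with only the attributes read by the conversion.
    logic_operator: str
    rchild: Union[str, "KeywordLogicTree"]
    lchild: Union[str, "KeywordLogicTree"] = ""  # unused for NOT / SINGLE


def build_keyword_str(tree: KeywordLogicTree) -> str:
    # Leaves are plain strings; inner nodes are converted recursively.
    res1 = tree.lchild if isinstance(tree.lchild, str) else build_keyword_str(tree.lchild)
    res2 = tree.rchild if isinstance(tree.rchild, str) else build_keyword_str(tree.rchild)

    if tree.logic_operator == "AND":
        return f"({res1} {res2})"
    if tree.logic_operator == "OR":
        return f"({res1} or {res2})"
    if tree.logic_operator == "NOT":
        return f"(-{res2})"
    if tree.logic_operator == "SINGLE":
        return f"{res2}"
    raise ValueError(f"Unknown operator: {tree.logic_operator}")


# Tree for: kuon_(utawarerumono) AND NOT rating:explicit
tree = KeywordLogicTree(
    logic_operator="AND",
    lchild="kuon_(utawarerumono)",
    rchild=KeywordLogicTree(logic_operator="NOT", rchild="rating:explicit"),
)
query = build_keyword_str(tree)
print(query)  # (kuon_(utawarerumono) (-rating:explicit))
```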