image_crawler_utils package

class image_crawler_utils.Cookies(cookies=None)[source]

Bases: object

Convert cookies between the selenium, nodriver, requests and string formats.

Use Cookies(cookies_from_certain_source) or Cookies.load_from_json() to create a Cookies instance.

Use .cookies_nodriver / .cookies_selenium / .cookies_dict / .cookies_string to get the cookies of suitable format.

Parameters:

cookies (list, dict, str, None) –

The source cookies: a string, a dict (requests), or a list (selenium or nodriver).

  • Leaving this blank (as in Cookies()) creates an empty Cookies object, whose .is_none() returns True.
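The conversions this class performs can be pictured with a small standard-library sketch. The helper names below (dict_to_cookie_string, cookie_string_to_dict, selenium_list_to_dict) are hypothetical and are not part of image_crawler_utils. The sketch only assumes the usual shape of each format: a "name=value; name2=value2" header string, a flat dict as used by requests, and a list of dicts with "name" and "value" keys as returned by selenium.

```python
def dict_to_cookie_string(cookies: dict) -> str:
    """Join a requests-style dict into a 'k=v; k2=v2' header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def cookie_string_to_dict(cookie_string: str) -> dict:
    """Split a 'k=v; k2=v2' string into a requests-style dict."""
    result = {}
    for part in cookie_string.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # Skip empty or malformed fragments
        name, value = part.split("=", 1)  # Values may themselves contain '='
        result[name.strip()] = value.strip()
    return result


def selenium_list_to_dict(selenium_cookies: list) -> dict:
    """Reduce selenium-style cookie dicts to a flat name -> value mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Hypothetical sample data in selenium form
selenium_cookies = [
    {"name": "session", "value": "abc123", "domain": ".example.com"},
    {"name": "lang", "value": "en", "domain": ".example.com"},
]
as_dict = selenium_list_to_dict(selenium_cookies)
as_string = dict_to_cookie_string(as_dict)
```

In practice the Cookies class does this for you; the sketch is only meant to show what information each format carries.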

classmethod load_from_json(json_file, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Load the Cookies from a json file.

ONLY WORKS IF the info can be JSON serialized.

Parameters:
  • json_file (str) – Name / path of json file. Suffix (.json) must be included.

  • encoding (str) – Encoding of JSON file.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

The Cookies, or None if failed.

Return type:

Cookies | None

is_none()[source]

Check whether Cookies is empty (created by None, “”, etc.).

Returns:

A bool, telling whether the Cookies object is empty.

Return type:

bool

save_to_json(json_file, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Save the Cookies into a json file.

Parameters:
  • json_file (str) – Name / path of json file. (Suffix is optional.)

  • encoding (str) – Encoding of JSON file.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

update_nodriver_cookies(old_nodriver_cookies)[source]

Update nodriver-form cookies. NOT SUGGESTED TO BE USED DIRECTLY.

For every cookie in the input that has the same name as a cookie in the Cookies object, replace its value with the value from the Cookies object.

Cookies in the Cookies object that do not exist in the input cookies are also added.

Parameters:

old_nodriver_cookies (list[nodriver.cdp.network.Cookie]) – Cookies from nodriver.

Returns:

New nodriver cookies (a list[nodriver.cdp.network.Cookie]).

update_selenium_cookies(old_selenium_cookies)[source]

Update selenium-form cookies.

For every cookie in the input that has the same name as a cookie in the Cookies object, replace its value with the value from the Cookies object.

Cookies in the Cookies object that do not exist in the input cookies are also added.

Parameters:

old_selenium_cookies (list[dict]) – Cookies from selenium.

Returns:

New selenium cookies (a list[dict]).
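The update rule described above can be sketched with plain lists of dicts. This is an illustration only, not the library's implementation: merge_selenium_cookies is a hypothetical name, and the sketch assumes that "replacing the values" means overwriting the "value" field while keeping the other fields of the old cookie.

```python
def merge_selenium_cookies(old_cookies: list, own_cookies: list) -> list:
    """Overwrite matching cookies by name, then append the ones that are new."""
    own_by_name = {cookie["name"]: cookie for cookie in own_cookies}
    merged = []
    seen = set()
    for cookie in old_cookies:
        updated = dict(cookie)  # Copy so the input list is not mutated
        if cookie["name"] in own_by_name:
            updated["value"] = own_by_name[cookie["name"]]["value"]
        merged.append(updated)
        seen.add(cookie["name"])
    for name, cookie in own_by_name.items():
        if name not in seen:
            merged.append(dict(cookie))  # Cookie missing from the input
    return merged


old = [
    {"name": "session", "value": "stale", "domain": ".example.com"},
    {"name": "theme", "value": "dark", "domain": ".example.com"},
]
own = [
    {"name": "session", "value": "fresh"},
    {"name": "token", "value": "xyz"},
]
merged = merge_selenium_cookies(old, own)
```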

cookies_dict: dict | None

Cookies in dict form. Mostly for requests module usage.

This form of cookies can be generated by requests-related functions and classes, or other cookie functions that generates a dict, etc.

A generation example is like:

import requests
from image_crawler_utils import Cookies

session = requests.Session()
# Some process that adds cookies to session
requests_cookies = session.cookies.get_dict()  # A dict
cookies = Cookies(requests_cookies)
cookies_nodriver: list[Cookie] | None

Cookies in nodriver form.

This form of cookies can be generated by nodriver-related functions and classes, etc.

A generation example is like:

import nodriver
from image_crawler_utils.utils import set_up_nodriver_browser
from image_crawler_utils import Cookies

async def nodriver_func():
    browser = await set_up_nodriver_browser()
    tab = await browser.get('https://foo.bar.com')
    # Some other process
    nodriver_cookies = await browser.cookies.get_all()
    return nodriver_cookies

nodriver_cookies = nodriver.loop().run_until_complete(nodriver_func())
cookies = Cookies(nodriver_cookies)
cookies_selenium: list[dict] | None

Cookies in selenium form.

This form of cookies can be generated by selenium-related functions and classes, etc.

A generation example is like:

from selenium import webdriver
from image_crawler_utils import Cookies

chrome_driver_path = '/path/to/chromedriver'
# Note: executable_path is only accepted by older selenium releases
chrome_browser = webdriver.Chrome(executable_path=chrome_driver_path)
chrome_browser.get('https://foo.bar.com')
# Some other process
selenium_cookies = chrome_browser.get_cookies()  # A list of dicts
cookies = Cookies(selenium_cookies)
cookies_string: str | None

Cookies in string form.

This form of cookies can be acquired by using Developer Mode (F12) in some browsers, etc.

A generation example is like:

from image_crawler_utils import Cookies

cookies = Cookies("your_cookies_string")
class image_crawler_utils.CrawlerSettings(capacity_count_config=<image_crawler_utils.utils.Empty object>, download_config=<image_crawler_utils.utils.Empty object>, debug_config=DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True), image_num=None, capacity=None, page_num=None, headers=None, proxies=None, thread_delay=5, fail_delay=3, randomize_delay=True, thread_num=5, timeout=30.0, max_download_time=None, retry_times=5, overwrite_images=True, detailed_console_log=False, extra_configs=None)[source]

Bases: object

A general framework of settings for running a crawler.

Parameters:
  • capacity_count_config (image_crawler_utils.configs.CapacityCountConfig, None) –

    Contains configs that restrict the number and total size of downloads.

    • If this parameter is used, the image_num, capacity and page_num parameters will be ignored.

  • download_config (image_crawler_utils.configs.DownloadConfig, None) –

    Contains configs about parameters in downloading.

    • If this parameter is used, the headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_times and overwrite_images parameters will be ignored.

  • debug_config (image_crawler_utils.configs.DebugConfig, None) – Contains configs that define which types of messages are shown on the console.

  • image_num (int, None) – Number of images to be parsed / downloaded in total; None means no restriction.

  • capacity (float, None) – Total size of images (MB); None means no restriction.

  • page_num (int, None) – Number of gallery pages to detect images in total; None means no restriction.

  • headers (dict, Callable, None) – Headers settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.

  • proxies (dict, Callable, None) – Proxy settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.

  • thread_delay (float) – Waiting time (s) after thread start.

  • fail_delay (float) – Waiting time (s) after failing.

  • randomize_delay (bool) – Randomize delay time between 0 and delay_time.

  • thread_num (int) – Number of downloading threads.

  • timeout (float, None) – Timeout for requests. Setting it to None means no timeout.

  • max_download_time (float, None) – Maximum download time for an image. Setting it to None means no limit.

  • retry_times (int) – Number of download retries.

  • overwrite_images (bool) – Overwrite existing images.

  • detailed_console_log (bool) –

    Logging detailed information onto the console.

    • It means that when logging info to the console, always log msg (the messages logged into files) even if output_msg exists.

  • extra_configs (dict, None) – This optional dict is not used in any of the supported sites and crawling tasks, as it is reserved for developing your custom image crawler.
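Two of the parameters above can be illustrated with a short standard-library sketch: headers / proxies may be a dict or a function that is called at every usage, and randomize_delay draws the actual delay between 0 and the configured value. The helper names resolve_headers and pick_delay are hypothetical, and the use of a uniform distribution is an assumption; the library's own code may differ.

```python
import random
from typing import Callable, Optional, Union

HeadersLike = Union[dict, Callable[[], dict], None]


def resolve_headers(headers: HeadersLike) -> Optional[dict]:
    """Return a dict (or None); a callable is invoked on every use."""
    if callable(headers):
        return headers()
    return headers


def pick_delay(delay_time: float, randomize_delay: bool) -> float:
    """Return the delay to wait, optionally randomized in [0, delay_time]."""
    if randomize_delay:
        return random.uniform(0, delay_time)  # Assumed: uniform distribution
    return delay_time


call_count = 0


def rotating_headers() -> dict:
    """Hypothetical header factory that changes on each call."""
    global call_count
    call_count += 1
    return {"User-Agent": f"example-agent/{call_count}"}


first = resolve_headers(rotating_headers)
second = resolve_headers(rotating_headers)
fixed = resolve_headers({"User-Agent": "static-agent"})
```

Passing a function is useful when the headers should change between requests, for example to rotate a User-Agent.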

classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]

Load CrawlerSettings from .pkl file.

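Parameters:
  • pkl_file (str) – Name / path of the pkl file.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns: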

A CrawlerSettings instance loaded from the pkl file, or None if failed.

Return type:

CrawlerSettings | None

browser_test(url, headless=True, stay_time=30)[source]

Test whether browser works normally.

ATTENTION: This function DOES NOT check the connectivity of the URL. Use connectivity_test() instead.

Parameters:
  • url (str) – Test connectivity using this URL.

  • headless (bool) –

    Whether to hide the browser window when testing chromedriver. With a visible window you can get a quick glimpse of whether the page is loaded correctly.

    • Setting headless to False pops up the browser window and displays the whole process of loading the webpage.

  • stay_time (float) – If set to not headless, the window will stay for stay_time seconds.

Returns:

A bool. Successful connection returns True, in other cases returns False.

connectivity_test(url)[source]

Test connectivity of internet.

Uses the configs in download_config.

Parameters:

url (str) – Test connectivity using this URL.

Returns:

A bool. Successful connection returns True, in other cases returns False.

copy(capacity_count_config=<image_crawler_utils.utils.Empty object>, download_config=<image_crawler_utils.utils.Empty object>, debug_config=<image_crawler_utils.utils.Empty object>, image_num=<image_crawler_utils.utils.Empty object>, capacity=<image_crawler_utils.utils.Empty object>, page_num=<image_crawler_utils.utils.Empty object>, headers=<image_crawler_utils.utils.Empty object>, proxies=<image_crawler_utils.utils.Empty object>, thread_delay=<image_crawler_utils.utils.Empty object>, fail_delay=<image_crawler_utils.utils.Empty object>, randomize_delay=<image_crawler_utils.utils.Empty object>, thread_num=5, timeout=<image_crawler_utils.utils.Empty object>, max_download_time=<image_crawler_utils.utils.Empty object>, retry_times=<image_crawler_utils.utils.Empty object>, overwrite_images=<image_crawler_utils.utils.Empty object>, extra_configs=<image_crawler_utils.utils.Empty object>)[source]

Generate a copy of a CrawlerSettings, with (optional) parameters changed.

Parameters:
  • capacity_count_config (image_crawler_utils.configs.CapacityCountConfig, None) –

    Contains configs that restrict the number and total size of downloads.

    • If this parameter is used, the image_num, capacity and page_num parameters will be ignored.

  • download_config (image_crawler_utils.configs.DownloadConfig, None) –

    Contains configs about parameters in downloading.

    • If this parameter is used, the headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_times and overwrite_images parameters will be ignored.

  • debug_config (image_crawler_utils.configs.DebugConfig, None) – Contains configs that define which types of messages are shown on the console.

  • image_num (int, None) – Number of images to be parsed / downloaded in total; None means no restriction.

  • capacity (float, None) – Total size of images (MB); None means no restriction.

  • page_num (int, None) – Number of gallery pages to detect images in total; None means no restriction.

  • headers (dict, Callable, None) – Headers settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.

  • proxies (dict, Callable, None) – Proxy settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.

  • thread_delay (float) – Waiting time (s) after thread start.

  • fail_delay (float) – Waiting time (s) after failing.

  • randomize_delay (bool) – Randomize delay time between 0 and delay_time.

  • thread_num (int) – Number of downloading threads.

  • timeout (float, None) – Timeout for requests. Setting it to None means no timeout.

  • max_download_time (float, None) – Maximum download time for an image. Setting it to None means no limit.

  • retry_times (int) – Number of download retries.

  • overwrite_images (bool) – Overwrite existing images.

  • detailed_console_log (bool) –

    Logging detailed information onto the console.

    • It means that when logging info to the console, always log msg (the messages logged into files) even if output_msg exists.

  • extra_configs (dict, None) – This optional dict is not used in any of the supported sites and crawling tasks, as it is reserved for developing your custom image crawler.
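Most defaults in the copy() signature are an Empty object rather than None. A plausible reason, stated here as an assumption rather than as the library's documented design, is that None is itself a meaningful value (for example timeout=None means no timeout), so a separate sentinel is needed to express "keep the current value". The sketch below shows that pattern with hypothetical names (_Empty, EMPTY, SketchSettings).

```python
from dataclasses import dataclass, replace
from typing import Optional


class _Empty:
    """Marker type meaning 'argument not supplied'."""

    def __repr__(self) -> str:
        return "<Empty>"


EMPTY = _Empty()


@dataclass
class SketchSettings:
    """Hypothetical two-field stand-in for a settings object."""

    image_num: Optional[int] = None
    timeout: Optional[float] = 30.0

    def copy(self, image_num=EMPTY, timeout=EMPTY) -> "SketchSettings":
        """Return a copy, changing only the arguments that were supplied."""
        changes = {}
        if image_num is not EMPTY:
            changes["image_num"] = image_num
        if timeout is not EMPTY:
            changes["timeout"] = timeout  # None is a valid explicit value
        return replace(self, **changes)


original = SketchSettings(image_num=10, timeout=30.0)
no_timeout = original.copy(timeout=None)  # Explicitly disable the timeout
unchanged = original.copy()               # Nothing supplied, nothing changed
```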

dill_base64_sha256_data()[source]

Return the serialized bytes of current CrawlerSettings, base64 encoded str of the bytes, and sha256 str of the bytes.

Returns:

A dict like {“bytes”: dill.dumps() bytes, “base64”: base64 encoding of the dill.dumps() bytes, “sha256”: sha256 encoding of the base64 code}

Return type:

dict
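The shape of the returned dict can be reproduced with the standard library. The sketch below substitutes pickle for dill so that it runs without extra dependencies, and bytes_base64_sha256 is a hypothetical helper name; it follows the description above in hashing the base64 text rather than the raw bytes.

```python
import base64
import hashlib
import pickle


def bytes_base64_sha256(obj) -> dict:
    """Serialize obj, then derive a base64 string and its sha256 digest."""
    raw = pickle.dumps(obj)  # Stand-in for dill.dumps()
    b64 = base64.b64encode(raw).decode("ascii")
    digest = hashlib.sha256(b64.encode("ascii")).hexdigest()
    return {"bytes": raw, "base64": b64, "sha256": digest}


# Hypothetical settings payload
data = bytes_base64_sha256({"thread_num": 5, "timeout": 30.0})
```

The hex digest is a fixed-length identifier of the settings, which is why it is suitable as a default file name in save_to_pkl().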

display_all_configs()[source]

Display all config info.

Dataclasses will be displayed in a neater way.

save_to_pkl(pkl_file=None)[source]

Save the CrawlerSettings in a pkl file.

It is recommended to use the default file name, which uses the sha256 encoded string generated by dill_base64_sha256_data().

Parameters:
  • path (str) – Path to save the pkl file. Default is saving to the current path.

  • pkl_file (str, None) – Name of the pkl file. (Suffix is optional.) Default is using the sha256 encoded string generated by dill_base64_sha256_data().

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

set_logging_file(log_file, logging_level=10)[source]

Set the file to be logged into.

It is recommended to add a logging file when running the crawler, as the message displayed on the console is simplified and usually not complete.

PAY ATTENTION: You cannot set logging files when creating a class instance. Logging files can only be set with this function.

Parameters:
  • log_file (str) – Output name for the logging file. Suffix (.json) is optional. Setting it to None (Default) means no file is output.

  • logging_level (int) –

    Set the logging level of the LOGGING FILE. Select from: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL (import logging library first, or use the word string directly).

    • ATTENTION: It is independent of the level of logging onto the console!
      • The latter is controlled by the debug_config parameter, which in turn does not affect logging into files.

Returns:

The changed CrawlerSettings itself.

class image_crawler_utils.Downloader(image_info_list, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, store_path='./', image_info_filter=True, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''))[source]

Bases: object

Downloading images using threading method.

Parameters:
  • crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Downloader.

  • image_info_list (image_crawler_utils.ImageInfo) – A list of ImageInfo.

  • store_path (str) –

    Path to store images, or a list of storage paths respectively for every image.

    • Default is the current working directory.

    • If it is set to an iterable (such as a list), its length should be the same as that of image_info_list.

  • image_info_filter (callable, bool) –

    A callable function used to filter the images in the list of ImageInfo.

    • The function of image_info_filter should only accept 1 argument of ImageInfo type and returns True (download this image) or False (do not download this image), like:

      def filter_func(image_info: ImageInfo) -> bool:
          # Meet the conditions
          return True
          # Do not meet the conditions
          return False
      
    • If the function has other parameters, use lambda to bind them:

      image_info_filter=lambda info: filter_func(info, param1, param2, ...)
      
    • If you want to download all images in the ImageInfo list, set image_info_filter to True.

    • TIPS: If you want to search images with complex restrictions that the image station sites may not support (e.g. Images with many keywords and restrictions on the ratio between width and height), you can simplify the query with some keywords to get all images with Parsers, and filter them with your custom image_info_filter function.

  • cookies (image_crawler_utils.Cookies, str, dict, list, None) –

    Cookies used to access images from a website.

    • None means no cookies and works the same as Cookies().

    • Leaving this parameter blank works the same as None / Cookies().

    • TIPS: You can add corresponding cookies to Downloader if there are URLs of images only accessible with an account. For example, if you have saved Pixiv and Twitter / X cookies respectively in Pixiv_cookies.json and Twitter_cookies.json, then you can use cookies=Cookies.load_from_json("Pixiv_cookies.json") + Cookies.load_from_json("Twitter_cookies.json") to add both cookies to the Downloader.
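The filtering approach described for image_info_filter can be sketched without the library installed. SketchImageInfo below is a hypothetical stand-in that mirrors the four documented ImageInfo fields, and the "width" / "height" keys in info are assumptions; real Parsers define their own info structure.

```python
from dataclasses import dataclass, field


@dataclass
class SketchImageInfo:
    """Hypothetical stand-in with the same four fields as ImageInfo."""

    url: str
    name: str
    info: dict = field(default_factory=dict)
    backup_urls: list = field(default_factory=list)


def ratio_filter(image_info, min_ratio: float, max_ratio: float) -> bool:
    """Keep images whose width / height ratio lies in [min_ratio, max_ratio]."""
    width = image_info.info.get("width")    # Assumed key
    height = image_info.info.get("height")  # Assumed key
    if not width or not height:
        return False  # Size unknown: do not download
    return min_ratio <= width / height <= max_ratio


# Bind the extra parameters so the filter takes exactly one argument
image_info_filter = lambda image_info: ratio_filter(image_info, 1.5, 2.0)

images = [
    SketchImageInfo("https://example.com/1", "wide.png", {"width": 1920, "height": 1080}),
    SketchImageInfo("https://example.com/2", "tall.png", {"width": 1080, "height": 1920}),
    SketchImageInfo("https://example.com/3", "unknown.png"),
]
kept = [image.name for image in images if image_info_filter(image)]
```

With the real classes, the same lambda would be passed as image_info_filter when constructing the Downloader.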

classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]

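Load the Downloader from a .pkl file.

Parameters:
  • pkl_file (str) – Name / path of the pkl file.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

A Downloader instance loaded from the pkl file, or None if failed.

Return type:

Downloader | None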

display_all_configs()[source]

Display all config info. Dataclasses will be displayed in a neater way.

run()[source]

Run the Threading Downloader Object.

Returns:

(Total size of image downloaded, Succeeded ImageInfo list, Failed ImageInfo list, Skipped ImageInfo list)

  • Total size of image downloaded: An int denoting the total size (in bytes) of images downloaded.

  • Succeeded ImageInfo list: A list of ImageInfo containing successfully downloaded images.

  • Failed ImageInfo list: A list of ImageInfo containing images failed to be downloaded.

  • Skipped ImageInfo list: A list of ImageInfo containing images skipped.

    • This list contains images that were filtered out by image_info_filter, images not downloaded because of the image_num restriction in image_crawler_utils.CrawlerSettings, and images skipped because they already exist while overwrite_images in DownloadConfig is set to False.

Return type:

tuple[int, list[ImageInfo], list[ImageInfo], list[ImageInfo]]

save_to_pkl(pkl_file)[source]

Save the Downloader with settings in a pkl file.

Parameters:
  • path (str) – Path to save the pkl file. Default is saving to the current path.

  • pkl_file (str, None) – Name of the pkl file. (Suffix is optional.)

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

class image_crawler_utils.ImageInfo(url, name, info=<factory>, backup_urls=<factory>)[source]

Bases: object

A class consisting of image URL, name, info and back up URLs.

Can be used to download images and write result to files.

Parameters:
  • url (str)

  • name (str)

  • info (dict)

  • backup_urls (Iterable[str])

backup_urls: Iterable[str]

If downloading from .url fails, the URLs in .backup_urls are tried instead.

info: dict

A dict, containing information of the image.

  • info will not affect Downloader directly. It only works if you set the image_info_filter parameter in the Downloader class.

  • Different sites may have different info structures which are defined respectively by their Parsers.

  • ATTENTION: If you define your own info structure, please ENSURE it can be JSON-serialized (e.g. the values of the dict should be int, float, str, list, dict, etc.) so that it is compatible with save_image_infos() and load_image_infos().
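One way to check the JSON requirement before building ImageInfo objects is to attempt a json.dumps() call. The helper name is_json_serializable and the sample dicts are hypothetical; the check itself relies only on the standard json module.

```python
import json


def is_json_serializable(info: dict) -> bool:
    """Return True if info can be written by the standard json module."""
    try:
        json.dumps(info)
    except (TypeError, ValueError):
        return False
    return True


good_info = {"id": 1, "tags": ["a", "b"], "size": {"width": 800, "height": 600}}
bad_info = {"id": 1, "tags": {"a", "b"}}  # A set is not JSON-serializable

good_ok = is_json_serializable(good_info)
bad_ok = is_json_serializable(bad_info)
```

Converting sets to lists, and other custom objects to plain dicts or strings, is usually enough to make an info structure pass this check.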

name: str

Name of the image when saved.

url: str

The URL used AT FIRST in downloading the image.

class image_crawler_utils.KeywordParser(station_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, standard_keyword_string=None, keyword_string=None, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''), accept_empty=False)[source]

Bases: Parser

A Parser for fetching result from keyword searching.

Parameters:
  • station_url (str) –

    The URL of the main page of a website.

    • This parameter works when several websites use the same structure. For example, https://yande.re/ and https://konachan.com/ both use Moebooru to build their websites, and this parameter must be filled to deal with these sites respectively.

    • For websites like https://www.pixiv.net/, as no other website uses its structure, this parameter has already been initialized and does not need to be filled.

  • crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.

  • standard_keyword_string (str) – Query keyword string using standard syntax. Refer to the documentation for detailed instructions.

  • keyword_string (str, None) –

    If you want to directly specify the keywords used in searching, set keyword_string to a custom non-empty string. It will OVERWRITE standard_keyword_string.

    • For example, set keyword_string to "kuon_(utawarerumono) rating:safe" in DanbooruKeywordParser means searching directly with this string in Danbooru, and its standard keyword string equivalent is "kuon_(utawarerumono) AND rating:safe".

  • cookies (image_crawler_utils.Cookies, list, dict, str, None) –

    Cookies used in loading websites.

  • accept_empty (bool) – If set to False (default), a critical error will be thrown when both standard_keyword_string and keyword_string are empty strings (like ‘’ or ‘ ‘). If set to True, no error will be thrown and the parameters are accepted.

display_all_configs()[source]

Display all config info. Dataclasses will be displayed in a neater way.

generate_standard_keyword_string(keyword_tree=None)[source]

Generate a standard keyword string.

The generated result may differ from the standard_keyword_string input.

Parameters:

keyword_tree (KeywordLogicTree | None) –

The KeywordLogicTree that a standard keyword string will be built from. Set to None (default) will use the KeywordLogicTree generated from the standard_keyword_string parameter.

  • ATTENTION: When set to None, the standard keyword string may not be exactly the same as standard_keyword_string.

Returns:

A standard keyword string.

abstractmethod run()[source]

Generate a list of ImageInfo, containing image urls, names and infos by crawling the website.

MUST BE OVERRIDDEN if inherited from Parser or KeywordParser class.

Return type:

list[ImageInfo]

class image_crawler_utils.Parser(station_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''))[source]

Bases: ABC

A Parser includes several basic functions.

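Parameters:
  • station_url (str) – The URL of the main page of a website.

  • crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.

  • cookies (image_crawler_utils.Cookies, list, dict, str, None) – Cookies used in loading websites.

classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]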

Load the parser from .pkl file.

ATTENTION: You should use the corresponding Parser class when loading. For example, loading DanbooruKeywordParser should use DanbooruKeywordParser.load_from_pkl().

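Parameters:
  • pkl_file (str) – Name / path of the pkl file.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

A Parser instance loaded from the pkl file, or None if failed.

Return type:

Parser | None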

display_all_configs()[source]

Display all config info. Dataclasses will be displayed in a neater way.

get_cloudflare_cookies(url=None, headless=False, timeout=60, save_cookies_file=None, try_clicking=False)[source]

Bypass Cloudflare check and get its cookies.

Parameters:
  • url (str) – Get Cloudflare cookies using this URL. Set to None (default) will use the station_url in this class.

  • headless (bool) – Whether to hide the browser window. Keeping it False (default) is recommended in case you need to manually bypass Cloudflare.

  • save_cookies_file (str, None) – Path to save the new cookies. Default set to None, meaning not saving cookies.

  • timeout (float) – Try to finish Cloudflare test in timeout seconds.

  • try_clicking (bool) – Try to repeatedly click the verification box. MAY CAUSE THE WEBSITE TO GET STUCK IN THE VERIFICATION PAGE.

nodriver_request_page_content(url, browser=None, headless=True, is_json=False, thread_delay=None, page_stay_time=None)[source]

Download webpage content with nodriver.

For those sites having strong anti-crawling measures, try using this function to bypass them.

Parameters:
  • url (str) – The URL of the page to download.

  • browser (nodriver.Browser, None) – An existing browser instance to use, if any.

  • headless (bool) – Whether to set the browser in headless mode. Default set to True. Only works when browser is None.

  • is_json (bool) – Whether the result is a JSON text. Default set to False.

  • thread_delay (float, Callable, None) – Delay before the thread runs. Default set to None. Used to deal with websites like Pixiv, which restrict the number of requests in a certain period of time.

  • page_stay_time (float, None) – Force the page to stay for page_stay_time seconds so that it can be fully loaded. Default set to None meaning no restrictions in time.

Returns:

The HTML content of the webpage.

nodriver_threading_request_page_content(url_list, restriction_num=None, is_json=False, thread_delay=None, batch_num=None, batch_delay=0.0, headless=True, deconstruct_browser=False, page_stay_time=None)[source]

Download multiple webpage content using asynchronous coroutines (similar to threads) with nodriver.

For those sites having strong anti-crawling measures, try using this function to bypass them.

Parameters:
  • url_list (list[str]) – The list of URLs of the page to download.

  • restriction_num (int, None) – Only download the first restriction_num number of pages. Set to None (default) meaning no restrictions.

  • is_json (bool or Iterable instance) – Whether the result is a JSON text. Can be a bool or an iterable object with the same length as url_list. Default set to False.

  • thread_delay (float, Callable, None) – Delay before the thread runs. Default set to None. Used to deal with websites like Pixiv, which restrict the number of requests in a certain period of time.

  • batch_num (int) – Number of pages in each batch; use it with batch_delay to wait a certain period of time after downloading each batch. Used to deal with websites like Pixiv, which restrict the number of requests in a certain period of time.

  • batch_delay (float, Callable) – Delay (seconds) after each batch is downloaded. Used to deal with websites like Pixiv, which restrict the number of requests in a certain period of time.

  • headless (bool) – Whether to hide the browser window. Default set to True; setting it to False is helpful for debugging and bypassing some anti-crawling measures.

  • deconstruct_browser (bool) – Whether to deconstruct all instances and clear caches upon finishing. Can improve performance in restricted environments.

  • page_stay_time (float, None) – Force the page to stay for page_stay_time seconds so that it can be fully loaded. Default set to None meaning no restrictions in time.

Returns:

A list of the HTML contents of the webpages. Its order is the same as the one of url_list.

Return type:

list[str]

request_page_content(url, session=<requests.Session object>, headers=<image_crawler_utils.utils.Empty object>, thread_delay=None)[source]

Download webpage content.

Parameters:
  • url (str) – The URL of the page to download.

  • session (the requests module, or requests.Session) – Either the requests module itself (from import requests) or a requests.Session() instance.

  • headers (dict, Callable, None) – If you need to specify headers for the current request, use this argument. Setting it to None (default) means using the headers from self.crawler_settings.download_config.result_headers

  • thread_delay (None | float | Callable) – Delay before the thread runs. Default set to None. Used to deal with websites like Pixiv, which restrict the number of requests in a certain period of time.

Returns:

The HTML content of the webpage.

Return type:

str

abstractmethod run()[source]

MUST BE OVERRIDDEN. Generate a list of ImageInfo, containing image urls, names and infos.

Return type:

list[ImageInfo]

save_to_pkl(pkl_file)[source]

Save the parser in a .pkl file.

Parameters:
  • path (str) – Path to save the pkl file. Default is saving to the current path.

  • pkl_file (str, None) – Name of the pkl file. (Suffix is optional.)

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

threading_request_page_content(url_list, restriction_num=None, session=<requests.Session object>, headers=<image_crawler_utils.utils.Empty object>, thread_delay=None, batch_num=None, batch_delay=0.0)[source]

Download multiple webpage content using threading.

Parameters:
  • url_list (list[str]) – The list of URLs of the page to download.

  • restriction_num (int, None) – Only download the first restriction_num number of pages. Set to None (default) meaning no restrictions.

  • session (the requests module, or requests.Session) – Either the requests module itself (from import requests) or a requests.Session() instance.

  • headers (dict, list, Callable, None) –

    If you need to specify headers for the current threading requests, use this argument. Setting it to None (default) means using the headers from self.crawler_settings.download_config.result_headers

    • If it is a list, it should be of the same length as url_list, and for url_list[i] it will use the headers in headers[i]. Each element in this list can be a dict or a function.

  • thread_delay (float, Callable, None) – Delay before the thread runs. Default set to None. Used to deal with websites like Pixiv, which restrict the number of requests in a certain period of time.

  • batch_num (int | None) – Number of pages in each batch; use it with batch_delay to wait a certain period of time after downloading each batch. Used to deal with websites like Pixiv, which restrict the number of requests in a certain period of time.

  • batch_delay (float | Callable) – Delay (seconds) after each batch is downloaded. Used to deal with websites like Pixiv, which restrict the number of requests in a certain period of time.

Returns:

A list of the HTML contents of the webpages. Its order is the same as the one of url_list.

Return type:

list[str]
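The interplay of restriction_num, batch_num and batch_delay can be sketched with a thread pool from the standard library. This is not the library's implementation: fetch_in_batches and fake_fetch are hypothetical names, the fetch function is a stand-in that performs no network access, and only a float batch_delay is handled here.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def fetch_in_batches(
    url_list: List[str],
    fetch: Callable[[str], str],
    restriction_num: Optional[int] = None,
    batch_num: Optional[int] = None,
    batch_delay: float = 0.0,
    max_workers: int = 5,
) -> List[str]:
    """Fetch pages in order, pausing batch_delay seconds between batches."""
    urls = url_list if restriction_num is None else url_list[:restriction_num]
    size = batch_num if batch_num else max(len(urls), 1)  # No batching: one batch
    results: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(urls), size):
            batch = urls[start:start + size]
            results.extend(pool.map(fetch, batch))  # map() preserves input order
            if start + size < len(urls):
                time.sleep(batch_delay)  # Wait before starting the next batch
    return results


def fake_fetch(url: str) -> str:
    """Stand-in for a real request; returns a dummy page."""
    return f"<html>{url}</html>"


urls = [f"https://example.com/page/{i}" for i in range(7)]
pages = fetch_in_batches(urls, fake_fetch, restriction_num=5, batch_num=2)
```

Here only the first five URLs are fetched, in batches of two, two and one, and the results keep the order of url_list.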

image_crawler_utils.load_image_infos(json_file, encoding='UTF-8', display_progress=True, log=<image_crawler_utils.log.Log object>)[source]

Load the ImageInfo list from a JSON file.

ONLY WORKS IF the info can be JSON serialized.

Parameters:
  • json_file (str) – Name / Path of the JSON file.

  • encoding (str) – Encoding of the JSON file.

  • display_progress (bool) – Display a rich progress bar when running. Progress bar will be hidden after finishing.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

List of ImageInfo, or None if failed.

Return type:

list[ImageInfo] | None

image_crawler_utils.save_image_infos(image_info_list, json_file, encoding='UTF-8', display_progress=True, log=<image_crawler_utils.log.Log object>)[source]

Save the ImageInfo list into a JSON file.

ONLY WORKS IF the info can be JSON serialized.

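Parameters:
  • image_info_list – The list of ImageInfo to save.

  • json_file (str) – Name / Path of the JSON file.

  • encoding (str) – Encoding of the JSON file.

  • display_progress (bool) – Display a rich progress bar when running. Progress bar will be hidden after finishing.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns: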

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

async image_crawler_utils.update_nodriver_browser_cookies(browser, cookies)[source]

Update the cookies of a nodriver browser with the Cookies provided.

nodriver includes browser.cookies.set_all(), but it has a critical bug that has stayed unfixed for a long time, so this function serves as a replacement.

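Parameters:
  • browser – The nodriver browser whose cookies will be updated.

  • cookies – The Cookies to apply to the browser.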

Subpackages

Submodules

image_crawler_utils.configs module

class image_crawler_utils.configs.CapacityCountConfig(image_num=None, capacity=None, page_num=None)[source]

Bases: object

Contains configs that restrict the number of images, their total size, or the number of webpages.

Parameters:
  • image_num (int | None)

  • capacity (float | None)

  • page_num (int | None)

capacity: float | None = None

Total size of images (bytes) to be downloaded.

  • Default is set to None, meaning no restrictions.

  • When capacity is reached, no new downloading threads will be added. However, downloading threads that have already started will not be affected, so the actual total size may exceed the capacity.
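The overshoot described above follows from checking the limit only before a download starts. The sketch below is a sequential simplification of the threaded behavior, with a hypothetical function name and made-up sizes; it is meant to show the rule, not the library's code.

```python
from typing import List, Optional, Tuple


def simulate_capacity_limit(
    image_sizes: List[int], capacity: Optional[int]
) -> Tuple[int, List[int]]:
    """Start downloads until the running total has reached capacity."""
    downloaded: List[int] = []
    total = 0
    for size in image_sizes:
        if capacity is not None and total >= capacity:
            break  # Capacity reached: do not start a new download
        # A download that has started always runs to completion
        downloaded.append(size)
        total += size
    return total, downloaded


total, done = simulate_capacity_limit([400, 400, 400, 400], capacity=1000)
```

The third image starts while the total is still 800, so the final total is 1200 even though the capacity is 1000.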

image_num: int | None = None

The number of images to be parsed from websites or downloaded.

  • Default is set to None, meaning no restrictions.

  • Mostly only used in the Downloader to control the number of images to be downloaded, but some Parsers may also use this parameter.

page_num: int | None = None

Number of gallery pages to detect images in total. None means no restriction.

  • Default is set to None, meaning no restrictions.

  • Some websites, like Twitter / X, do not use gallery pages or JSON-API pages (Image Crawler Utils uses the method of scrolling webpages to get Twitter / X images), and this parameter is not used.

class image_crawler_utils.configs.DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True)[source]

Bases: object

Contains config for whether to display each level of debugging messages in the console.

Default set to “info” level.

Parameters:
classmethod level(level_str)[source]

Create a DebugConfig that is set to display messages at or above the given level. For example, setting it to “warning” will display warning, error and critical messages.

Parameters:

level_str (str) –

Must be one of (from lower to higher) “debug”, “info”, “warning”, “error”, “critical” or “silenced”.

  • Setting a logging level will display messages at and above this level. For example, .set_level("warning") will only display messages with “warning”, “error” and “critical” levels.

  • Setting the level to “silenced” will not display any messages except those generated by the progress bars.

Returns:

Created DebugConfig.

Return type:

DebugConfig

set_level(level_str)[source]

Set the current DebugConfig to display messages at or above the given level. For example, setting it to “warning” will display warning, error and critical messages.

Parameters:

level_str (str) –

Must be one of (from lower to higher) “debug”, “info”, “warning”, “error”, “critical” or “silenced”.

  • Setting a logging level will display messages at and above this level. For example, .set_level("warning") will only display messages with “warning”, “error” and “critical” levels.

  • Setting the level to “silenced” will not display any messages except those generated by the progress bars.
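The level ordering described above can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's implementation; LEVELS and flags_for_level are hypothetical names introduced for this sketch.

```python
# Hypothetical helper: maps a level string to the five show_* flags of a DebugConfig.
LEVELS = ("debug", "info", "warning", "error", "critical", "silenced")


def flags_for_level(level_str: str) -> dict:
    """Return show_* flags that are True for every level at or above level_str."""
    threshold = LEVELS.index(level_str.lower())  # raises ValueError for unknown levels
    return {
        f"show_{name}": index >= threshold
        for index, name in enumerate(LEVELS[:-1])  # "silenced" has no flag of its own
    }


warning_flags = flags_for_level("warning")
silenced_flags = flags_for_level("silenced")
```

With "warning", only show_warning, show_error and show_critical come out as True; with "silenced", every flag is False.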

show_critical: bool = True

Display critical-level messages.

  • Default set to True.

  • Includes messages about errors that interrupt the crawler. Usually a Python error will be raised when critical errors happen.

show_debug: bool = False

Display debug-level messages.

  • Default set to False.

  • Includes detailed information about running the crawler, especially connections with websites.

  • Setting show_debug to False will not stop debug messages from any .display_all_configs() from being displayed.

show_error: bool = True

Display error-level messages.

  • Default set to True.

  • Includes messages about errors that may affect the final results but do not interrupt the crawler.

show_info: bool = True

Display info-level messages.

  • Default set to True.

  • Includes basic information indicating the progress of the crawler.

show_warning: bool = True

Display warning-level messages.

  • Default set to True.

  • Includes messages about errors that generally do not affect the final results, mostly connection failures with the websites.

class image_crawler_utils.configs.DownloadConfig(headers=None, proxies=None, thread_delay=5, fail_delay=3, randomize_delay=True, thread_num=5, timeout=10, max_download_time=None, retry_times=5, overwrite_images=True)[source]

Bases: object

Contains config for downloading.

Parameters:
fail_delay: float = 3

Delaying time (seconds) after every failure.

  • Both fetching webpages and downloading images will use this parameter.

  • Some Parsers may use different parameters to control their delaying time when a failure happens.

headers: dict | Callable | None = None

Headers of the requests.

  • Both fetching webpages and downloading images will use this parameter.

  • Headers should be None, a dict or a callable function that returns a dict.
    • If you want a random header for every request, you can set headers to a callable function. This function should not accept any parameters (it can be implemented with a lambda) and should return a dict.

  • This only applies when the request is sent by requests (like requests.get()). For webpages loaded by browsers, this parameter is ignored.

  • Basically, this contains the user agent of the requests.
    • ATTENTION: Not all user agents are supported by the websites you are accessing!
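A callable headers value, as described above, might look like the following sketch. USER_AGENTS and random_headers are hypothetical names, and the user agent strings are placeholders that may not be accepted by a given website.

```python
import random

# Placeholder user agents; replace them with ones accepted by the site you crawl.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def random_headers() -> dict:
    """Take no parameters and return a freshly built headers dict on every call."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
```

A function like this could be passed as headers=random_headers, so that a new dict is generated each time the headers are needed.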

max_download_time: float | None = None

If no new data is fetched for max_download_time seconds while downloading an image, the download is treated as a failure.

  • Only downloading images will use this parameter.

  • Default is set to None, meaning no restrictions.

overwrite_images: bool = True

Overwrite existing images when downloading.

  • Only downloading images will use this parameter.

proxies: dict | Callable | None = None

Proxies used by the crawler.

  • Both fetching webpages and downloading images will use this parameter.

  • Proxies should be None, a dict or a callable function that returns a dict.
    • Setting it to None (default) lets the crawler use system proxies.

    • If you want a random proxy for every request, you can set proxies to a callable function. This function should not accept any parameters (it can be implemented with a lambda) and should return a dict.

  • Both requests and browsers use these proxies, but the structure should be in a requests-acceptable form like:
    • HTTP type: {'http': '127.0.0.1:7890'}

    • HTTPS type: {'https': '127.0.0.1:7890'}

    • SOCKS type: {'https': 'socks5://127.0.0.1:7890'}

    • If you input 'https' proxies, 'http' proxies will be automatically generated.

    • ATTENTION: Using usernames and passwords is currently not supported.
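The note about 'http' proxies being generated from 'https' proxies can be pictured with the sketch below. How the library actually derives the 'http' entry is not stated here; this sketch assumes it simply mirrors the 'https' value, and with_http_fallback is a hypothetical name.

```python
def with_http_fallback(proxies):
    """Return a copy of proxies in which a missing 'http' entry mirrors 'https'."""
    if proxies is None:
        return None  # None means "use system proxies" according to the text above
    result = dict(proxies)  # copy, so the caller's dict is left untouched
    if "https" in result and "http" not in result:
        result["http"] = result["https"]  # assumption: reuse the same address
    return result


https_only = with_http_fallback({"https": "127.0.0.1:7890"})
socks = with_http_fallback({"https": "socks5://127.0.0.1:7890"})
```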

randomize_delay: bool = True

Randomize thread_delay and fail_delay between 0 and their values.

  • For example, thread_delay=5.0 and randomize_delay=True will cause the thread_delay to choose a random value between 0 and 5.0 every time.
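The randomization can be sketched as follows. This is an illustration under the assumption that the value is drawn uniformly; resolve_delay is a hypothetical name and not part of the library.

```python
import random


def resolve_delay(delay: float, randomize_delay: bool = True) -> float:
    """Return delay itself, or a random value in [0, delay] when randomize_delay is True."""
    return random.uniform(0, delay) if randomize_delay else delay


fixed = resolve_delay(5.0, randomize_delay=False)  # always the configured value
samples = [resolve_delay(5.0) for _ in range(100)]  # a new value on every call
```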

property result_fail_delay: float

Generate the fail delay. If the randomize_delay attribute is set to True, the delay will be randomized between 0 and fail_delay for every usage.

property result_headers: dict | None

Generate the headers. If the headers attribute is callable, it will be called for every usage.

property result_proxies: dict | None

Generate the proxies. If the proxies attribute is callable, it will be called for every usage.

property result_thread_delay: float

Generate the thread delay. If the randomize_delay attribute is set to True, the delay will be randomized between 0 and thread_delay for every usage.

retry_times: int = 5

Total times of retrying to fetch a webpage / download an image.

  • Both fetching webpages and downloading images will use this parameter.

thread_delay: float = 5

Delaying time (seconds) before every thread starts.

  • Both fetching webpages and downloading images will use this parameter.

  • Some Parsers may use different parameters to control their delaying time.

thread_num: int = 5

Total number of threads.

  • Both fetching webpages and downloading images will use this parameter.

  • Some Parsers do not use threading to fetch pages, so this parameter is not used by them.

timeout: float | None = 10

Timeout for connection. When no response is returned in timeout seconds, a failure will happen.

  • Both fetching webpages and downloading images will use this parameter.

  • Setting it to None means (almost) no restrictions.

image_crawler_utils.keyword module

class image_crawler_utils.keyword.KeywordLogicTree(lchild='', rchild='', logic_operator='SINGLE')[source]

Bases: object

A binary tree to record the logic structure of keywords.

Parameters:
all_keywords()[source]

Return all keywords in this tree in a list.

Returns:

A list with all the keywords in this tree.

Return type:

list[str]

is_empty()[source]

Check whether current KeywordLogicTree is empty.

Returns:

A boolean denoting whether current tree is empty.

Return type:

bool

is_leaf()[source]

Whether current tree is a leaf node.

Returns:

A boolean denoting whether current node is a leaf node.

Return type:

bool

keyword_include_group_list()[source]

Returns a list of keyword groups (list of keywords) which are minimal supersets of current tree.

  • For example:

    • For “A AND B OR C”, its minimal supersets are ['A', 'C'] and ['B', 'C'].

      • That is, if you search “A OR C” or “B OR C”, you can get all results that match “A AND B OR C”.

    • For “A AND [B OR C]”, its minimal supersets are ['A'] and ['B', 'C'].

  • Useful for websites that have a restriction on the number of keywords when searching.

Returns:

A list of keyword groups (i.e. lists of keywords).

Return type:

list[list[str]]
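One way to arrive at the groups in the examples above is sketched here for trees that contain only AND and OR. It operates on a nested [left, operator, right] list rather than on a KeywordLogicTree, does not handle NOT, and is not necessarily how the library computes its result; include_groups is a hypothetical name.

```python
from itertools import product


def include_groups(node):
    """Return keyword groups whose OR-search covers every match of node.

    node is a keyword string or a [left, operator, right] list with
    operator in {"AND", "OR"}.
    """
    if isinstance(node, str):
        return [[node]]
    left, operator, right = node
    left_groups = include_groups(left)
    right_groups = include_groups(right)
    if operator == "AND":
        # Every match satisfies both sides, so a group covering either side is enough.
        return left_groups + right_groups
    if operator == "OR":
        # A match may come from either side, so each group must cover both sides.
        return [sorted(set(l) | set(r)) for l, r in product(left_groups, right_groups)]
    raise ValueError(f"Unsupported operator: {operator!r}")


first = include_groups([["A", "AND", "B"], "OR", "C"])   # "A AND B OR C"
second = include_groups(["A", "AND", ["B", "OR", "C"]])  # "A AND [B OR C]"
```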

keyword_list_check(keyword_list)[source]

Check whether the keyword list matches this tree.

  • For example, the keyword lists ['A', 'B'], ['C', 'D'] and ['A', 'B', 'C'] match “A AND B OR C”, while the keyword list ['B', 'D'] does not match “A AND B OR C”.

Parameters:

keyword_list (list[str]) – The keyword list to check.

Returns:

A boolean denoting if the keyword list matches this tree.

Return type:

bool
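The matching rule can be illustrated on a nested-list form like the one list_struct() is documented to return. The sketch below is not the library's code; matches is a hypothetical name, and the two-element ['NOT', x] shape for negation is an assumption.

```python
def matches(struct, keyword_list):
    """Evaluate a nested [left, operator, right] structure against a keyword list."""
    keywords = set(keyword_list)

    def evaluate(node):
        if isinstance(node, str):
            return node in keywords  # a leaf matches if the keyword is present
        if len(node) == 2 and node[0] == "NOT":  # assumed shape for negation
            return not evaluate(node[1])
        left, operator, right = node
        if operator == "AND":
            return evaluate(left) and evaluate(right)
        if operator == "OR":
            return evaluate(left) or evaluate(right)
        raise ValueError(f"Unknown operator: {operator!r}")

    return evaluate(struct)


tree = [["A", "AND", "B"], "OR", "C"]  # "A AND B OR C"
results = [matches(tree, kw) for kw in (["A", "B"], ["C", "D"], ["A", "B", "C"], ["B", "D"])]
```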

list_struct()[source]

Returns the structure of current tree as a recursive list.

  • For example, standard keyword string “A AND B OR C” will be returned as [['A', 'AND', 'B'], 'OR', 'C'].

Returns:

A list with the structure of this keyword tree.

Return type:

list

simplify_tree()[source]

Simplify the tree structure, including: NOT NOT key -> key and SINGLE key -> key.

  • If you create a KeywordLogicTree through the functions provided, .simplify_tree() will be automatically executed.

Return type:

None

standard_keyword_string()[source]

Returns the reconstructed standard keyword string.

  • The result may not be the same as the string that is used to construct the KeywordLogicTree.

  • For example, standard keyword string “A AND B OR C” will be returned as “[[A AND B] OR C]”.

Returns:

A standard keyword string.

Return type:

str
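The bracketed form in the example can be reproduced from a nested [left, operator, right] list with a short recursive function. This sketch covers binary operators only, and to_standard_string is a hypothetical name rather than a library function.

```python
def to_standard_string(node):
    """Render a nested [left, operator, right] list as a fully bracketed string."""
    if isinstance(node, str):
        return node  # a plain keyword is written as-is
    left, operator, right = node
    return f"[{to_standard_string(left)} {operator} {to_standard_string(right)}]"


rendered = to_standard_string([["A", "AND", "B"], "OR", "C"])
```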

lchild: str | KeywordLogicTree = ''

Left child.

logic_operator: str = 'SINGLE'

Logic operator. Can be one of “AND”, “OR”, “NOT” or “SINGLE”.

When it is “NOT” or “SINGLE”, lchild should be omitted.

“SINGLE” means this node has only one element rchild. After building a tree, use simplify_tree() to simplify these nodes.

rchild: str | KeywordLogicTree = ''

Right child.

image_crawler_utils.keyword.construct_keyword_tree(keyword_str, log=<image_crawler_utils.log.Log object>)[source]

Use a standard syntax to represent logic relationship of keywords. Use ‘ AND ‘ / ‘&’, ‘ OR ‘ / ‘|’, ‘ NOT ‘ / ‘!’ to represent logic operators.

Use ‘[’, ‘]’ to increase logic priority.

  • Any space between two keywords will be replaced with ‘_’ and thus be considered as one keyword.

    • Example: “A B & [C (extra) OR NOT D]” -> “A_B AND [C_(extra) OR NOT D]”

Parameters:
Returns:

If successful, returns a KeywordLogicTree. If failed, return None.

Return type:

KeywordLogicTree

image_crawler_utils.keyword.construct_keyword_tree_from_list(keyword_list, connect_symbol='OR', log=<image_crawler_utils.log.Log object>)[source]

Convert a list of keywords into a keyword tree connected by connect_symbol (default is “OR”).

e.g. ['A', 'B', 'C'] -> [['A' OR 'B'] OR 'C']

Parameters:
  • keyword_list (Iterable[str]) – A list of strings.

  • connect_symbol (str) – Logic symbol of connection. Must be one of ‘AND’, ‘OR’, ‘&’ or ‘|’.

  • log (image_crawler_utils.log.Log, None) – The logging config.

Returns:

If successful, returns a KeywordLogicTree. If failed, return None.

Return type:

KeywordLogicTree
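The left-to-right grouping shown in the example can be sketched with functools.reduce. The sketch builds a nested list instead of a KeywordLogicTree, and chain_keywords is a hypothetical name.

```python
from functools import reduce


def chain_keywords(keyword_list, connect_symbol="OR"):
    """Fold keywords from the left into nested [left, operator, right] lists."""
    operator = {"&": "AND", "|": "OR"}.get(connect_symbol, connect_symbol)
    if operator not in ("AND", "OR"):
        raise ValueError(f"Unsupported connect_symbol: {connect_symbol!r}")
    keywords = list(keyword_list)
    if not keywords:
        return []  # nothing to connect
    return reduce(lambda left, right: [left, operator, right], keywords)


chained = chain_keywords(["A", "B", "C"])
```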

image_crawler_utils.keyword.min_len_keyword_group(keyword_group_list, below=None)[source]

For a list of keyword groups (i.e. lists of keywords), get the keyword groups with the smallest length, or all keyword groups whose length is no larger than below.

Parameters:
  • keyword_group_list (list[list[str]]) – A list of keyword groups.

  • below (int) – If not None, try to return all keyword groups whose length is no larger than the “below” parameter. If no such group exists, return the groups with the smallest length.

Returns:

A list of keyword groups (i.e. lists of keywords)

Return type:

list[list]
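The selection rule can be sketched as below. The sketch assumes that "no larger than below" means len(group) <= below; min_len_groups is a hypothetical stand-in, not the library function.

```python
def min_len_groups(keyword_group_list, below=None):
    """Pick groups within the length limit, falling back to the shortest groups."""
    if not keyword_group_list:
        return []
    if below is not None:
        within_limit = [group for group in keyword_group_list if len(group) <= below]
        if within_limit:
            return within_limit
    shortest = min(len(group) for group in keyword_group_list)
    return [group for group in keyword_group_list if len(group) == shortest]


groups = [["A", "C"], ["B", "C", "D"], ["E"]]
only_shortest = min_len_groups(groups)
up_to_two = min_len_groups(groups, below=2)
```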

image_crawler_utils.log module

class image_crawler_utils.log.Log(log_file=None, debug_config=DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True), logging_level=10, detailed_console_log=False)[source]

Bases: object

Class provided for logging messages onto the console and into the file.

Parameters:
  • log_file (str) – Output name for the logging file. NO SUFFIX APPENDED. Setting it to None (default) will not output any file.

  • debug_config (image_crawler_utils.configs.DebugConfig) – Set the OUTPUT MESSAGE TO CONSOLE level. Default is the “info” level.

  • logging_level (str, int) –

    Set the logging level of the LOGGING FILE.

    • Select from: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL.

  • detailed_console_log (bool) – When logging info to the console, always log msg (the messages logged into files) even if output_msg exists.

critical(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output critical messages of errors that interrupt the crawler. Usually a Python error will be raised when critical errors happen.

Parameters:
debug(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output debug messages of many detailed information about running the crawler, especially connections with websites.

Parameters:
error(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output error messages of errors that may affect the final results but do not interrupt the crawler.

Parameters:
info(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output info messages of basic information indicating the progress of the crawler.

Parameters:
logging_file_handler()[source]

Return the file handler if logging into file, or None if not.

logging_file_path()[source]

Output the absolute path of logging file if exists, or None if not.

warning(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output warning messages of errors that basically do not affect the final results, mostly connection failures with the websites.

Parameters:
image_crawler_utils.log.print_logging_msg(msg, level='', debug_config=DebugConfig(show_debug=True, show_info=True, show_warning=True, show_error=True, show_critical=True), exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Print time and message according to its logging level. If debug_config is used and the logging level is not allowed to show, the message will not be output.

Parameters:
  • level (str) –

    Level of messages.

    • Should be one of “debug”, “info”, “warning”, “error”, “critical”.

    • Setting it to any other string or leaving it blank will always output the msg string without any prefix.

  • msg (str) – The message string to output.

  • debug_config (image_crawler_utils.configs.DebugConfig) – DebugConfig that controls output level. Default set to debug-level (output all).

  • exc_info – Please refer to the logging and rich.logging documentation.

  • stack_info (bool) – Please refer to the logging and rich.logging documentation.

  • stacklevel (int) – Please refer to the logging and rich.logging documentation.

  • extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.

image_crawler_utils.progress_bar module

class image_crawler_utils.progress_bar.CountColumn(table_column=None, has_unit=False)[source]

Bases: ProgressColumn

A rich.progress.ProgressColumn class, which displays current progress number in integer.

Parameters:
  • table_column – Table column of the ProgressColumn.

  • has_unit (bool) –

    Set to True will shorten the number with a unit. Default is False, meaning the number will be displayed directly.

    • For example, “10” will be displayed as “10”, while “12120” will be displayed as “12.1×10³”.

render(task)[source]

Show count completed.

Parameters:

task (Task)

Return type:

Text

class image_crawler_utils.progress_bar.CustomProgress(*columns, text_only=False, has_spinner=False, spinner_name='dots', has_total=True, is_file=False, time_format='%H:%M:%S', is_compact_time_format=True, is_sub_process=False, console=None, auto_refresh=True, refresh_per_second=20, speed_estimate_period=30, transient=False, redirect_stdout=True, redirect_stderr=True, get_time=None, disable=False, expand=False)[source]

Bases: Progress

A wrapped Progress class with specific format-controlling parameters.

If you add ProgressColumns to CustomProgress as you would with a normal rich.progress.Progress class, they will be placed between BarColumn & TaskProgressColumn (i.e. the progress bar and percentage) and TimeColumnLeft.
  • These ProgressColumns classes will be omitted if text_only is set to True.

Parameters:
  • text_only (bool) –

    If set to True, Progress bars will only display descriptions. Default is False.

    • When set to True, all other columns except rich.progress.TextColumn("[progress.description]{task.description}[reset]") will be omitted!

  • has_spinner (bool) – If set to True, a spinner will be added to the left. Default is False.

  • spinner_name (str) – The type of the spinner; the available names are listed at https://jsfiddle.net/sindresorhus/2eLtsbey/embedded/result/. Default is “dots”.

  • has_total (bool) – Set to True if involved tasks have total numbers. Default is True.

  • is_file (bool) – Set to True if involved tasks deal with files. Default is False.

  • time_format (str) –

    A string that controls the time format. Default is “%H:%M:%S”.

    • ‘%H’ will be replaced with hours.

    • ‘%M’ will be replaced with minutes.

    • ‘%S’ will be replaced with seconds.

    • ‘%L’ will be replaced with milliseconds.

  • is_compact_time_format (bool) –

    When set to True (default), the time_format will be truncated to start from ‘%M’ when time is lower than 1 hour.

    • For example, “%H:%M:%S.%L” will be truncated to “%M:%S.%L” when time is lower than 1 hour.

  • is_sub_process (bool) – Set to True if it is a subprocess of a certain ProgressGroup. Default is False.

  • console (Console, None) – Optional Console instance. Defaults to an internal Console instance writing to stdout.

  • auto_refresh (bool, None) – Enable auto refresh. If disabled, you will need to call refresh().

  • refresh_per_second (Optional[float], None) – Number of times per second to refresh the progress information, or None to use the rich default (10). Defaults to 20.

  • speed_estimate_period (float, None) – Period (in seconds) used to calculate the speed estimate. Defaults to 30.

  • transient (bool, None) – Clear the progress on exit. Defaults to False.

  • redirect_stdout (bool, None) – Enable redirection of stdout, so print may be used. Defaults to True.

  • redirect_stderr (bool, None) – Enable redirection of stderr. Defaults to True.

  • get_time (Callable, None) – A callable that gets the current time, or None to use Console.get_time. Defaults to None.

  • disable (bool, None) – Disable progress display. Defaults to False.

  • expand (bool, None) – Expand tasks table to fit width. Defaults to False.

finish_task(task, hide=True)[source]

Finish a task within the CustomProgress. Unless this CustomProgress is a preset attribute of ProgressGroup or its is_sub_process is set to True, running this function will stop the whole Progress; otherwise it will just stop the task.

Parameters:
  • task (rich.progress.Task) – The Task class that is created under this CustomProgress.

  • hide (bool) – Set to True (default) to hide the progress bar of this task.

class image_crawler_utils.progress_bar.ProgressGroup(progress_list=[], has_panel=True, panel_title=None, panel_subtitle=None, refresh_per_second=10)[source]

Bases: object

A Group of Progress, which can simplify building multiple Progress bars.

Parameters:
  • progress_list (list[rich.progress.Progress]) – An iterable list of rich.progress.Progress classes which will be added to the ProgressGroup when created. Default is [] (an empty list).

  • has_panel (bool) – When set to True (default), a rich.panel.Panel will be wrapped around all of the progress bars.

  • panel_title (str) –

    When set to a str, the title will be displayed at the top middle of the panel.

    • Works only if has_panel is set to True.

  • panel_subtitle (str) –

    When set to a str, the subtitle will be displayed at the bottom middle of the panel.

    • Works only if has_panel is set to True.

  • refresh_per_second (int) – Number of times per second to refresh the progress bars. Default is 10.

start()[source]

Start the ProgressGroup. That is, start the ProgressGroup().live attribute.

Return type:

None

stop()[source]

Stop the ProgressGroup. That is, stop the ProgressGroup().live attribute.

Return type:

None

class image_crawler_utils.progress_bar.SpeedColumnRight(table_column=None, is_file=False)[source]

Bases: ProgressColumn

A rich.progress.ProgressColumn class, which displays speed in “1.23 MB/s]” format. It is suggested to put it to the right of TimeColumnLeft.

Parameters:
  • table_column – Table column of the ProgressColumn.

  • is_file (bool) – Set to True if involved tasks deal with files. When set to True, the units will use ‘KB’, ‘MB’, etc. Default is False.

render(task)[source]

Renders speed.

Parameters:

task (Task)

Return type:

Text

class image_crawler_utils.progress_bar.TimeColumnLeft(table_column=None, has_total=True, time_format='%H:%M:%S', is_compact_time_format=True)[source]

Bases: ProgressColumn

A rich.progress.ProgressColumn class, which displays elapsed time and remaining time in “[00:08<00:03,” format. It is suggested to put it to the left of SpeedColumnRight.

Parameters:
  • table_column – Table column of the ProgressColumn.

  • has_total (bool) – Set to True if involved tasks have a total number. When set to False, remaining time will not be displayed. Default is True.

  • time_format (str) –

    A string that controls the time format. Default is “%H:%M:%S”.

    • ‘%H’ will be replaced with hours.

    • ‘%M’ will be replaced with minutes.

    • ‘%S’ will be replaced with seconds.

    • ‘%L’ will be replaced with milliseconds.

  • is_compact_time_format (bool) –

    When set to True (default), the time_format will be truncated to start from ‘%M’ when time is lower than 1 hour.

    • For example, “%H:%M:%S.%L” will be truncated to “%M:%S.%L” when time is lower than 1 hour.
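The time_format placeholders and the compact truncation can be sketched as follows. This illustrates the documented rules only and is not the library's rendering code; format_elapsed is a hypothetical name, and zero-padding the fields is an assumption.

```python
def format_elapsed(seconds, time_format="%H:%M:%S", is_compact_time_format=True):
    """Fill %H, %M, %S and %L in time_format with parts of a duration in seconds."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    if is_compact_time_format and hours == 0 and "%M" in time_format:
        # Below one hour, drop everything before '%M'.
        time_format = time_format[time_format.index("%M"):]
    return (
        time_format.replace("%H", f"{hours:02d}")
        .replace("%M", f"{minutes:02d}")
        .replace("%S", f"{secs:02d}")
        .replace("%L", f"{millis:03d}")
    )


short = format_elapsed(8)
long = format_elapsed(3725.5, "%H:%M:%S.%L")
```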

render(task)[source]

Renders elapsed time and remaining time.

Parameters:

task (Task)

Return type:

Text

image_crawler_utils.utils module

class image_crawler_utils.utils.Empty[source]

Bases: object

An empty placeholder class, mainly for checking if a parameter is used.

image_crawler_utils.utils.attempt_name_len()[source]

Try to calculate the length of names.

Returns:

The length of shortened file name. If terminal size is available, the result will be \(\left\lfloor\frac{\text{terminal_size} - 10}{5}\right\rfloor\). Otherwise, the result will be 10.

Return type:

int
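The formula above can be sketched with the standard library. The sketch assumes that "terminal size" means the number of terminal columns; name_len_for_width and guess_name_len are hypothetical names.

```python
import os


def name_len_for_width(terminal_width):
    """Apply floor((terminal_width - 10) / 5), or fall back to 10 without a width."""
    if terminal_width is None:
        return 10
    return (terminal_width - 10) // 5


def guess_name_len():
    """Query the terminal width; os.get_terminal_size() raises OSError without a terminal."""
    try:
        width = os.get_terminal_size().columns
    except OSError:
        width = None
    return name_len_for_width(width)


for_80_columns = name_len_for_width(80)
```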

image_crawler_utils.utils.check_dir(dir_path, log=<image_crawler_utils.log.Log object>)[source]

This function will check whether a directory exists, and try to create it if it does not. A logging message will be printed to the console on success, and a critical error will be thrown on failure.

Parameters:
Return type:

None

image_crawler_utils.utils.load_dataclass(dataclass_to_load, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Load the file containing a dataclass into the dataclass_to_load parameter.

The dataclass should be the same as the one you once saved.

Parameters:
  • dataclass_to_load (A dataclass) – The dataclass to be loaded.

  • file_name (str) – Name of the file.

  • file_type (str, Optional) –

    If no suffix is found in file_name, designate the file type manually.

    • Setting the file_type parameter to json or pkl will force the function to treat the file as this type, while leaving this parameter blank will cause the function to determine the file type according to file_name.

      • That is, load_dataclass(dataclass, 'foo.json') works the same as load_dataclass(dataclass, 'foo.json', 'json').

  • encoding (str) – Encoding of JSON file. Only works when loading from .json.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

Loaded dataclass_to_load, or None if failed.

Return type:

Any

image_crawler_utils.utils.save_dataclass(dataclass, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Save the dataclass parameter into a file.

Parameters:
  • dataclass (A dataclass) – The dataclass to be saved.

  • file_name (str) – Name of file. Suffix (.json / .pkl) is optional.

  • file_type (str, Optional) –

    If no suffix is found in file_name, designate the file type manually.

    • Setting the file_type parameter to json or pkl will force the function to save the dataclass (config) as this type, while leaving this parameter blank will cause the function to determine the file type according to file_name.

      • That is, save_dataclass(dataclass, 'foo.json') works the same as save_dataclass(dataclass, 'foo', 'json').

      • .json is suggested when your dataclasses (configs) do not include objects that cannot be JSON-serialized (e.g. a function), while serialized data file .pkl can support most data types but the saved file is not readable.

  • encoding (str) – Encoding of JSON file. Only works when saving as .json.

  • log (image_crawler_utils.log.Log, None) –

    Logging config.

    • You can use log=crawler_settings.log to make it the same as the CrawlerSettings you set up.

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None
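The interplay of file_name and file_type can be pictured with a standard-library sketch. It is not the library's implementation: DemoConfig and save_sketch are hypothetical, and pickling the asdict() result rather than the object itself is a simplification made here.

```python
import json
import pickle
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DemoConfig:  # hypothetical dataclass used only for this sketch
    thread_num: int = 5
    timeout: float = 10.0


def save_sketch(obj, file_name, file_type=None):
    """Save a dataclass as .json or .pkl; an explicit file_type wins over the suffix."""
    path = Path(file_name)
    kind = (file_type or path.suffix.lstrip(".")).lower()
    if kind not in ("json", "pkl"):
        raise ValueError(f"Cannot determine the file type of {str(file_name)!r}")
    if path.suffix.lower() != f".{kind}":
        path = path.with_name(f"{path.name}.{kind}")  # append the missing suffix
    if kind == "json":
        path.write_text(json.dumps(asdict(obj), indent=4), encoding="UTF-8")
    else:
        path.write_bytes(pickle.dumps(asdict(obj)))
    return path.name, str(path.resolve())


with tempfile.TemporaryDirectory() as tmp_dir:
    by_suffix = save_sketch(DemoConfig(), Path(tmp_dir) / "foo.json")
    by_type = save_sketch(DemoConfig(), Path(tmp_dir) / "foo", "json")
    loaded = json.loads(Path(by_type[1]).read_text(encoding="UTF-8"))
```

Both calls end up writing foo.json, which mirrors the equivalence stated in the parameter description.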

async image_crawler_utils.utils.set_up_nodriver_browser(proxies=None, headless=True, window_width=None, window_height=None, no_image_stylesheet=False)[source]

Set up a nodriver browser in a more convenient way.

WARNING: nodriver uses async functions. This function is async as well!

Parameters:
  • proxies (dict, Callable, None) –

    The proxies used in nodriver browser.

    • The pattern should be in a requests-acceptable form like:

      • HTTP type: {'http': '127.0.0.1:7890'}

      • HTTPS type: {'https': '127.0.0.1:7890'}, or {'https': '127.0.0.1:7890', 'http': '127.0.0.1:7890'}

      • SOCKS type: {'https': 'socks5://127.0.0.1:7890'}

  • headless (bool) – Do not display the browser window when a browser is started. Setting it to False will pop up browser windows.

  • window_width (int, None) –

    Width of the browser window. Setting it to None will maximize the window.

    • This parameter is ignored when headless is set to True.

  • window_height (int, None) –

    Height of the browser window. Setting it to None will maximize the window.

    • This parameter is ignored when headless is set to True.

  • no_image_stylesheet (bool) –

    Do not load any images or stylesheet when loading webpages in this browser.

    • Setting this parameter to True can reduce the traffic when loading pages and speed up loading.

Returns:

The nodriver browser.

Return type:

Browser

image_crawler_utils.utils.shorten_file_name(file_name, name_len=10)[source]

Shorten file name for displaying on console.

Parameters:
  • file_name (str) – Name of the file.

  • name_len (int, None) – Maximum length allowed. Set to None (default) will use IMAGE_NAME_LEN above.

Returns:

Shortened file name.

Return type:

str

image_crawler_utils.utils.silent_deconstruct_browser(log=<image_crawler_utils.log.Log object>)[source]

I have had enough with nodriver’s annoying removing temp file messages. This function will do the same thing without those spamming messages.

Parameters:

log (image_crawler_utils.log.Log) – Displaying those spamming messages.

image_crawler_utils.utils.suppress_print()[source]

Suppress built-in print so that it may not output anything.

For example:

from image_crawler_utils.utils import suppress_print

with suppress_print():
    print("This message will not be displayed.")
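How suppress_print() works internally is not described here. A context manager with a similar effect can be sketched with contextlib.redirect_stdout from the standard library; quiet_print is a hypothetical name and may differ from the library's approach.

```python
import contextlib
import io


@contextlib.contextmanager
def quiet_print():
    """Discard everything written to stdout inside the with block."""
    with contextlib.redirect_stdout(io.StringIO()):
        yield


# Capture stdout so that the effect can be inspected afterwards.
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    with quiet_print():
        print("hidden")   # swallowed by quiet_print()
    print("visible")      # written to the captured buffer
```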