CrawlerSettings Class

The CrawlerSettings class holds the basic information and configuration used by Parsers and Downloaders. It can be imported from the image_crawler_utils package.

Always prepare a CrawlerSettings before you start crawling.

Detailed Information

class image_crawler_utils.CrawlerSettings(capacity_count_config=<image_crawler_utils.utils.Empty object>, download_config=<image_crawler_utils.utils.Empty object>, debug_config=DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True), image_num=None, capacity=None, page_num=None, headers=None, proxies=None, thread_delay=5, fail_delay=3, randomize_delay=True, thread_num=5, timeout=30.0, max_download_time=None, retry_times=5, overwrite_images=True, detailed_console_log=False, extra_configs=None)[source]

Bases: object

A general framework of settings for running a crawler.

Parameters:
  • capacity_count_config (image_crawler_utils.configs.CapacityCountConfig, None) –

    Contains configs that restrict the number of downloads and their total capacity.

    • If this parameter is used, the image_num, capacity and page_num parameters will be ignored.

  • download_config (image_crawler_utils.configs.DownloadConfig, None) –

    Contains configs about parameters in downloading.

    • If this parameter is used, the headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_times and overwrite_images parameters will be ignored.

  • debug_config (image_crawler_utils.configs.DebugConfig, None) – Contains configs that define which types of messages are shown on the console.

  • image_num (int, None) – Number of images to be parsed / downloaded in total; None means no restriction.

  • capacity (float, None) – Total size of images (MB); None means no restriction.

  • page_num (int, None) – Number of gallery pages to detect images in total; None means no restriction.

  • headers (dict, Callable, None) – Headers settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.

  • proxies (dict, Callable, None) – Proxy settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.

  • thread_delay (float) – Waiting time (s) after thread start.

  • fail_delay (float) – Waiting time (s) after failing.

  • randomize_delay (bool) – Randomize thread_delay and fail_delay between 0 and their configured values.

  • thread_num (int) – Downloading thread num.

  • timeout (float, None) – Timeout for requests. Set to None means no timeout.

  • max_download_time (float, None) – Maximum download time for an image. Setting it to None means no time limit.

  • retry_times (int) – Times of retrying to download.

  • overwrite_images (bool) – Overwrite existing images.

  • detailed_console_log (bool) –

    Logging detailed information onto the console.

    • It means that when logging info to the console, always log msg (the messages logged into files) even if output_msg exists.

  • extra_configs (dict, None) – This optional dict is not used in any of the supported sites and crawling tasks, as it is reserved for developing your custom image crawler.
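
As a minimal sketch of the constructor described above, the snippet below first passes a few of the listed keyword arguments directly and then groups the download-related ones into a DownloadConfig. The values (20 images, 3 threads and so on) are arbitrary choices for illustration, and the import is wrapped in try/except only so that the snippet still runs where the package is not installed.

```python
# Keyword arguments taken from the parameter list above; the values are arbitrary.
direct_kwargs = {
    "image_num": 20,     # stop after 20 images
    "thread_num": 3,     # use 3 downloading threads
    "timeout": 15.0,     # request timeout
    "retry_times": 2,    # retry count for downloads
}

try:
    from image_crawler_utils import CrawlerSettings
    from image_crawler_utils.configs import DownloadConfig
except ImportError:
    CrawlerSettings = None  # package not installed; nothing is constructed below

if CrawlerSettings is not None:
    # Option 1: pass the individual parameters directly.
    settings = CrawlerSettings(**direct_kwargs)

    # Option 2: pass a DownloadConfig. According to the parameter notes above,
    # the individual download parameters are then ignored.
    settings_from_config = CrawlerSettings(
        image_num=20,
        download_config=DownloadConfig(thread_num=3, timeout=15.0, retry_times=2),
    )
```

Grouping values into a config object is convenient when the same download settings are reused across several CrawlerSettings instances.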

classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]

Load CrawlerSettings from .pkl file.

Returns:

A CrawlerSettings class loaded from pkl file, or None if failed.

Return type:

CrawlerSettings | None

browser_test(url, headless=True, stay_time=30)[source]

Test whether browser works normally.

ATTENTION: This function does NOT check the connectivity of the URL. Use connectivity_test() instead.

Parameters:
  • url (str) – URL to load in the browser for the test.

  • headless (bool) –

    Whether to hide the browser window when testing chromedriver. With the window shown, you can quickly check whether the page loads correctly.

    • Setting headless to False pops up the browser window and displays the whole process of loading the webpage.

  • stay_time (float) – If headless is False, the window stays open for stay_time seconds.

Returns:

A bool: True if the connection succeeds, False otherwise.

connectivity_test(url)[source]

Test internet connectivity.

Uses the settings in download_config.

Parameters:

url (str) – Test connectivity using this URL.

Returns:

A bool: True if the connection succeeds, False otherwise.

copy(capacity_count_config=<image_crawler_utils.utils.Empty object>, download_config=<image_crawler_utils.utils.Empty object>, debug_config=<image_crawler_utils.utils.Empty object>, image_num=<image_crawler_utils.utils.Empty object>, capacity=<image_crawler_utils.utils.Empty object>, page_num=<image_crawler_utils.utils.Empty object>, headers=<image_crawler_utils.utils.Empty object>, proxies=<image_crawler_utils.utils.Empty object>, thread_delay=<image_crawler_utils.utils.Empty object>, fail_delay=<image_crawler_utils.utils.Empty object>, randomize_delay=<image_crawler_utils.utils.Empty object>, thread_num=5, timeout=<image_crawler_utils.utils.Empty object>, max_download_time=<image_crawler_utils.utils.Empty object>, retry_times=<image_crawler_utils.utils.Empty object>, overwrite_images=<image_crawler_utils.utils.Empty object>, extra_configs=<image_crawler_utils.utils.Empty object>)[source]

Generate a copy of a CrawlerSettings, with (optional) parameters changed.

Parameters:
  • capacity_count_config (image_crawler_utils.configs.CapacityCountConfig, None) –

    Contains configs that restrict the number of downloads and their total capacity.

    • If this parameter is used, the image_num, capacity and page_num parameters will be ignored.

  • download_config (image_crawler_utils.configs.DownloadConfig, None) –

    Contains configs about parameters in downloading.

    • If this parameter is used, the headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_times and overwrite_images parameters will be ignored.

  • debug_config (image_crawler_utils.configs.DebugConfig, None) – Contains configs that define which types of messages are shown on the console.

  • image_num (int, None) – Number of images to be parsed / downloaded in total; None means no restriction.

  • capacity (float, None) – Total size of images (MB); None means no restriction.

  • page_num (int, None) – Number of gallery pages to detect images in total; None means no restriction.

  • headers (dict, Callable, None) – Headers settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.

  • proxies (dict, Callable, None) – Proxy settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.

  • thread_delay (float) – Waiting time (s) after thread start.

  • fail_delay (float) – Waiting time (s) after failing.

  • randomize_delay (bool) – Randomize thread_delay and fail_delay between 0 and their configured values.

  • thread_num (int) – Downloading thread num.

  • timeout (float, None) – Timeout for requests. Set to None means no timeout.

  • max_download_time (float, None) – Maximum download time for an image. Setting it to None means no time limit.

  • retry_times (int) – Times of retrying to download.

  • overwrite_images (bool) – Overwrite existing images.

  • detailed_console_log (bool) –

    Logging detailed information onto the console.

    • It means that when logging info to the console, always log msg (the messages logged into files) even if output_msg exists.

  • extra_configs (dict, None) – This optional dict is not used in any of the supported sites and crawling tasks, as it is reserved for developing your custom image crawler.
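
The Empty defaults in the copy() signature suggest a sentinel pattern: an argument left at its default keeps the current value, while an explicit None is treated as a real value. The following is a sketch of that pattern only, not the library's implementation; SketchSettings and _EMPTY are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, replace
from typing import Optional

_EMPTY = object()  # hypothetical sentinel meaning "argument not supplied"


@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical stand-in with three of the documented fields."""
    image_num: Optional[int] = None
    thread_num: int = 5
    timeout: Optional[float] = 30.0

    def copy(self, image_num=_EMPTY, thread_num=_EMPTY, timeout=_EMPTY):
        """Return a new object, overriding only the arguments that were supplied."""
        supplied = {
            name: value
            for name, value in (
                ("image_num", image_num),
                ("thread_num", thread_num),
                ("timeout", timeout),
            )
            if value is not _EMPTY
        }
        return replace(self, **supplied)


base = SketchSettings(image_num=100)
no_timeout = base.copy(timeout=None)     # None is a real value here: "no timeout"
fewer_threads = base.copy(thread_num=2)  # everything else is carried over
```

A plain None default could not express "leave unchanged", because None already means "no restriction" or "no timeout" for several parameters.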

dill_base64_sha256_data()[source]

Return the serialized bytes of current CrawlerSettings, base64 encoded str of the bytes, and sha256 str of the bytes.

Returns:

A dict like {“bytes”: dill.dumps() bytes, “base64”: base64 encoding of the dill.dumps() bytes, “sha256”: sha256 encoding of the base64 code}

Return type:

dict
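
The three values described above can be illustrated with a standard-library sketch. It swaps pickle in for dill so that it runs without extra packages, and it assumes the sha256 value is the hex digest of the ASCII base64 text; serialize_summary is a hypothetical helper, not part of the package.

```python
import base64
import hashlib
import pickle


def serialize_summary(obj):
    """Return the serialized bytes, their base64 text, and a sha256 of that text."""
    raw_bytes = pickle.dumps(obj)  # the documented method uses dill.dumps() instead
    base64_text = base64.b64encode(raw_bytes).decode("ascii")
    # Assumption: the hash is taken over the base64 text and reported as hex.
    sha256_text = hashlib.sha256(base64_text.encode("ascii")).hexdigest()
    return {"bytes": raw_bytes, "base64": base64_text, "sha256": sha256_text}


summary = serialize_summary({"thread_num": 5, "timeout": 30.0})
```

Because the digest depends only on the serialized content, it can serve as a stable file name for identical settings.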

display_all_configs()[source]

Display all config info.

Dataclasses will be displayed in a neater way.

save_to_pkl(pkl_file=None)[source]

Save the CrawlerSettings in a pkl file.

It is recommended to use the default file name, which uses the sha256 encoded string generated by dill_base64_sha256_data().

Parameters:
  • path (str) – Path to save the pkl file. Default is saving to the current path.

  • pkl_file (str, None) – Name of the pkl file. (Suffix is optional.) Default is using the sha256 encoded string generated by dill_base64_sha256_data().

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

set_logging_file(log_file, logging_level=10)[source]

Set the file to be logged into.

It is recommended to add a logging file when running the crawler, as the message displayed on the console is simplified and usually not complete.

PAY ATTENTION: You cannot set logging files when creating a CrawlerSettings. Logging files can only be set through this function.

Parameters:
  • log_file (str) – Output name for the logging file. Suffix (.json) is optional. Setting it to None (default) outputs no file.

  • logging_level (int) –

    Set the logging level of the LOGGING FILE. Select from: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL (import logging library first, or use the word string directly).

    • ATTENTION: It is independent of the level of logging onto the console!
      • The latter is controlled by the debug_config parameter, which in turn does not affect logging into files.

Returns:

The changed CrawlerSettings itself.
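
The independence of console and file logging levels can be sketched with Python's standard logging module. This only illustrates the idea and is not how the package implements it; the logger name is hypothetical, and in-memory streams stand in for the console and the log file.

```python
import io
import logging

logger = logging.getLogger("sketch_crawler_log")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep the sketch's messages out of the root logger

console_output = io.StringIO()  # stands in for the console
file_output = io.StringIO()     # stands in for the logging file

console_handler = logging.StreamHandler(console_output)
console_handler.setLevel(logging.WARNING)  # console shows warnings and above

file_handler = logging.StreamHandler(file_output)
file_handler.setLevel(logging.DEBUG)  # logging.DEBUG == 10, the default logging_level

logger.addHandler(console_handler)
logger.addHandler(file_handler)

logger.debug("fetching page 1")    # reaches the file only
logger.warning("retrying page 1")  # reaches both destinations
```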

Configs, and how to Save and Load them

When setting up a CrawlerSettings class, the parameters capacity_count_config, download_config and debug_config can be filled with instances of CapacityCountConfig, DownloadConfig and DebugConfig, respectively. You can also save these configs and load them for later use.

class image_crawler_utils.configs.CapacityCountConfig(image_num=None, capacity=None, page_num=None)[source]

Bases: object

Contains config that restricts the number of images, their total size, or the number of webpages.

Parameters:
  • image_num (int | None)

  • capacity (float | None)

  • page_num (int | None)

capacity: float | None = None

Total size of images (bytes) to be downloaded.

  • Default is set to None, meaning no restrictions.

  • When capacity is reached, no new downloading threads will be added. However, downloading threads that have already started will not be affected, which means the actual total image size may exceed the capacity.
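
Why the downloaded total can end up above the limit is easiest to see in a simplified, sequential simulation. This is only an illustration of the rule described above, not the package's threading logic; simulate_downloads and the sizes are hypothetical, and the units are left abstract.

```python
def simulate_downloads(image_sizes, capacity):
    """Start a download only while the running total is still below capacity."""
    downloaded_total = 0
    started = []
    for size in image_sizes:
        if capacity is not None and downloaded_total >= capacity:
            break  # capacity reached: no new downloads are started
        started.append(size)
        downloaded_total += size  # a download that has started runs to completion
    return started, downloaded_total


# Four images of size 40 with a capacity of 100: the third download starts at a
# running total of 80, finishes at 120, and only then blocks the fourth one.
started, total = simulate_downloads([40, 40, 40, 40], capacity=100)
```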

image_num: int | None = None

The number of images to be parsed from websites or downloaded.

  • Default is set to None, meaning no restrictions.

  • Mostly only used in the Downloader to control the number of images to be downloaded, but some Parsers may also use this parameter.

page_num: int | None = None

Number of gallery pages to detect images in total. None means no restriction.

  • Default is set to None, meaning no restrictions.

  • Some websites, like Twitter / X, do not use gallery pages or JSON-API pages (Image Crawler Utils uses the method of scrolling webpages to get Twitter / X images), and this parameter is not used.

class image_crawler_utils.configs.DownloadConfig(headers=None, proxies=None, thread_delay=5, fail_delay=3, randomize_delay=True, thread_num=5, timeout=10, max_download_time=None, retry_times=5, overwrite_images=True)[source]

Bases: object

Contains config for downloading.

fail_delay: float = 3

Delaying time (seconds) after every failure.

  • Both fetching webpages and downloading images will use this parameter.

  • Some Parsers may use different parameters to control their delaying time when a failure happens.

headers: dict | Callable | None = None

Headers of the requests.

  • Both fetching webpages and downloading images will use this parameter.

  • Headers should be None, a dict or a callable function that returns a dict.
    • If you want a random header for every request, you can set headers to a callable function. This function should not accept any parameters (a lambda works) and should return a dict.

  • This only works when the request is sent by the requests library (like requests.get()). For webpages loaded by browsers, this parameter is ignored.

  • Basically, this contains the user agent of the requests.
    • ATTENTION: Not all user agents are supported by the websites you are accessing!
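
A callable headers value, as described above, could look like the sketch below. The user-agent strings are shortened placeholders, and resolve_headers is a hypothetical helper that mimics the "None, dict or callable" rule; it is not the package's own code.

```python
import random

# Placeholder user agents; replace them with strings your target site accepts.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def random_headers():
    """Zero-argument callable that returns a fresh headers dict on every call."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def resolve_headers(headers):
    """Hypothetical helper: call headers if it is callable, otherwise return it as-is."""
    return headers() if callable(headers) else headers


fixed = resolve_headers({"User-Agent": USER_AGENTS[0]})  # a dict is used unchanged
rotating = resolve_headers(random_headers)               # a callable is called each time
```

Passing random_headers (the function itself, without parentheses) as the headers value is what allows a new dict to be produced for every request.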

max_download_time: float | None = None

If no new data is received for max_download_time seconds while downloading an image, the download is treated as a failure.

  • Only downloading images will use this parameter.

  • Default is set to None, meaning no restrictions.

overwrite_images: bool = True

Overwrite existing images when downloading.

  • Only downloading images will use this parameter.

proxies: dict | Callable | None = None

Proxies used by the crawler.

  • Both fetching webpages and downloading images will use this parameter.

  • Proxies should be None, a dict or a callable function that returns a dict.
    • Setting it to None (default) lets the crawler use system proxies.

    • If you want a random proxy for every request, you can set proxies to a callable function. This function should not accept any parameters (a lambda works) and should return a dict.

  • Both requests and browsers use these proxies, but the structure should be in a requests-acceptable form like:
    • HTTP type: {'http': '127.0.0.1:7890'}

    • HTTPS type: {'https': '127.0.0.1:7890'}

    • SOCKS type: {'https': 'socks5://127.0.0.1:7890'}

    • If you input 'https' proxies, 'http' proxies will be automatically generated.

    • ATTENTION: Using usernames and passwords is currently not supported.
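
A callable proxies value can also rotate through a fixed pool instead of choosing at random. The sketch below assumes two local placeholder proxies (port 7891 is made up for the example) and uses itertools.cycle; next_proxies is a hypothetical name.

```python
from itertools import cycle

# Placeholder proxies in the requests-style form shown above.
PROXY_POOL = cycle([
    {"https": "127.0.0.1:7890"},
    {"https": "socks5://127.0.0.1:7891"},
])


def next_proxies():
    """Zero-argument callable returning the next proxies dict from the pool."""
    return dict(next(PROXY_POOL))  # copy so callers cannot mutate the pool entries


first = next_proxies()
second = next_proxies()
third = next_proxies()  # the pool wraps around to the first entry
```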

randomize_delay: bool = True

Randomize thread_delay and fail_delay between 0 and their values.

  • For example, thread_delay=5.0 and randomize_delay=True will cause thread_delay to take a random value between 0 and 5.0 every time.

property result_fail_delay: float

Generate the fail delay. If the randomize_delay attribute is set to True, the delay will be randomized between 0 and fail_delay for every usage.

property result_headers: dict | None

Generate the headers. If the headers attribute is callable, it will be called for every usage.

property result_proxies: dict | None

Generate the proxies. If the proxies attribute is callable, it will be called for every usage.

property result_thread_delay: float

Generate the thread delay. If the randomize_delay attribute is set to True, the delay will be randomized between 0 and thread_delay for every usage.
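
The behaviour of the result_* delay properties can be sketched with a small dataclass. SketchDelayConfig is a hypothetical stand-in, and the use of random.uniform is an assumption; the text above only states that the value is randomized between 0 and the configured delay.

```python
import random
from dataclasses import dataclass


@dataclass
class SketchDelayConfig:
    """Hypothetical stand-in for the delay-related fields of DownloadConfig."""
    thread_delay: float = 5
    randomize_delay: bool = True

    @property
    def result_thread_delay(self) -> float:
        # Assumption: a uniform draw between 0 and thread_delay on every access.
        if self.randomize_delay:
            return random.uniform(0, self.thread_delay)
        return self.thread_delay


randomized = SketchDelayConfig()
fixed = SketchDelayConfig(randomize_delay=False)
samples = [randomized.result_thread_delay for _ in range(100)]
```

Reading the property repeatedly gives a new value each time when randomization is on, which spreads out thread start times.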

retry_times: int = 5

Total times of retrying to fetch a webpage / download an image.

  • Both fetching webpages and downloading images will use this parameter.

thread_delay: float = 5

Delaying time (seconds) before every thread starts.

  • Both fetching webpages and downloading images will use this parameter.

  • Some Parsers may use different parameters to control their delaying time.

thread_num: int = 5

Total number of threads.

  • Both fetching webpages and downloading images will use this parameter.

  • Some Parsers do not use threading to fetch pages, in which case this parameter is not used.

timeout: float | None = 10

Timeout for connection. When no response is returned in timeout seconds, a failure will happen.

  • Both fetching webpages and downloading images will use this parameter.

  • Setting it to None means (virtually) no restriction.

class image_crawler_utils.configs.DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True)[source]

Bases: object

Contains config for whether to display each level of debugging messages in the console.

The default is the “info” level.

classmethod level(level_str)[source]

Create a DebugConfig that displays messages at or above the given level. For example, “warning” will display warning, error and critical messages.

Parameters:

level_str (str) –

Must be one of (from lower to higher) “debug”, “info”, “warning”, “error”, “critical” or “silenced”.

  • Setting a logging level will display messages at and above this level. For example, .set_level("warning") will only display messages with “warning”, “error” and “critical” levels.

  • Setting the level to “silenced” will not display any messages except those generated by the progress bars.

Returns:

Created DebugConfig.

Return type:

DebugConfig

set_level(level_str)[source]

Set the current DebugConfig to display messages at or above the given level. For example, “warning” will display warning, error and critical messages.

Parameters:

level_str (str) –

Must be one of (from lower to higher) “debug”, “info”, “warning”, “error”, “critical” or “silenced”.

  • Setting a logging level will display messages at and above this level. For example, .set_level("warning") will only display messages with “warning”, “error” and “critical” levels.

  • Setting the level to “silenced” will not display any messages except those generated by the progress bars.
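
The threshold rule described above maps a level name onto the five show_* flags. The sketch below is an illustration under that description, not the package's code; flags_for_level is a hypothetical helper.

```python
LEVELS = ("debug", "info", "warning", "error", "critical")  # from lower to higher


def flags_for_level(level_str):
    """Return the show_* flags implied by a level name, or all False for "silenced"."""
    level_str = level_str.lower()
    if level_str == "silenced":
        threshold = len(LEVELS)  # above every level, so nothing is shown
    else:
        threshold = LEVELS.index(level_str)  # raises ValueError for unknown names
    return {f"show_{name}": index >= threshold for index, name in enumerate(LEVELS)}


warning_flags = flags_for_level("warning")
silenced_flags = flags_for_level("silenced")
```

With "info", the result matches the documented defaults: show_debug is False and the other four flags are True.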

show_critical: bool = True

Display critical-level messages.

  • Default set to True.

  • Include messages of errors that interrupt the crawler. Usually a Python error will be raised when critical errors happen.

show_debug: bool = False

Display debug-level messages.

  • Default set to False.

  • Includes messages with detailed information about running the crawler, especially connections with websites.

  • Setting show_debug to False will not suppress the debug messages displayed by .display_all_configs().

show_error: bool = True

Display error-level messages.

  • Default set to True.

  • Include messages of errors that may affect the final results but do not interrupt the crawler.

show_info: bool = True

Display info-level messages.

  • Default set to True.

  • Include messages of basic information indicating the progress of the crawler.

show_warning: bool = True

Display warning-level messages.

  • Default set to True.

  • Include messages of errors that basically do not affect the final results, mostly connection failures with the websites.

Saving and Loading Configs

Use the two functions below to save and load these Configs.

image_crawler_utils.utils.save_dataclass(dataclass, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Save the dataclass parameter into a file.

Parameters:
  • dataclass (A dataclass) – The dataclass to be saved.

  • file_name (str) – Name of file. Suffix (.json / .pkl) is optional.

  • file_type (str, Optional) –

    If no suffix is found in file_name, designate the file type manually.

    • Setting the file_type parameter to json or pkl forces the function to save the dataclass (config) as that type; leaving it blank makes the function determine the file type from file_name.

      • That is, save_dataclass(dataclass, 'foo.json') works the same as save_dataclass(dataclass, 'foo', 'json').

      • .json is suggested when your dataclasses (configs) do not include objects that cannot be JSON-serialized (e.g. a function), while serialized data file .pkl can support most data types but the saved file is not readable.

  • encoding (str) – Encoding of JSON file. Only works when saving as .json.

  • log (image_crawler_utils.log.Log, None) –

    Logging config.

    • You can use log=crawler_settings.log to make it the same as the CrawlerSettings you set up.

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None
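
How a file type might be derived from file_name and file_type can be sketched as follows. This is an illustration consistent with the equivalence stated above ('foo.json' versus 'foo' plus 'json'); resolve_file_type is a hypothetical helper, and its behaviour for conflicting inputs is an assumption.

```python
from pathlib import Path

SUPPORTED_TYPES = ("json", "pkl")


def resolve_file_type(file_name, file_type=None):
    """Return (file name with suffix, file type) for a save request."""
    suffix = Path(file_name).suffix.lower().lstrip(".")
    if file_type is not None:
        chosen = file_type.lower().lstrip(".")  # an explicit type takes priority
    elif suffix in SUPPORTED_TYPES:
        chosen = suffix                         # fall back to the file name suffix
    else:
        raise ValueError("file type cannot be determined from the file name")
    if chosen not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported file type: {chosen}")
    if suffix != chosen:
        file_name = f"{file_name}.{chosen}"     # append the missing suffix
    return file_name, chosen


from_suffix = resolve_file_type("foo.json")
from_argument = resolve_file_type("foo", "json")
```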

image_crawler_utils.utils.load_dataclass(dataclass_to_load, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Load the file containing a dataclass into the dataclass_to_load parameter.

The dataclass should be the same as the one you once saved.

Parameters:
  • dataclass_to_load (A dataclass) – The dataclass to be loaded into.

  • file_name (str) – Name of the file.

  • file_type (str, Optional) –

    If no suffix is found in file_name, designate the file type manually.

    • Setting the file_type parameter to json or pkl forces the function to treat the file as that type; leaving it blank makes the function determine the file type from file_name.

      • That is, load_dataclass(dataclass, 'foo.json') works the same as load_dataclass(dataclass, 'foo.json', 'json').

  • encoding (str) – Encoding of JSON file. Only works when loading from .json.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

Loaded dataclass_to_load, or None if failed.

Return type:

Any
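
A JSON save and load round trip for a config dataclass can be sketched with the standard library alone. The snippet does not call save_dataclass or load_dataclass; SketchCapacityConfig, save_json and load_json are hypothetical stand-ins, and load_json takes a class rather than an instance for brevity.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SketchCapacityConfig:
    """Hypothetical stand-in mirroring the fields of CapacityCountConfig."""
    image_num: Optional[int] = None
    capacity: Optional[float] = None
    page_num: Optional[int] = None


def save_json(config, path, encoding="UTF-8"):
    """Write the dataclass fields to a JSON file."""
    with open(path, "w", encoding=encoding) as f:
        json.dump(asdict(config), f, indent=4)


def load_json(config_class, path, encoding="UTF-8"):
    """Rebuild a dataclass of the given type from a JSON file."""
    with open(path, "r", encoding=encoding) as f:
        return config_class(**json.load(f))


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "capacity_config.json")
    original = SketchCapacityConfig(image_num=50, page_num=3)
    save_json(original, path)
    restored = load_json(SketchCapacityConfig, path)
```

JSON keeps the file readable, but it cannot hold values such as a callable headers function, which is the case where the .pkl format is suggested above.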