CrawlerSettings Class
A CrawlerSettings class contains basic information and configurations for Parsers and Downloaders. It can be imported from the image_crawler_utils package.
Always prepare a CrawlerSettings before you start crawling.
Detailed Information
- class image_crawler_utils.CrawlerSettings(capacity_count_config=<image_crawler_utils.utils.Empty object>, download_config=<image_crawler_utils.utils.Empty object>, debug_config=DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True), image_num=None, capacity=None, page_num=None, headers=None, proxies=None, thread_delay=5, fail_delay=3, randomize_delay=True, thread_num=5, timeout=30.0, max_download_time=None, retry_times=5, overwrite_images=True, detailed_console_log=False, extra_configs=None)[source]
Bases: object
A general framework of settings for running a crawler.
- Parameters:
capacity_count_config (image_crawler_utils.configs.CapacityCountConfig, None) –
Contains configs that restrict the number and total size of downloads.
If this parameter is used, the image_num, capacity and page_num parameters will be ignored.
download_config (image_crawler_utils.configs.DownloadConfig, None) –
Contains configs for downloading.
If this parameter is used, the headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_times and overwrite_images parameters will be ignored.
debug_config (image_crawler_utils.configs.DebugConfig, None) – Contains configs that define which types of messages are shown on the console.
image_num (int, None) – Number of images to be parsed / downloaded in total; None means no restriction.
capacity (float, None) – Total size of images (MB); None means no restriction.
page_num (int, None) – Number of gallery pages to detect images in total; None means no restriction.
headers (dict, Callable, None) – Headers settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.
proxies (dict, Callable, None) – Proxy settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.
thread_delay (float) – Waiting time (s) after thread start.
fail_delay (float) – Waiting time (s) after failing.
randomize_delay (bool) – Randomize thread_delay and fail_delay between 0 and their configured values.
thread_num (int) – Downloading thread num.
timeout (float, None) – Timeout for requests. None means no timeout.
max_download_time (float, None) – Maximum download time for an image. None means no limit.
retry_times (int) – Number of times to retry a download.
overwrite_images (bool) – Overwrite existing images.
detailed_console_log (bool) –
Log detailed information to the console.
When enabled, console logging always uses msg (the message written to log files), even if output_msg exists.
extra_configs (dict, None) – This optional dict is not used by any of the supported sites or crawling tasks; it is reserved for developing your own custom image crawler.
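The precedence rule for capacity_count_config and download_config can be illustrated with a small stand-alone sketch. It does not import image_crawler_utils: _Empty, CapacitySketch and resolve_capacity are hypothetical names, and the way an unsupplied config is detected is an assumption made only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


class _Empty:
    """Hypothetical sentinel standing in for 'argument not supplied'."""


@dataclass
class CapacitySketch:
    """Stand-in with the three documented fields of CapacityCountConfig."""
    image_num: Optional[int] = None
    capacity: Optional[float] = None
    page_num: Optional[int] = None


def resolve_capacity(capacity_count_config=_Empty(), image_num=None,
                     capacity=None, page_num=None):
    """Return the config object if one is given, else build one from the loose arguments."""
    if isinstance(capacity_count_config, CapacitySketch):
        # A config object wins; the loose arguments are ignored.
        return capacity_count_config
    return CapacitySketch(image_num=image_num, capacity=capacity, page_num=page_num)


from_config = resolve_capacity(CapacitySketch(image_num=10), image_num=99)
from_kwargs = resolve_capacity(image_num=99)
print(from_config.image_num, from_kwargs.image_num)
```

In this sketch the loose image_num=99 only takes effect when no config object is passed, which mirrors the behaviour described for the two config parameters.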
- classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]
Load CrawlerSettings from .pkl file.
- Parameters:
pkl_file (str, None) – Name of the pkl file. Suffix (.pkl) must be included.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
A CrawlerSettings class loaded from pkl file, or None if failed.
- browser_test(url, headless=True, stay_time=30)[source]
Test whether browser works normally.
ATTENTION: This function does not check the connectivity of the URL. Use connectivity_test() instead.
- Parameters:
url (str) – URL to load in the browser during the test.
headless (bool) –
Whether to run the chromedriver test without displaying a browser window.
Setting headless to False pops up the browser window and shows the whole process of loading the webpage, so you can quickly check whether the page loads correctly.
stay_time (float) – If headless is False, the window stays open for stay_time seconds.
- Returns:
A bool: True on a successful connection, False in all other cases.
- connectivity_test(url)[source]
Test internet connectivity, using the settings in download_config.
- copy(capacity_count_config=<image_crawler_utils.utils.Empty object>, download_config=<image_crawler_utils.utils.Empty object>, debug_config=<image_crawler_utils.utils.Empty object>, image_num=<image_crawler_utils.utils.Empty object>, capacity=<image_crawler_utils.utils.Empty object>, page_num=<image_crawler_utils.utils.Empty object>, headers=<image_crawler_utils.utils.Empty object>, proxies=<image_crawler_utils.utils.Empty object>, thread_delay=<image_crawler_utils.utils.Empty object>, fail_delay=<image_crawler_utils.utils.Empty object>, randomize_delay=<image_crawler_utils.utils.Empty object>, thread_num=5, timeout=<image_crawler_utils.utils.Empty object>, max_download_time=<image_crawler_utils.utils.Empty object>, retry_times=<image_crawler_utils.utils.Empty object>, overwrite_images=<image_crawler_utils.utils.Empty object>, extra_configs=<image_crawler_utils.utils.Empty object>)[source]
Generate a copy of a CrawlerSettings, with (optional) parameters changed.
- Parameters:
capacity_count_config (image_crawler_utils.configs.CapacityCountConfig, None) –
Contains configs that restrict the number and total size of downloads.
If this parameter is used, the image_num, capacity and page_num parameters will be ignored.
download_config (image_crawler_utils.configs.DownloadConfig, None) –
Contains configs for downloading.
If this parameter is used, the headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_times and overwrite_images parameters will be ignored.
debug_config (image_crawler_utils.configs.DebugConfig, None) – Contains configs that define which types of messages are shown on the console.
image_num (int, None) – Number of images to be parsed / downloaded in total; None means no restriction.
capacity (float, None) – Total size of images (MB); None means no restriction.
page_num (int, None) – Number of gallery pages to detect images in total; None means no restriction.
headers (dict, Callable, None) – Headers settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.
proxies (dict, Callable, None) – Proxy settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.
thread_delay (float) – Waiting time (s) after thread start.
fail_delay (float) – Waiting time (s) after failing.
randomize_delay (bool) – Randomize thread_delay and fail_delay between 0 and their configured values.
thread_num (int) – Downloading thread num.
timeout (float, None) – Timeout for requests. None means no timeout.
max_download_time (float, None) – Maximum download time for an image. None means no limit.
retry_times (int) – Number of times to retry a download.
overwrite_images (bool) – Overwrite existing images.
detailed_console_log (bool) –
Log detailed information to the console.
When enabled, console logging always uses msg (the message written to log files), even if output_msg exists.
extra_configs (dict, None) – This optional dict is not used by any of the supported sites or crawling tasks; it is reserved for developing your own custom image crawler.
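The copy() signature defaults almost every argument to an Empty object rather than None. A likely reason is that None is itself a meaningful value (for example, timeout=None means no timeout), so a separate sentinel is needed to express "keep the current value". The sketch below shows that pattern on a stand-alone dataclass; _Empty and SettingsSketch are hypothetical and are not the library's implementation.

```python
from dataclasses import dataclass, replace
from typing import Optional


class _Empty:
    """Hypothetical sentinel meaning 'keep the current value'."""


_EMPTY = _Empty()


@dataclass
class SettingsSketch:
    image_num: Optional[int] = None
    timeout: Optional[float] = 30.0

    def copy(self, image_num=_EMPTY, timeout=_EMPTY):
        """Return a new object; only arguments that were actually passed are changed."""
        passed = {"image_num": image_num, "timeout": timeout}
        changes = {key: value for key, value in passed.items() if value is not _EMPTY}
        return replace(self, **changes)


base = SettingsSketch(image_num=20, timeout=30.0)
no_timeout = base.copy(timeout=None)  # None is a real value here: "no timeout"
unchanged = base.copy()               # nothing passed, so nothing changes
```

The original object is left untouched, and passing None explicitly is distinguishable from passing nothing at all.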
- dill_base64_sha256_data()[source]
Return the serialized bytes of current CrawlerSettings, base64 encoded str of the bytes, and sha256 str of the bytes.
- Returns:
A dict like {“bytes”: dill.dumps() bytes, “base64”: base64 encoding of the dill.dumps() bytes, “sha256”: sha256 encoding of the base64 code}
- Return type:
dict
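The three values in the returned dict can be reproduced with the standard library, as in the rough sketch below. It uses pickle.dumps() as a stand-in for dill.dumps() so that it runs without extra packages, and it hashes the base64 string, which is one reading of the description above. The function name bytes_base64_sha256 is hypothetical.

```python
import base64
import hashlib
import pickle


def bytes_base64_sha256(obj) -> dict:
    """Serialize obj, then derive a base64 string and a SHA-256 hex digest from it."""
    raw = pickle.dumps(obj)                      # stand-in for dill.dumps(obj)
    b64 = base64.b64encode(raw).decode("ascii")  # text-safe form of the bytes
    digest = hashlib.sha256(b64.encode("ascii")).hexdigest()
    return {"bytes": raw, "base64": b64, "sha256": digest}


data = bytes_base64_sha256({"thread_num": 5, "timeout": 30.0})
# The base64 string is reversible; the digest is a fixed-length fingerprint.
restored = pickle.loads(base64.b64decode(data["base64"]))
```

Because the digest depends only on the serialized content, equal settings produce the same string, which is what makes it usable as a file name.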
- display_all_configs()[source]
Display all config info.
Dataclasses will be displayed in a neater way.
- save_to_pkl(pkl_file=None)[source]
Save the CrawlerSettings in a pkl file.
It is recommended to use the default file name, which uses the sha256 encoded string generated by dill_base64_sha256_data().
- Returns:
(Saved file name, Absolute path of the saved file), or None if failed.
- set_logging_file(log_file, logging_level=10)[source]
Set the file to be logged into.
It is recommended to add a logging file when running the crawler, because the messages displayed on the console are simplified and usually incomplete.
ATTENTION: You cannot set a logging file when creating a CrawlerSettings. Logging files are set only through this function.
- Parameters:
log_file (str) – Output name for the logging file. The suffix (.json) is optional. Setting it to None (default) outputs no file.
logging_level (int) –
Logging level of the LOGGING FILE. Select from logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL (import the logging library first, or use the level word as a string directly).
ATTENTION: It is independent of the level of logging to the console! The latter is controlled by the debug_config parameter, which in turn does not affect logging to files.
- Returns:
The changed CrawlerSettings itself.
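The independence of the file level from the console level follows the usual pattern of one logger with two handlers, each filtering on its own. The sketch below shows this with the standard logging module only; build_logger and the logger name are hypothetical, and the file here is plain text rather than the library's own log format. The default logging_level=10 equals logging.DEBUG.

```python
import logging
import os
import tempfile


def build_logger(log_path, file_level=logging.DEBUG, console_level=logging.INFO):
    """Attach one console handler and one file handler, each with its own level."""
    logger = logging.getLogger("crawler_sketch")
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.propagate = False
    logger.handlers.clear()

    console = logging.StreamHandler()
    console.setLevel(console_level)
    logger.addHandler(console)

    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(file_level)
    logger.addHandler(file_handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "crawler.log")
logger = build_logger(log_path, file_level=10, console_level=logging.WARNING)
logger.debug("written to the file only")
logger.warning("written to the file and the console")
for handler in logger.handlers:
    handler.flush()
```

Raising the console level hides the debug message on screen while the file still records it, which is why a logging file is the more complete record.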
Configs, and how to Save and Load them
When setting up a CrawlerSettings class, the parameters capacity_count_config, download_config and debug_config accept instances of CapacityCountConfig, DownloadConfig and DebugConfig, respectively. You can also save these configs and load them for later use.
- class image_crawler_utils.configs.CapacityCountConfig(image_num=None, capacity=None, page_num=None)[source]
Bases: object
Contains config for restrictions on the number of images, total size or number of webpages.
- capacity: float | None = None
Total size of images (bytes) to be downloaded.
Default is None, meaning no restriction.
When the capacity is reached, no new downloading threads will be added. However, downloading threads that have already started will not be affected, so the actual total image size may exceed the capacity.
- image_num: int | None = None
The number of images to be parsed from websites or downloaded.
Default is None, meaning no restriction.
Mostly used only by the Downloader to control the number of images to be downloaded, but some Parsers may also use this parameter.
- page_num: int | None = None
Number of gallery pages to detect images in total.
Default is None, meaning no restriction.
Some websites, like Twitter / X, do not use gallery pages or JSON-API pages (Image Crawler Utils scrolls webpages to get Twitter / X images), so this parameter is not used for them.
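The note on capacity says the limit is checked only when new downloads are started, so the final total can overshoot it. The toy simulation below illustrates why. It is not the library's scheduler: simulate_downloads is a hypothetical function, downloads are started in fixed batches for simplicity, and the sizes are unitless numbers.

```python
def simulate_downloads(sizes, capacity, thread_num):
    """Start downloads in batches of thread_num until the capacity is reached."""
    downloaded = 0.0
    pending = list(sizes)
    while pending and downloaded < capacity:
        # The capacity check happens only before new downloads are started.
        batch, pending = pending[:thread_num], pending[thread_num:]
        # Downloads that have started always run to completion.
        downloaded += sum(batch)
    return downloaded


total = simulate_downloads([4, 4, 4, 4, 4, 4], capacity=10, thread_num=2)
print(total)
```

With a capacity of 10 the second batch still starts (8 is below the limit) and finishes, so the total ends above the limit. Treat capacity as a stop-adding threshold rather than a hard cap.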
- class image_crawler_utils.configs.DownloadConfig(headers=None, proxies=None, thread_delay=5, fail_delay=3, randomize_delay=True, thread_num=5, timeout=10, max_download_time=None, retry_times=5, overwrite_images=True)[source]
Bases: object
Contains config for downloading.
- fail_delay: float = 3
Delaying time (seconds) after every failure.
Both fetching webpages and downloading images will use this parameter.
Some Parsers may use different parameters to control their delaying time when a failure happens.
- headers: dict | Callable | None = None
Headers of the requests.
Both fetching webpages and downloading images will use this parameter.
This only works when the request is sent by requests (like requests.get()). For webpages loaded by browsers, this parameter is ignored.
Basically, this contains the user agent of the requests.
ATTENTION: Not all user agents are supported by the websites you are accessing!
- max_download_time: float | None = None
If no new data is fetched within max_download_time seconds while downloading an image, the download is treated as a failure.
Only downloading images will use this parameter.
Default is None, meaning no restriction.
- overwrite_images: bool = True
Overwrite existing images when downloading.
Only downloading images will use this parameter.
- proxies: dict | Callable | None = None
Proxies used by the crawler.
Both fetching webpages and downloading images will use this parameter.
Both requests and browsers use these proxies, but the structure should be in a requests-acceptable form like:
- HTTP type: {'http': '127.0.0.1:7890'}
- HTTPS type: {'https': '127.0.0.1:7890'}
- SOCKS type: {'https': 'socks5://127.0.0.1:7890'}
If you input 'https' proxies, 'http' proxies will be automatically generated.
ATTENTION: Using usernames and passwords is currently not supported.
- randomize_delay: bool = True
Randomize thread_delay and fail_delay between 0 and their values.
For example, thread_delay=5.0 and randomize_delay=True will cause thread_delay to take a random value between 0 and 5.0 every time.
- property result_fail_delay: float
Generate the fail delay. If the randomize_delay attribute is set to True, the delay will be randomized between 0 and fail_delay for every usage.
- property result_headers: dict | None
Generate the headers. If the headers attribute is callable, it will be called for every usage.
- property result_proxies: dict | None
Generate the proxies. If the proxies attribute is callable, it will be called for every usage.
- property result_thread_delay: float
Generate the thread delay. If the randomize_delay attribute is set to True, the delay will be randomized between 0 and thread_delay for every usage.
- retry_times: int = 5
Total times of retrying to fetch a webpage / download an image.
Both fetching webpages and downloading images will use this parameter.
- thread_delay: float = 5
Delaying time (seconds) before every thread starts.
Both fetching webpages and downloading images will use this parameter.
Some Parsers may use different parameters to control their delaying time.
- thread_num: int = 5
Total number of threads.
Both fetching webpages and downloading images will use this parameter.
Some Parsers do not use threading to fetch pages; for them, this parameter is not used.
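The result_* properties above resolve a stored setting into the value used for one request. The stand-alone sketch below imitates two of them to show the idea: a callable headers attribute is called again on every access, and a randomized delay is drawn afresh each time. DownloadConfigSketch and rotating_headers are hypothetical, the user agent strings are placeholders, and random.uniform is an assumption about how the randomization might be done.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class DownloadConfigSketch:
    headers: Union[dict, Callable, None] = None
    thread_delay: float = 5
    randomize_delay: bool = True

    @property
    def result_headers(self) -> Optional[dict]:
        # A callable is evaluated on every access; a dict or None is returned as is.
        return self.headers() if callable(self.headers) else self.headers

    @property
    def result_thread_delay(self) -> float:
        # With randomize_delay, each access yields a new value in [0, thread_delay].
        if self.randomize_delay:
            return random.uniform(0, self.thread_delay)
        return self.thread_delay


_agents = itertools.cycle(["agent-a/1.0", "agent-b/1.0"])  # placeholder values


def rotating_headers() -> dict:
    """Return headers with the next user agent in the rotation."""
    return {"User-Agent": next(_agents)}


rotating = DownloadConfigSketch(headers=rotating_headers)
first, second = rotating.result_headers, rotating.result_headers
fixed = DownloadConfigSketch(thread_delay=5.0, randomize_delay=False)
```

Passing a function instead of a dict is therefore a simple way to vary headers or proxies between requests without rebuilding the config.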
- class image_crawler_utils.configs.DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True)[source]
Bases: object
Contains config for whether to display each level of debugging messages in the console.
Default is set to the “info” level.
- classmethod level(level_str)[source]
Create a DebugConfig that is set to display messages at or above the given level. For example, “warning” will display warning, error and critical messages.
- Parameters:
level_str (str) –
Must be one of (from lower to higher) “debug”, “info”, “warning”, “error”, “critical” or “silenced”.
Setting a logging level displays messages at and above this level. For example, DebugConfig.level("warning") will only display messages with “warning”, “error” and “critical” levels.
Setting the level to “silenced” hides all messages except those generated by the progress bars.
- Returns:
Created DebugConfig.
- set_level(level_str)[source]
Set the current DebugConfig to display messages at or above the given level. For example, “warning” will display warning, error and critical messages.
- Parameters:
level_str (str) –
Must be one of (from lower to higher) “debug”, “info”, “warning”, “error”, “critical” or “silenced”.
Setting a logging level displays messages at and above this level. For example, .set_level("warning") will only display messages with “warning”, “error” and “critical” levels.
Setting the level to “silenced” hides all messages except those generated by the progress bars.
- show_critical: bool = True
Display critical-level messages.
Default set to True.
Includes messages about errors that interrupt the crawler. Usually a Python error will be raised when critical errors happen.
- show_debug: bool = False
Display debug-level messages.
- show_error: bool = True
Display error-level messages.
Default set to True.
Includes messages about errors that may affect the final results but do not interrupt the crawler.
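The threshold behaviour of level() and set_level() amounts to mapping one level name onto the five show_* flags. The stand-alone sketch below imitates that mapping. DebugConfigSketch is a hypothetical stand-in, and both the lowercasing of the name and set_level() returning the object itself are assumptions made for the illustration.

```python
from dataclasses import dataclass

LEVELS = ("debug", "info", "warning", "error", "critical", "silenced")


@dataclass
class DebugConfigSketch:
    show_debug: bool = False
    show_info: bool = True
    show_warning: bool = True
    show_error: bool = True
    show_critical: bool = True

    def set_level(self, level_str: str) -> "DebugConfigSketch":
        """Enable every flag at or above level_str and disable the ones below it."""
        threshold = LEVELS.index(level_str.lower())  # ValueError for unknown names
        for position, name in enumerate(LEVELS[:-1]):
            setattr(self, f"show_{name}", position >= threshold)
        return self

    @classmethod
    def level(cls, level_str: str) -> "DebugConfigSketch":
        """Build a new config already set to the given level."""
        return cls().set_level(level_str)


warning_only = DebugConfigSketch.level("warning")
silenced = DebugConfigSketch.level("silenced")
```

Since “silenced” sits above “critical” in the ordering, it turns every flag off, and “info” reproduces the documented defaults.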
Saving and Loading Configs
Use the two functions below to save and load these configs.
- image_crawler_utils.utils.save_dataclass(dataclass, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]
Save the dataclass parameter into a file.
- Parameters:
dataclass (A dataclass) – The dataclass to be saved.
file_name (str) – Name of file. Suffix (.json / .pkl) is optional.
file_type (str, Optional) –
If no suffix is found in file_name, designate the file type manually.
Setting the file_type parameter to json or pkl forces the function to save the dataclass (config) as this type; leaving it blank makes the function determine the file type from file_name.
That is, save_dataclass(dataclass, 'foo.json') works the same as save_dataclass(dataclass, 'foo', 'json').
.json is suggested when your dataclasses (configs) do not include objects that cannot be JSON-serialized (e.g. a function), while a serialized .pkl file supports most data types but is not human-readable.
encoding (str) – Encoding of the JSON file. Only works when saving as .json.
log (image_crawler_utils.log.Log, None) –
Logging config.
You can use log=crawler_settings.log to make it the same as the CrawlerSettings you set up.
- Returns:
(Saved file name, Absolute path of the saved file), or None if failed.
- image_crawler_utils.utils.load_dataclass(dataclass_to_load, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]
Load the file containing a dataclass into the dataclass_to_load parameter.
The dataclass should be the same as the one you once saved.
- Parameters:
dataclass_to_load (A dataclass) – The dataclass to be loaded.
file_name (str) – Name of the file.
file_type (str, Optional) –
If no suffix is found in file_name, designate the file type manually.
Setting the file_type parameter to json or pkl forces the function to treat the file as this type; leaving it blank makes the function determine the file type from file_name.
That is, load_dataclass(dataclass, 'foo.json') works the same as load_dataclass(dataclass, 'foo.json', 'json').
encoding (str) – Encoding of the JSON file. Only works when loading from .json.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
Loaded dataclass_to_load, or None if failed.
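The file-type rule shared by save_dataclass() and load_dataclass() (an explicit file_type wins, otherwise the suffix of file_name decides) can be sketched with the standard library for the JSON case. The code below is an illustration only: CapacitySketch, json_path, save_json and load_json are hypothetical names, and load_json takes a class rather than mirroring the real function's argument.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CapacitySketch:
    image_num: Optional[int] = None
    capacity: Optional[float] = None
    page_num: Optional[int] = None


def json_path(file_name, file_type=None):
    """Resolve the target path: an explicit file_type wins, else the suffix decides."""
    suffix = os.path.splitext(file_name)[1].lower()
    kind = (file_type or suffix.lstrip(".")).lower()
    if kind != "json":
        raise ValueError("this sketch only handles JSON")
    return file_name if suffix == ".json" else file_name + ".json"


def save_json(obj, file_name, file_type=None, encoding="UTF-8"):
    """Write the dataclass fields as JSON and return (file name, absolute path)."""
    path = json_path(file_name, file_type)
    with open(path, "w", encoding=encoding) as f:
        json.dump(asdict(obj), f, indent=4)
    return os.path.basename(path), os.path.abspath(path)


def load_json(cls, file_name, file_type=None, encoding="UTF-8"):
    """Rebuild a dataclass of type cls from a JSON file written by save_json."""
    with open(json_path(file_name, file_type), encoding=encoding) as f:
        return cls(**json.load(f))


folder = tempfile.mkdtemp()
config = CapacitySketch(image_num=20, capacity=512.0)
name_a, path_a = save_json(config, os.path.join(folder, "foo.json"))
name_b, path_b = save_json(config, os.path.join(folder, "foo"), "json")
restored = load_json(CapacitySketch, path_a)
```

Both calls write the same file. A config that holds a function (for example, callable headers) would fail in json.dump, which is the case where the .pkl format is the better choice.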