image_crawler_utils package
- class image_crawler_utils.Cookies(cookies=None)[source]
Bases: object
Convert the format of cookies between selenium, requests and string.
Use Cookies(cookies_from_certain_source) or Cookies.load_from_json() to create a Cookies instance.
Use .cookies_nodriver / .cookies_selenium / .cookies_dict / .cookies_string to get the cookies in a suitable format.
- Parameters:
cookies (list, dict, str, None) –
Cookies generated from a string, a dict (requests), or a list (selenium or nodriver).
Leaving it blank (like Cookies()) creates an empty Cookies, whose .is_none() returns True.
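As a rough illustration of the formats this class converts between, the sketch below turns a cookie string into a requests-style dict and a selenium-style list of dicts using only the standard library. The helper names (string_to_dict, dict_to_selenium_list, dict_to_string) are hypothetical and are not part of image_crawler_utils; the library's own conversion may differ in detail.

```python
def string_to_dict(cookie_string: str) -> dict:
    """Parse 'a=1; b=2' into {'a': '1', 'b': '2'} (hypothetical helper)."""
    result = {}
    for part in cookie_string.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # Skip empty or malformed fragments
        name, value = part.split("=", 1)
        result[name.strip()] = value.strip()
    return result


def dict_to_selenium_list(cookie_dict: dict) -> list:
    """Build a list of dicts with 'name' and 'value' keys, the minimal
    shape of what selenium's get_cookies() returns (hypothetical helper)."""
    return [{"name": k, "value": v} for k, v in cookie_dict.items()]


def dict_to_string(cookie_dict: dict) -> str:
    """Join a dict back into a 'a=1; b=2' string (hypothetical helper)."""
    return "; ".join(f"{k}={v}" for k, v in cookie_dict.items())


as_dict = string_to_dict("session=abc123; theme=dark")
as_list = dict_to_selenium_list(as_dict)
as_string = dict_to_string(as_dict)
```

In practice, passing any of these three shapes to Cookies(...) is what the Parameters entry above describes.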
- classmethod load_from_json(json_file, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]
Load the Cookies from a json file.
ONLY WORKS IF the info can be JSON serialized.
- Parameters:
json_file (str) – Name / path of json file. Suffix (.json) must be included.
encoding (str) – Encoding of JSON file.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
The Cookies, or None if failed.
- Return type:
Cookies | None
- is_none()[source]
Check whether the Cookies is empty (created by None, “”, etc.).
- Returns:
A bool, telling whether the Cookies is empty.
- Return type:
bool
- save_to_json(json_file, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]
Save the Cookies into a json file.
- update_nodriver_cookies(old_nodriver_cookies)[source]
Update nodriver-form cookies. NOT SUGGESTED TO BE USED DIRECTLY.
For every cookie in the input that has the same name as one in the Cookies class, replace its values with those from the Cookies class.
Also add the cookies in the Cookies class that do not exist in the input cookies.
- Parameters:
old_nodriver_cookies (list[nodriver.cdp.network.Cookie]) – Cookies from nodriver.
- Returns:
New nodriver cookies (a list[nodriver.cdp.network.Cookie]).
- update_selenium_cookies(old_selenium_cookies)[source]
Update selenium-form cookies.
For every cookie in the input that has the same name as one in the Cookies class, replace its values with those from the Cookies class.
Also add the cookies in the Cookies class that do not exist in the input cookies.
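The update rule described above can be sketched with plain lists of dicts. This is a minimal illustration, not the library's implementation: merge_selenium_cookies is a hypothetical name, and the sketch assumes that only the "value" field of a matching cookie is replaced while its other fields are kept.

```python
def merge_selenium_cookies(old_cookies: list, new_cookies: list) -> list:
    """Merge two selenium-style cookie lists by cookie name (hypothetical helper)."""
    new_by_name = {cookie["name"]: cookie for cookie in new_cookies}
    merged = []
    seen = set()
    for cookie in old_cookies:
        name = cookie["name"]
        seen.add(name)
        updated = dict(cookie)  # Copy so the input list is not modified
        if name in new_by_name:
            # Same name: take the value from the new cookies (assumption: value only)
            updated["value"] = new_by_name[name]["value"]
        merged.append(updated)
    for name, cookie in new_by_name.items():
        if name not in seen:
            # Cookie that exists only in the new cookies: add it
            merged.append(dict(cookie))
    return merged


old = [{"name": "a", "value": "1", "domain": "x.com"}, {"name": "b", "value": "2"}]
new = [{"name": "a", "value": "9"}, {"name": "c", "value": "3"}]
merged = merge_selenium_cookies(old, new)
```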
- cookies_dict: dict | None
Cookies in dict form. Mostly for requests module usage.
This form of cookies can be generated by requests-related functions and classes, or by other cookie functions that generate a dict, etc.
A generation example:
    import requests
    from image_crawler_utils import Cookies

    session = requests.Session()
    # Some process that adds cookies to session
    requests_cookies = session.cookies.get_dict()  # A dict
    cookies = Cookies(requests_cookies)
- cookies_nodriver: list[Cookie] | None
Cookies in nodriver form.
This form of cookies can be generated by nodriver-related functions and classes, etc.
A generation example:
    import nodriver
    from image_crawler_utils.utils import set_up_nodriver_browser
    from image_crawler_utils import Cookies

    async def nodriver_func():
        browser = await set_up_nodriver_browser()
        tab = await browser.get('https://foo.bar.com')
        # Some other process
        nodriver_cookies = await browser.cookies.get_all()
        return nodriver_cookies

    nodriver_cookies = nodriver.loop().run_until_complete(nodriver_func())
    cookies = Cookies(nodriver_cookies)
- cookies_selenium: list[dict] | None
Cookies in selenium form.
This form of cookies can be generated by selenium-related functions and classes, etc.
A generation example:
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from image_crawler_utils import Cookies

    chrome_driver_path = '/path/to/chromedriver'
    # Selenium 4 takes the driver path through a Service object
    chrome_browser = webdriver.Chrome(service=Service(executable_path=chrome_driver_path))
    chrome_browser.get('https://foo.bar.com')
    # Some other process
    selenium_cookies = chrome_browser.get_cookies()  # A list of dicts
    cookies = Cookies(selenium_cookies)
- class image_crawler_utils.CrawlerSettings(capacity_count_config=<image_crawler_utils.utils.Empty object>, download_config=<image_crawler_utils.utils.Empty object>, debug_config=DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True), image_num=None, capacity=None, page_num=None, headers=None, proxies=None, thread_delay=5, fail_delay=3, randomize_delay=True, thread_num=5, timeout=30.0, max_download_time=None, retry_times=5, overwrite_images=True, detailed_console_log=False, extra_configs=None)[source]
Bases: object
A general framework of settings for running a crawler.
- Parameters:
capacity_count_config (image_crawler_utils.configs.CapacityCountConfig, None) –
Contains configs that restrict the number and total size of downloads.
If this parameter is used, the image_num, capacity and page_num parameters will be ignored.
download_config (image_crawler_utils.configs.DownloadConfig, None) –
Contains configs about downloading parameters.
If this parameter is used, the headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_times and overwrite_images parameters will be ignored.
debug_config (image_crawler_utils.configs.DebugConfig, None) – Contains configs that define which types of messages are shown on the console.
image_num (int, None) – Number of images to be parsed / downloaded in total; None means no restriction.
capacity (float, None) – Total size of images (MB); None means no restriction.
page_num (int, None) – Number of gallery pages to detect images in total; None means no restriction.
headers (dict, Callable, None) – Headers settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.
proxies (dict, Callable, None) – Proxy settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.
thread_delay (float) – Waiting time (s) after thread start.
fail_delay (float) – Waiting time (s) after failing.
randomize_delay (bool) – Randomize delay time between 0 and delay_time.
thread_num (int) – Downloading thread num.
timeout (float, None) – Timeout for requests. Setting it to None means no timeout.
max_download_time (float, None) – Maximum download time for an image. Setting it to None means no timeout.
retry_times (int) – Times of retrying to download.
overwrite_images (bool) – Overwrite existing images.
detailed_console_log (bool) –
Log detailed information to the console.
When enabled, console logging always shows msg (the message logged into files) even if output_msg exists.
extra_configs (dict, None) – This optional dict is not used by any of the supported sites and crawling tasks; it is reserved for developing your custom image crawler.
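The headers and proxies parameters above accept either a dict or a function that returns a dict and is called at every usage. A minimal sketch of that pattern follows; resolve_setting and rotating_headers are hypothetical names, and the assumption that the function takes no arguments is not confirmed by this reference.

```python
import itertools


def resolve_setting(setting):
    """Return a dict from a dict, a callable, or None (hypothetical helper)."""
    if setting is None:
        return None
    if callable(setting):
        # Assumption: the function takes no arguments and returns a dict
        return setting()
    return setting


_user_agents = itertools.cycle(["agent-A", "agent-B"])


def rotating_headers() -> dict:
    """Produce different headers each time it is called."""
    return {"User-Agent": next(_user_agents)}


first = resolve_setting(rotating_headers)
second = resolve_setting(rotating_headers)
static = resolve_setting({"User-Agent": "agent-fixed"})
```

Because the function is evaluated on every use, it is a natural place to rotate user agents or proxies between requests.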
- classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]
Load CrawlerSettings from .pkl file.
- Parameters:
pkl_file (str, None) – Name of the pkl file. Suffix (.pkl) must be included.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
A CrawlerSettings class loaded from pkl file, or None if failed.
- Return type:
- browser_test(url, headless=True, stay_time=30)[source]
Test whether the browser works normally.
ATTENTION: This function DOES NOT check the connectivity of the URL. Use connectivity_test() instead.
- Parameters:
url (str) – Test connectivity using this URL.
headless (bool) –
Whether to hide the browser window when testing chromedriver.
Setting headless to False pops up the browser window and displays the whole process of loading the webpage, so you can have a quick glimpse of whether the page is correctly loaded.
stay_time (float) – If not headless, the window will stay open for stay_time seconds.
- Returns:
A bool. A successful connection returns True; all other cases return False.
- copy(capacity_count_config=<image_crawler_utils.utils.Empty object>, download_config=<image_crawler_utils.utils.Empty object>, debug_config=<image_crawler_utils.utils.Empty object>, image_num=<image_crawler_utils.utils.Empty object>, capacity=<image_crawler_utils.utils.Empty object>, page_num=<image_crawler_utils.utils.Empty object>, headers=<image_crawler_utils.utils.Empty object>, proxies=<image_crawler_utils.utils.Empty object>, thread_delay=<image_crawler_utils.utils.Empty object>, fail_delay=<image_crawler_utils.utils.Empty object>, randomize_delay=<image_crawler_utils.utils.Empty object>, thread_num=5, timeout=<image_crawler_utils.utils.Empty object>, max_download_time=<image_crawler_utils.utils.Empty object>, retry_times=<image_crawler_utils.utils.Empty object>, overwrite_images=<image_crawler_utils.utils.Empty object>, extra_configs=<image_crawler_utils.utils.Empty object>)[source]
Generate a copy of a CrawlerSettings, with (optional) parameters changed.
- Parameters:
capacity_count_config (image_crawler_utils.configs.CapacityCountConfig, None) –
Contains configs that restrict the number and total size of downloads.
If this parameter is used, the image_num, capacity and page_num parameters will be ignored.
download_config (image_crawler_utils.configs.DownloadConfig, None) –
Contains configs about downloading parameters.
If this parameter is used, the headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_times and overwrite_images parameters will be ignored.
debug_config (image_crawler_utils.configs.DebugConfig, None) – Contains configs that define which types of messages are shown on the console.
image_num (int, None) – Number of images to be parsed / downloaded in total; None means no restriction.
capacity (float, None) – Total size of images (MB); None means no restriction.
page_num (int, None) – Number of gallery pages to detect images in total; None means no restriction.
headers (dict, Callable, None) – Headers settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.
proxies (dict, Callable, None) – Proxy settings. Can be a function (should return a dict), a dict or nothing. If it is a function, it will be called at every usage.
thread_delay (float) – Waiting time (s) after thread start.
fail_delay (float) – Waiting time (s) after failing.
randomize_delay (bool) – Randomize delay time between 0 and delay_time.
thread_num (int) – Downloading thread num.
timeout (float, None) – Timeout for requests. Setting it to None means no timeout.
max_download_time (float, None) – Maximum download time for an image. Setting it to None means no timeout.
retry_times (int) – Times of retrying to download.
overwrite_images (bool) – Overwrite existing images.
detailed_console_log (bool) –
Log detailed information to the console.
When enabled, console logging always shows msg (the message logged into files) even if output_msg exists.
extra_configs (dict, None) – This optional dict is not used by any of the supported sites and crawling tasks; it is reserved for developing your custom image crawler.
- dill_base64_sha256_data()[source]
Return the serialized bytes of the current CrawlerSettings, the base64-encoded str of the bytes, and the sha256 str of the bytes.
- Returns:
A dict like {“bytes”: dill.dumps() bytes, “base64”: base64 encoding of the dill.dumps() bytes, “sha256”: sha256 encoding of the base64 code}
- Return type:
dict
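The shape of the returned dict can be sketched with the standard library. This is only an illustration under stated assumptions: pickle stands in for dill so that the example has no third-party dependency, bytes_base64_sha256 is a hypothetical name, and the sha256 is computed over the base64 text as the description above suggests.

```python
import base64
import hashlib
import pickle


def bytes_base64_sha256(obj) -> dict:
    """Serialize obj and derive its base64 and sha256 strings (hypothetical helper)."""
    raw = pickle.dumps(obj)  # Stand-in for dill.dumps()
    b64 = base64.b64encode(raw).decode("ascii")
    digest = hashlib.sha256(b64.encode("ascii")).hexdigest()
    return {"bytes": raw, "base64": b64, "sha256": digest}


data = bytes_base64_sha256({"thread_num": 5, "timeout": 30.0})
```

The hex digest is a fixed-length string, which is why it is convenient as a default file name for saved settings.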
- display_all_configs()[source]
Display all config info.
Dataclasses will be displayed in a neater way.
- save_to_pkl(pkl_file=None)[source]
Save the CrawlerSettings in a pkl file.
It is recommended to use the default file name, which uses the sha256 encoded string generated by dill_base64_sha256_data().
- Parameters:
- Returns:
(Saved file name, Absolute path of the saved file), or None if failed.
- Return type:
- set_logging_file(log_file, logging_level=10)[source]
Set the file to be logged into.
It is recommended to add a logging file when running the crawler, as the message displayed on the console is simplified and usually not complete.
PAY ATTENTION: You cannot set logging files when creating a class instance. Setting logging files is controlled by this function.
- Parameters:
log_file (str) – Output name for the logging file. Suffix (.json) is optional. Setting it to None (default) outputs no file.
logging_level (int) –
Set the logging level of the LOGGING FILE. Select from: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL (import the logging library first, or use the word string directly).
ATTENTION: It is independent of the level of logging to the console! The latter is controlled by the debug_config parameter, which in turn does not affect logging into files.
- Returns:
The changed CrawlerSettings itself.
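The default logging_level=10 is the numeric value of logging.DEBUG. The idea of a file level that is independent of the console level can be sketched with the standard logging module alone; this is not the library's internal Log class, and the logger name and file name below are placeholders.

```python
import logging
import os
import tempfile

logger = logging.getLogger("crawler_sketch")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # Keep the example isolated from the root logger

# Console handler: only warnings and above are shown
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)

# File handler: everything from DEBUG (10) upwards is written
log_path = os.path.join(tempfile.mkdtemp(), "crawler.log")
file_handler = logging.FileHandler(log_path, encoding="utf-8")
file_handler.setLevel(logging.DEBUG)

logger.addHandler(console_handler)
logger.addHandler(file_handler)

logger.debug("only in the file")
logger.warning("in the file and on the console")

# Detach and close the file handler before reading the file back
logger.removeHandler(file_handler)
file_handler.close()
with open(log_path, encoding="utf-8") as f:
    logged_lines = f.read().splitlines()
```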
- class image_crawler_utils.Downloader(image_info_list, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, store_path='./', image_info_filter=True, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''))[source]
Bases: object
Downloads images using threading.
- Parameters:
crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Downloader.
image_info_list (image_crawler_utils.ImageInfo) – A list of ImageInfo.
store_path (str) –
Path to store images, or a list of storage paths respectively for every image.
Default is the current working directory.
If it is set to an iterable, its length should be the same as that of image_info_list.
image_info_filter (callable, bool) –
A callable function used to filter the images in the list of ImageInfo.
The image_info_filter function should accept exactly 1 argument of ImageInfo type and return True (download this image) or False (do not download this image), like:
    def filter_func(image_info: ImageInfo) -> bool:
        # Meet the conditions
        return True
        # Do not meet the conditions
        return False
If the function has other parameters, use lambda to bind them:
    image_info_filter=lambda info: filter_func(info, param1, param2, ...)
If you want to download all images in the ImageInfo list, set image_info_filter to True.
TIPS: If you want to search images with complex restrictions that the image station sites may not support (e.g. images with many keywords and restrictions on the ratio between width and height), you can simplify the query with some keywords to get all images with Parsers, and filter them with your custom image_info_filter function.
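A filter on the width-to-height ratio, as mentioned in the tip above, might look like the sketch below. ImageInfoStandIn is a stand-in dataclass so the example runs without the library, and the "width" and "height" keys in info are assumptions; the actual info structure depends on the Parser of each site.

```python
from dataclasses import dataclass, field


@dataclass
class ImageInfoStandIn:
    """Stand-in with the same url / name / info fields as ImageInfo."""
    url: str
    name: str
    info: dict = field(default_factory=dict)


def ratio_filter(image_info, min_ratio: float, max_ratio: float) -> bool:
    """Keep images whose width / height ratio lies within the given range."""
    width = image_info.info.get("width")    # Assumed key
    height = image_info.info.get("height")  # Assumed key
    if not width or not height:
        return False  # Size unknown: do not download
    return min_ratio <= width / height <= max_ratio


# Bind the extra parameters with lambda so the filter accepts 1 argument
image_info_filter = lambda info: ratio_filter(info, 1.3, 2.0)

candidates = [
    ImageInfoStandIn("https://foo.bar.com/1", "wide.png", {"width": 1920, "height": 1080}),
    ImageInfoStandIn("https://foo.bar.com/2", "tall.png", {"width": 1080, "height": 1920}),
    ImageInfoStandIn("https://foo.bar.com/3", "unknown.png"),
]
kept = [image.name for image in candidates if image_info_filter(image)]
```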
cookies (image_crawler_utils.Cookies, str, dict, list, None) –
Cookies used to access images from a website.
None means no cookies and works the same as Cookies().
Leaving this parameter blank works the same as None / Cookies().
TIPS: You can add the corresponding cookies to the Downloader if some image URLs are only accessible with an account. For example, if you have saved Pixiv and Twitter / X cookies respectively in Pixiv_cookies.json and Twitter_cookies.json, then you can use cookies=Cookies.load_from_json("Pixiv_cookies.json") + Cookies.load_from_json("Twitter_cookies.json") to add both cookies to the Downloader.
- classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]
Load the Downloader from .pkl file.
- Parameters:
pkl_file (str, None) – Name of the pkl file.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
A Downloader class loaded from pkl file, or None if failed.
- Return type:
- display_all_configs()[source]
Display all config info. Dataclasses will be displayed in a neater way.
- run()[source]
Run the Threading Downloader Object.
- Returns:
(Total size of image downloaded, Succeeded ImageInfo list, Failed ImageInfo list, Skipped ImageInfo list)
Total size of image downloaded: An int denoting the total size (in bytes) of images downloaded.
Succeeded ImageInfo list: A list of ImageInfo containing successfully downloaded images.
Failed ImageInfo list: A list of ImageInfo containing images that failed to download.
Images not downloaded because the capacity defined in image_crawler_utils.CrawlerSettings was reached will be classified into this list.
Skipped ImageInfo list: A list of ImageInfo containing images skipped.
Images filtered out by image_info_filter, images not downloaded due to the restriction of image_num in image_crawler_utils.CrawlerSettings, and images skipped because they already exist when overwrite_images in DownloadConfig is set to False will be classified into this list.
- Return type:
tuple[int, list[ImageInfo], list[ImageInfo], list[ImageInfo]]
- class image_crawler_utils.ImageInfo(url, name, info=<factory>, backup_urls=<factory>)[source]
Bases: object
A class consisting of the image URL, name, info and backup URLs.
Can be used to download images and write results to files.
- backup_urls: Iterable[str]
When downloading from .url fails, try downloading from the URLs in the .backup_urls list.
- info: dict
A dict containing information about the image.
info does not affect the Downloader directly. It only takes effect if you set the image_info_filter parameter in the Downloader class.
Different sites may have different info structures, which are defined respectively by their Parsers.
ATTENTION: If you define your own info structure, please ENSURE it can be JSON-serialized (e.g. the values of the dict should be int, float, str, list, dict, etc.) in order to make it compatible with save_image_infos() and load_image_infos().
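One simple way to check a custom info structure before saving is to try serializing it with the standard json module. The helper name is_json_serializable is hypothetical and not part of the library.

```python
import json


def is_json_serializable(info: dict) -> bool:
    """Return True if json.dumps() accepts the structure (hypothetical helper)."""
    try:
        json.dumps(info)
    except (TypeError, ValueError):
        return False
    return True


good_info = {"tags": ["landscape", "sky"], "width": 1920, "score": 4.5}
bad_info = {"tags": {"landscape", "sky"}}  # A set cannot be JSON-serialized

good_ok = is_json_serializable(good_info)
bad_ok = is_json_serializable(bad_info)
```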
- class image_crawler_utils.KeywordParser(station_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, standard_keyword_string=None, keyword_string=None, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''), accept_empty=False)[source]
Bases: Parser
A Parser for fetching results from keyword searching.
- Parameters:
station_url (str) –
The URL of the main page of a website.
This parameter works when several websites use the same structure. For example, https://yande.re/ and https://konachan.com/ both use Moebooru to build their websites, and this parameter must be filled to deal with these sites respectively.
For websites like https://www.pixiv.net/, as no other website uses its structure, this parameter has already been initialized and does not need to be filled.
crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.
standard_keyword_string (str) – Query keyword string using standard syntax. Refer to the documentation for detailed instructions.
keyword_string (str, None) –
If you want to directly specify the keywords used in searching, set keyword_string to a custom non-empty string. It will OVERWRITE standard_keyword_string.
For example, setting keyword_string to "kuon_(utawarerumono) rating:safe" in DanbooruKeywordParser means searching directly with this string in Danbooru, and its standard keyword string equivalent is "kuon_(utawarerumono) AND rating:safe".
cookies (image_crawler_utils.Cookies, list, dict, str, None) –
Cookies used in loading websites.
accept_empty (bool) – If set to False (default), a critical error will be thrown when both standard_keyword_string and keyword_string are empty strings (like '' or ' '). If set to True, no error will be thrown and the parameters are accepted.
- display_all_configs()[source]
Display all config info. Dataclasses will be displayed in a neater way.
- generate_standard_keyword_string(keyword_tree=None)[source]
Generate a standard keyword string.
The generated result may differ from the standard_keyword_string input.
- Parameters:
keyword_tree (KeywordLogicTree | None) –
The KeywordLogicTree that a standard keyword string will be built from. Setting it to None (default) uses the KeywordLogicTree generated from the standard_keyword_string parameter.
ATTENTION: When set to None, the resulting standard keyword string may not be exactly the same as standard_keyword_string.
- Returns:
A standard keyword string.
- class image_crawler_utils.Parser(station_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''))[source]
Bases: ABC
A Parser includes several basic functions.
- Parameters:
station_url (str) –
The URL of the main page of a website.
This parameter works when several websites use the same structure. For example, https://yande.re/ and https://konachan.com/ both use Moebooru to build their websites, and this parameter must be filled to deal with these sites respectively.
For websites like https://www.pixiv.net/, as no other website uses its structure, this parameter has already been initialized and does not need to be filled.
crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.
cookies (image_crawler_utils.Cookies, list, dict, str, None) –
Cookies used in loading websites.
- classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]
Load the parser from .pkl file.
ATTENTION: You should use the corresponding Parser class when loading. For example, loading DanbooruKeywordParser should use DanbooruKeywordParser.load_from_pkl().
- Parameters:
pkl_file (str, None) – Name of the pkl file.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
A Parser class loaded from pkl file, or None if failed.
- Return type:
- display_all_configs()[source]
Display all config info. Dataclasses will be displayed in a neater way.
- get_cloudflare_cookies(url=None, headless=False, timeout=60, save_cookies_file=None, try_clicking=False)[source]
Bypass Cloudflare check and get its cookies.
- Parameters:
url (str) – Get Cloudflare cookies using this URL. Setting it to None (default) uses the station_url of this class.
headless (bool) – Whether to hide the browser window. Keeping it False (default) is recommended in case you need to manually bypass Cloudflare.
save_cookies_file (str, None) – Path to save the new cookies. Default is None, meaning the cookies are not saved.
timeout (float) – Try to finish the Cloudflare test within timeout seconds.
try_clicking (bool) – Try to repeatedly click the verification box. MAY CAUSE THE WEBSITE TO GET STUCK IN THE VERIFICATION PAGE.
- nodriver_request_page_content(url, browser=None, headless=True, is_json=False, thread_delay=None, page_stay_time=None)[source]
Download webpage content with nodriver.
For those sites having strong anti-crawling measures, try using this function to bypass them.
- Parameters:
url (str) – The URL of the page to download.
browser (nodriver.Browser, None) – An existing browser instance to use, if any.
headless (bool) – Whether to run the browser in headless mode. Default is True. Only works when browser is None.
is_json (bool) – Whether the result is a JSON text. Default is False.
thread_delay (float, Callable, None) – Delay before the thread runs. Default is None. Used to deal with websites like Pixiv which restrict the number of requests in a certain period of time.
page_stay_time (float, None) – Force the page to stay for page_stay_time seconds so that it can be fully loaded. Default is None, meaning no restriction on time.
- Returns:
The HTML content of the webpage.
- nodriver_threading_request_page_content(url_list, restriction_num=None, is_json=False, thread_delay=None, batch_num=None, batch_delay=0.0, headless=True, deconstruct_browser=False, page_stay_time=None)[source]
Download the content of multiple webpages using asynchronous coroutines (similar to threads) with nodriver.
For those sites having strong anti-crawling measures, try using this function to bypass them.
- Parameters:
url_list (list[str]) – The list of URLs of the pages to download.
restriction_num (int, None) – Only download the first restriction_num pages. None (default) means no restriction.
is_json (bool or Iterable instance) – Whether the result is a JSON text. Can be a bool or an iterable object with the same length as url_list. Default is False.
thread_delay (float, Callable, None) – Delay before each thread runs. Default is None. Used to deal with websites like Pixiv which restrict the number of requests in a certain period of time.
batch_num (int) – Number of pages in each batch; use it together with batch_delay to wait for a certain period of time after downloading each batch. Used to deal with websites like Pixiv which restrict the number of requests in a certain period of time.
batch_delay (float, Callable) – Delay time (seconds) after each batch is downloaded. Used to deal with websites like Pixiv which restrict the number of requests in a certain period of time.
headless (bool) – Whether to run without displaying a browser window. Default is True; setting it to False is helpful for debugging and bypassing some anti-crawling measures.
deconstruct_browser (int) – Whether to deconstruct all instances and clear caches upon finishing. Can improve performance in restricted environments.
page_stay_time (float, None) – Force the page to stay for page_stay_time seconds so that it can be fully loaded. Default is None, meaning no restriction on time.
- Returns:
A list of the HTML contents of the webpages. Its order is the same as the one of url_list.
- Return type:
list
- request_page_content(url, session=<requests.Session object>, headers=<image_crawler_utils.utils.Empty object>, thread_delay=None)[source]
Download webpage content.
- Parameters:
url (str) – The URL of the page to download.
session (the requests module, or requests.Session) – Can be the requests module or a requests.Session() instance.
headers (dict, Callable, None) – If you need to specify headers for the current request, use this argument. None (default) means using the headers from self.crawler_settings.download_config.result_headers.
thread_delay (None | float | Callable) – Delay before the thread runs. Default is None. Used to deal with websites like Pixiv which restrict the number of requests in a certain period of time.
- Returns:
The HTML content of the webpage.
- Return type:
- abstractmethod run()[source]
MUST BE OVERRIDDEN. Generate a list of ImageInfo, containing image URLs, names and infos.
- threading_request_page_content(url_list, restriction_num=None, session=<requests.Session object>, headers=<image_crawler_utils.utils.Empty object>, thread_delay=None, batch_num=None, batch_delay=0.0)[source]
Download the content of multiple webpages using threading.
- Parameters:
url_list (list[str]) – The list of URLs of the pages to download.
restriction_num (int, None) – Only download the first restriction_num pages. None (default) means no restriction.
session (the requests module, or requests.Session) – Can be the requests module or a requests.Session() instance.
headers (dict, list, Callable, None) – If you need to specify headers for the current threading requests, use this argument. None (default) means using the headers from self.crawler_settings.download_config.result_headers. If it is a list, it should be of the same length as url_list, and url_list[i] will use the headers in headers[i]. Each element of this list can be a dict or a function.
thread_delay (float, Callable, None) – Delay before each thread runs. Default is None. Used to deal with websites like Pixiv which restrict the number of requests in a certain period of time.
batch_num (int | None) – Number of pages in each batch; use it together with batch_delay to wait for a certain period of time after downloading each batch. Used to deal with websites like Pixiv which restrict the number of requests in a certain period of time.
batch_delay (float | Callable) – Delay time (seconds) after each batch is downloaded. Used to deal with websites like Pixiv which restrict the number of requests in a certain period of time.
- Returns:
A list of the HTML contents of the webpages. Its order is the same as the one of url_list.
- Return type:
list
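The interaction of restriction_num, batch_num and batch_delay can be sketched as follows. This is a simplified, sequential model rather than the threaded implementation: fetch_in_batches and its fetch argument are hypothetical, and the real method downloads the pages of a batch concurrently.

```python
import time


def fetch_in_batches(url_list, fetch, restriction_num=None, batch_num=None, batch_delay=0.0):
    """Fetch URLs batch by batch, pausing between batches (hypothetical helper)."""
    # Only keep the first restriction_num URLs, if a restriction is given
    urls = list(url_list) if restriction_num is None else list(url_list)[:restriction_num]
    # Without batch_num, everything is treated as a single batch
    size = batch_num if batch_num else len(urls) or 1
    results = []
    for start in range(0, len(urls), size):
        batch = urls[start:start + size]
        results.extend(fetch(url) for url in batch)
        if start + size < len(urls):
            # batch_delay may be a number or a function returning a number
            delay = batch_delay() if callable(batch_delay) else batch_delay
            time.sleep(delay)
    return results


pages = fetch_in_batches(
    ["u1", "u2", "u3", "u4", "u5"],
    fetch=lambda url: f"<html>{url}</html>",  # Placeholder for a real request
    restriction_num=4,
    batch_num=2,
    batch_delay=0.0,
)
```

The results keep the order of url_list, matching the Returns description above.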
- image_crawler_utils.load_image_infos(json_file, encoding='UTF-8', display_progress=True, log=<image_crawler_utils.log.Log object>)[source]
Load the ImageInfo list from a JSON file.
ONLY WORKS IF the info can be JSON serialized.
- Parameters:
json_file (str) – Name / Path of the JSON file.
encoding (str) – Encoding of the JSON file.
display_progress (bool) – Display a rich progress bar when running. The progress bar will be hidden after finishing.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
List of ImageInfo, or None if failed.
- Return type:
- image_crawler_utils.save_image_infos(image_info_list, json_file, encoding='UTF-8', display_progress=True, log=<image_crawler_utils.log.Log object>)[source]
Save the ImageInfo list into a JSON file.
ONLY WORKS IF the info can be JSON serialized.
- Parameters:
image_info_list (Iterable[image_crawler_utils.ImageInfo]) – An iterable (e.g. list or tuple) of image_crawler_utils.ImageInfo.
json_file (str) – Name / Path of the JSON file. Suffix (.json) is optional.
encoding (str) – Encoding of the JSON file.
display_progress (bool) – Display a rich progress bar when running. The progress bar will be hidden after finishing.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
(Saved file name, Absolute path of the saved file), or None if failed.
- Return type:
- async image_crawler_utils.update_nodriver_browser_cookies(browser, cookies)[source]
Update the nodriver browser cookies with the Cookies provided.
nodriver includes browser.cookies.set_all(), but it has a critical bug that has stayed unfixed for a long time, so this function implements the update itself.
- Parameters:
browser (nodriver.Browser) – The browser created by nodriver.
cookies (image_crawler_utils.Cookies) – The cookies containing account information.
Subpackages
Submodules
image_crawler_utils.configs module
- class image_crawler_utils.configs.CapacityCountConfig(image_num=None, capacity=None, page_num=None)[source]
Bases: object
Contains config for restrictions on the number of images, total size or number of webpages.
- capacity: float | None = None
Total size of images (bytes) to be downloaded.
Default is None, meaning no restrictions.
When the capacity is reached, no new downloading threads will be added. However, downloading threads that have already started will not be affected, which means the actual total image size can be larger than the capacity.
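Why the total can exceed the capacity is easier to see in a small model. The sketch below is a single-threaded simplification with a hypothetical function name: the capacity is checked only before a download starts, so the last started download may push the total past the limit.

```python
def schedule_downloads(image_sizes, capacity=None):
    """Start downloads until the capacity is reached (hypothetical model)."""
    downloaded_total = 0
    started = []
    for size in image_sizes:
        if capacity is not None and downloaded_total >= capacity:
            break  # Capacity reached: do not start new downloads
        started.append(size)
        downloaded_total += size  # A started download always finishes
    return started, downloaded_total


# Four images of 40 units each with a capacity of 100 units
started, total = schedule_downloads([40, 40, 40, 40], capacity=100)
```

Here three downloads are started and the total reaches 120, which is above the capacity of 100.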
- image_num: int | None = None
The number of images to be parsed from websites or downloaded.
Default is None, meaning no restrictions.
Mostly only used in the Downloader to control the number of images to be downloaded, but some Parsers may also use this parameter.
- page_num: int | None = None
Number of gallery pages to detect images in total.
Default is None, meaning no restrictions.
Some websites, like Twitter / X, do not use gallery pages or JSON-API pages (Image Crawler Utils scrolls webpages to get Twitter / X images), so this parameter is not used for them.
- class image_crawler_utils.configs.DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True)[source]
Bases: object
Contains config for whether to display a certain level of debugging messages in the console.
Default is the “info” level.
- Parameters:
- classmethod level(level_str)[source]
Create a DebugConfig that is set to display messages at or above the level. For example, “warning” will display warning, error and critical messages.
- Parameters:
level_str (str) –
Must be one of (from lower to higher) “debug”, “info”, “warning”, “error”, “critical” or “silenced”.
Setting a logging level displays messages at and above this level. For example, .set_level("warning") will only display messages with “warning”, “error” and “critical” levels.
Setting the “silenced” level will not display any messages except those generated by the progress bars.
- Returns:
Created DebugConfig.
- Return type:
- set_level(level_str)[source]
Set the current DebugConfig to display messages at or above the level. For example, “warning” will display warning, error and critical messages.
- Parameters:
level_str (str) –
Must be one of (from lower to higher) “debug”, “info”, “warning”, “error”, “critical” or “silenced”.
Setting a logging level displays messages at and above this level. For example, .set_level("warning") will only display messages with “warning”, “error” and “critical” levels.
Setting the “silenced” level will not display any messages except those generated by the progress bars.
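The threshold behaviour described above maps a level name to the five show_* flags. The sketch below shows one way to compute that mapping; flags_for_level is a hypothetical name and the code is not taken from the library.

```python
LEVELS = ["debug", "info", "warning", "error", "critical"]


def flags_for_level(level_str: str) -> dict:
    """Map a level name to show_* flags (hypothetical helper)."""
    if level_str == "silenced":
        # Nothing is displayed at the silenced level
        return {f"show_{name}": False for name in LEVELS}
    if level_str not in LEVELS:
        raise ValueError(f"Unknown level: {level_str!r}")
    threshold = LEVELS.index(level_str)
    # Levels at or above the threshold are displayed
    return {f"show_{name}": index >= threshold for index, name in enumerate(LEVELS)}


warning_flags = flags_for_level("warning")
info_flags = flags_for_level("info")
```

The flags for "info" match the defaults shown in the DebugConfig signature (show_debug=False, all others True).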
- show_critical: bool = True
Display critical-level messages.
Default is True.
Includes messages of errors that interrupt the crawler. Usually a Python error will be raised when critical errors happen.
- show_error: bool = True
Display error-level messages.
Default is set to True.
Includes messages of errors that may affect the final results but do not interrupt the crawler.
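The level rule described for level() and set_level() can be illustrated with a small stand-alone sketch. SketchDebugConfig and LEVELS are hypothetical names used only for illustration; the code mirrors the documented behaviour (show everything at or above the chosen level, nothing for “silenced”) and is not the library's implementation. Raising ValueError on an unknown level is an assumption.

```python
from dataclasses import dataclass

# Levels from lower to higher, as listed in the documentation.
LEVELS = ("debug", "info", "warning", "error", "critical")


@dataclass
class SketchDebugConfig:
    """Hypothetical stand-in with the same five flags as DebugConfig."""
    show_debug: bool = False
    show_info: bool = True
    show_warning: bool = True
    show_error: bool = True
    show_critical: bool = True

    def set_level(self, level_str: str) -> None:
        # "silenced" sits above every level, so no flag is enabled.
        if level_str == "silenced":
            threshold = len(LEVELS)
        elif level_str in LEVELS:
            threshold = LEVELS.index(level_str)
        else:
            # Assumption: the real class may handle bad input differently.
            raise ValueError(f"Unknown level: {level_str!r}")
        for position, name in enumerate(LEVELS):
            setattr(self, f"show_{name}", position >= threshold)

    @classmethod
    def level(cls, level_str: str) -> "SketchDebugConfig":
        config = cls()
        config.set_level(level_str)
        return config


warning_only = SketchDebugConfig.level("warning")
print(warning_only)
```

With the real package, the equivalent call would presumably be DebugConfig.level("warning"), as described above.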
- class image_crawler_utils.configs.DownloadConfig(headers=None, proxies=None, thread_delay=5, fail_delay=3, randomize_delay=True, thread_num=5, timeout=10, max_download_time=None, retry_times=5, overwrite_images=True)[source]
Bases: object
Contains config for downloading.
- Parameters:
- fail_delay: float = 3
Delaying time (seconds) after every failure.
Both fetching webpages and downloading images will use this parameter.
Some Parsers may use different parameters to control their delaying time when a failure happens.
- headers: dict | Callable | None = None
Headers of the requests.
Both fetching webpages and downloading images will use this parameter.
This only works when the request is sent by requests (like requests.get()). For webpages loaded by browsers, this parameter is ignored.
Basically, this contains the user agent of the requests.
ATTENTION: Not all user agents are supported by the websites you are accessing!
- max_download_time: float | None = None
If no new data is fetched for max_download_time seconds while downloading an image, the download is treated as a failure.
Only downloading images will use this parameter.
Default is set to None, meaning no restrictions.
- overwrite_images: bool = True
Overwrite existing images when downloading.
Only downloading images will use this parameter.
- proxies: dict | Callable | None = None
Proxies used by the crawler.
Both fetching webpages and downloading images will use this parameter.
- Both requests and browsers use these proxies, but the structure should be in a requests-acceptable form like:
  HTTP type: {'http': '127.0.0.1:7890'}
  HTTPS type: {'https': '127.0.0.1:7890'}
  SOCKS type: {'https': 'socks5://127.0.0.1:7890'}
If you input 'https' proxies, 'http' proxies will be automatically generated.
ATTENTION: Using usernames and passwords is currently not supported.
- randomize_delay: bool = True
Randomize thread_delay and fail_delay between 0 and their values.
For example, thread_delay=5.0 and randomize_delay=True will cause the thread_delay to choose a random value between 0 and 5.0 every time.
- property result_fail_delay: float
Generate the fail delay. If the randomize_delay attribute is set to
True, the delay will be randomized between 0 and fail_delay for every usage.
- property result_headers: dict | None
Generate the headers. If the headers attribute is callable, it will be called for every usage.
- property result_proxies: dict | None
Generate the proxies. If the proxies attribute is callable, it will be called for every usage.
- property result_thread_delay: float
Generate the thread delay. If the randomize_delay attribute is set to
True, the delay will be randomized between 0 and thread_delay for every usage.
- retry_times: int = 5
Total times of retrying to fetch a webpage / download an image.
Both fetching webpages and downloading images will use this parameter.
- thread_delay: float = 5
Delaying time (seconds) before every thread starts.
Both fetching webpages and downloading images will use this parameter.
Some Parsers may use different parameters to control their delaying time.
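The result_* properties above follow one pattern: a stored value is resolved on every access, either by calling it (headers, proxies) or by randomizing it (delays). The sketch below shows that pattern with a hypothetical SketchDownloadConfig class; using random.uniform for the randomization is an assumption, and the real DownloadConfig may differ in detail.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Union


@dataclass
class SketchDownloadConfig:
    """Hypothetical stand-in covering a few DownloadConfig fields."""
    headers: Union[dict, Callable[[], dict], None] = None
    thread_delay: float = 5
    fail_delay: float = 3
    randomize_delay: bool = True

    @property
    def result_headers(self) -> Optional[dict]:
        # A callable is invoked on every access, so headers can rotate.
        return self.headers() if callable(self.headers) else self.headers

    def _delay(self, value: float) -> float:
        # Assumption: a uniform draw between 0 and the configured value.
        return random.uniform(0, value) if self.randomize_delay else value

    @property
    def result_thread_delay(self) -> float:
        return self._delay(self.thread_delay)

    @property
    def result_fail_delay(self) -> float:
        return self._delay(self.fail_delay)


def rotating_headers() -> dict:
    # Hypothetical header factory; any callable returning a dict works.
    agent = random.choice(["agent-a", "agent-b"])
    return {"User-Agent": agent}


config = SketchDownloadConfig(headers=rotating_headers, thread_delay=5.0)
print(config.result_headers, config.result_thread_delay)
```

Passing a callable for headers is useful when each request should carry a different user agent, while randomize_delay=False gives a fixed, predictable delay.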
image_crawler_utils.keyword module
- class image_crawler_utils.keyword.KeywordLogicTree(lchild='', rchild='', logic_operator='SINGLE')[source]
Bases: object
A binary tree to record the logic structure of keywords.
- Parameters:
lchild (str | KeywordLogicTree)
rchild (str | KeywordLogicTree)
logic_operator (str)
- is_empty()[source]
Check whether current KeywordLogicTree is empty.
- Returns:
A boolean denoting whether current tree is empty.
- Return type:
- is_leaf()[source]
Whether current tree is a leaf node.
- Returns:
A boolean denoting whether current node is a leaf node.
- Return type:
- keyword_include_group_list()[source]
Returns a list of keyword groups (list of keywords) which are minimal supersets of current tree.
For example:
For “A AND B OR C”, its minimal supersets are ['A', 'C'] and ['B', 'C']. That is, if you search “A OR C” or “B OR C”, you can get all results that match “A AND B OR C”.
For “A AND [B OR C]”, its minimal supersets are ['A'] and ['B', 'C'].
Useful for websites that restrict the number of keywords when searching.
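One way to arrive at the groups in the examples above is a short recursion over the nested-list form of the tree: for AND, either side alone already covers every match, so the groups of both sides are alternatives; for OR, one group from each side must be merged. The function below is a sketch of that idea only. include_groups is a hypothetical name, it works on plain nested lists rather than a KeywordLogicTree, and NOT nodes are deliberately left out.

```python
def include_groups(struct):
    """Return keyword groups whose OR-search covers every match of struct.

    struct is a keyword string or a nested list [left, op, right]
    with op being "AND" or "OR" (NOT is not handled in this sketch).
    """
    if isinstance(struct, str):
        return [[struct]]
    if isinstance(struct, list) and len(struct) == 3:
        left, op, right = struct
        left_groups = include_groups(left)
        right_groups = include_groups(right)
        if op == "AND":
            # Searching either side alone is already a superset.
            return left_groups + right_groups
        if op == "OR":
            # Both sides must be covered, so merge one group from each.
            return [sorted(set(lg + rg))
                    for lg in left_groups for rg in right_groups]
    raise ValueError(f"Unsupported structure: {struct!r}")


print(include_groups([["A", "AND", "B"], "OR", "C"]))
print(include_groups(["A", "AND", ["B", "OR", "C"]]))
```

The two calls reproduce the groups given in the examples for “A AND B OR C” and “A AND [B OR C]”.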
- keyword_list_check(keyword_list)[source]
Check whether the keyword list matches this tree.
For example, keyword lists ['A', 'B'], ['C', 'D'] and ['A', 'B', 'C'] match “A AND B OR C”, while keyword list ['B', 'D'] cannot match “A AND B OR C”.
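The matching rule in this example can be checked with a small recursive evaluator. The sketch below works on a nested list such as [['A', 'AND', 'B'], 'OR', 'C'] instead of a real KeywordLogicTree; matches is a hypothetical helper, and representing negation as ['NOT', x] is an assumption made for illustration.

```python
def matches(struct, keywords):
    """Evaluate a nested-list keyword expression against a keyword list."""
    if isinstance(struct, str):
        return struct in keywords
    if len(struct) == 2 and struct[0] == "NOT":
        # Assumed representation of negation: ['NOT', operand].
        return not matches(struct[1], keywords)
    if len(struct) == 3:
        left, op, right = struct
        if op == "AND":
            return matches(left, keywords) and matches(right, keywords)
        if op == "OR":
            return matches(left, keywords) or matches(right, keywords)
    raise ValueError(f"Unsupported structure: {struct!r}")


tree = [["A", "AND", "B"], "OR", "C"]  # "A AND B OR C"
for candidate in (["A", "B"], ["C", "D"], ["A", "B", "C"], ["B", "D"]):
    print(candidate, matches(tree, candidate))
```

Only ['B', 'D'] fails, because it contains neither both of A and B nor C.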
- list_struct()[source]
Returns the structure of the current tree as a recursive list.
For example, standard keyword string “A AND B OR C” will be returned as [['A', 'AND', 'B'], 'OR', 'C'].
- Returns:
A list with the structure of this keyword tree.
- Return type:
- simplify_tree()[source]
Simplify the tree structure, including: NOT NOT key -> key and SINGLE key -> key.
If you create a KeywordLogicTree through the functions provided, .simplify_tree() will be automatically executed.
- Return type:
None
- standard_keyword_string()[source]
Returns the reconstructed standard keyword string.
The result may not be the same as the string that is used to construct the KeywordLogicTree.
For example, standard keyword string “A AND B OR C” will be returned as “[[A AND B] OR C]”.
- Returns:
A standard keyword string.
- Return type:
- lchild: str | KeywordLogicTree = ''
Left child.
- logic_operator: str = 'SINGLE'
Logic operator. Can be one of “AND”, “OR”, “NOT” or “SINGLE”.
When it is “NOT” or “SINGLE”, lchild should be omitted.
“SINGLE” means this node has only one element rchild. After building a tree, use simplify_tree() to simplify these nodes.
- rchild: str | KeywordLogicTree = ''
Right child.
- image_crawler_utils.keyword.construct_keyword_tree(keyword_str, log=<image_crawler_utils.log.Log object>)[source]
Use a standard syntax to represent logic relationship of keywords. Use ‘ AND ‘ / ‘&’, ‘ OR ‘ / ‘|’, ‘ NOT ‘ / ‘!’ to represent logic operators.
Use ‘[’, ‘]’ to increase logic priority.
Any space between two keywords will be replaced with ‘_’ and thus be considered as one keyword.
Example: “A B & [C (extra) OR NOT D]” -> “A_B AND [C_(extra) OR NOT D]”
- Parameters:
keyword_str (str) – A string of keywords.
log (image_crawler_utils.log.Log, None) – The logging config.
- Returns:
If successful, returns a KeywordLogicTree. If failed, return None.
- Return type:
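The normalisation shown in the example (“A B & [C (extra) OR NOT D]” -> “A_B AND [C_(extra) OR NOT D]”) can be sketched as a small token pass. normalize_keyword_string is a hypothetical helper written for illustration and only covers this rewriting step, not the construction of the tree; it also assumes that '&', '|', '!', '[' and ']' never appear inside a keyword.

```python
OPERATORS = {"AND", "OR", "NOT"}
BRACKETS = {"[", "]"}


def normalize_keyword_string(keyword_str: str) -> str:
    """Rewrite symbol operators as words and join multi-word keywords."""
    text = (keyword_str.replace("&", " AND ")
                       .replace("|", " OR ")
                       .replace("!", " NOT "))
    # Pad brackets so that they become separate tokens.
    text = text.replace("[", " [ ").replace("]", " ] ")

    output, current_keyword = [], []
    for token in text.split():
        if token in OPERATORS or token in BRACKETS:
            if current_keyword:
                # Words between two operators form one keyword.
                output.append("_".join(current_keyword))
                current_keyword = []
            output.append(token)
        else:
            current_keyword.append(token)
    if current_keyword:
        output.append("_".join(current_keyword))

    # Remove the padding that was added around brackets.
    return " ".join(output).replace("[ ", "[").replace(" ]", "]")


print(normalize_keyword_string("A B & [C (extra) OR NOT D]"))
```

The call prints the normalised form given in the example above.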
- image_crawler_utils.keyword.construct_keyword_tree_from_list(keyword_list, connect_symbol='OR', log=<image_crawler_utils.log.Log object>)[source]
Convert a list of keywords into a keyword tree connected by connect_symbol (default is “OR”).
e.g. ['A', 'B', 'C'] -> [['A' OR 'B'] OR 'C']
- Parameters:
keyword_list (Iterable(str)) – A list of strings.
connect_symbol (str) – Logic symbol of connection. Must be one of ‘AND’, ‘OR’, ‘&’ or ‘|’.
log (image_crawler_utils.log.Log, None) – The logging config.
- Returns:
If successful, returns a KeywordLogicTree. If failed, return None.
- Return type:
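The left-nested shape in the example, ['A', 'B', 'C'] -> [['A' OR 'B'] OR 'C'], is what a left fold over the keywords produces. The sketch below builds the same shape as a nested list; fold_keywords is a hypothetical name, and returning None for an empty input is an assumption rather than documented behaviour.

```python
from functools import reduce

SYMBOL_ALIASES = {"&": "AND", "|": "OR"}


def fold_keywords(keyword_list, connect_symbol="OR"):
    """Combine keywords into a left-nested [left, op, right] structure."""
    operator = SYMBOL_ALIASES.get(connect_symbol, connect_symbol)
    if operator not in ("AND", "OR"):
        raise ValueError(f"Unsupported connect symbol: {connect_symbol!r}")
    keywords = list(keyword_list)
    if not keywords:
        return None  # Assumption: nothing to build from an empty list.
    # reduce() folds from the left: ((A op B) op C) op ...
    return reduce(lambda tree, keyword: [tree, operator, keyword], keywords)


print(fold_keywords(["A", "B", "C"]))
print(fold_keywords(["A", "B", "C"], connect_symbol="&"))
```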
- image_crawler_utils.keyword.min_len_keyword_group(keyword_group_list, below=None)[source]
For a list of keyword groups (i.e. lists of keywords), get the keyword groups with the smallest length, or all keyword groups whose lengths are no larger than below.
- Parameters:
- Returns:
A list of keyword groups (i.e. lists of keywords)
- Return type:
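A minimal sketch of this selection rule follows. min_len_groups is a hypothetical name; falling back to the shortest groups when below is smaller than every group is an assumption, since the text above does not say what happens in that case.

```python
def min_len_groups(keyword_group_list, below=None):
    """Pick the shortest keyword groups, or all groups no longer than below."""
    groups = list(keyword_group_list)
    if not groups:
        return []
    shortest = min(len(group) for group in groups)
    # Assumption: never return an empty result while groups exist.
    limit = shortest if below is None else max(below, shortest)
    return [group for group in groups if len(group) <= limit]


groups = [["A", "C"], ["B"], ["D", "E", "F"]]
print(min_len_groups(groups))
print(min_len_groups(groups, below=2))
```

Combined with keyword_include_group_list(), this is a way to pick search queries that stay within a website's keyword limit.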
image_crawler_utils.log module
- class image_crawler_utils.log.Log(log_file=None, debug_config=DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True), logging_level=10, detailed_console_log=False)[source]
Bases: object
Class provided for logging messages onto the console and into the file.
- Parameters:
log_file (str) – Output name for the logging file. NO SUFFIX APPENDED. Setting it to None (default) outputs no file.
debug_config (image_crawler_utils.configs.DebugConfig) – Set the OUTPUT MESSAGE TO CONSOLE level. Default is the “info” level.
logging_level (str, int) – Set the logging level of the LOGGING FILE. Select from: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL.
detailed_console_log (bool) – When logging info to the console, always log msg (the messages logged into files) even if output_msg exists.
- critical(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output critical messages of errors that interrupt the crawler. Usually a Python error will be raised when critical errors happen.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
- debug(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output debug messages of many detailed information about running the crawler, especially connections with websites.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
- error(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output error messages of errors that may affect the final results but do not interrupt the crawler.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
- info(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output info messages of basic information indicating the progress of the crawler.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
- warning(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output warning messages of errors that basically do not affect the final results, mostly connection failures with the websites.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
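All five logging methods share the same msg / output_msg rule, which interacts with the detailed_console_log option of the Log class. The helper below is a hypothetical sketch of how the console text could be chosen according to the descriptions above; it is not taken from the library.

```python
def console_text(msg, output_msg=None, detailed_console_log=False):
    """Choose the string shown on the console for one logging call."""
    # msg always goes to the log file; the console prefers output_msg
    # unless it is missing or detailed console logging is requested.
    if detailed_console_log or output_msg is None:
        return msg
    return output_msg


print(console_text("GET https://example.com returned 200", "Page fetched."))
print(console_text("GET https://example.com returned 200"))
```

This allows a short, friendly message on the console while the file keeps the full detail.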
- image_crawler_utils.log.print_logging_msg(msg, level='', debug_config=DebugConfig(show_debug=True, show_info=True, show_warning=True, show_error=True, show_critical=True), exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Print time and message according to its logging level. If debug_config is used and the logging level is not allowed to show, the message will not be output.
- Parameters:
level (str) – Level of messages. Should be one of “debug”, “info”, “warning”, “error”, “critical”. Setting it to any other string or leaving it blank will always output the msg string without any prefix.
msg (str) – The message string to output.
debug_config (image_crawler_utils.configs.DebugConfig) – DebugConfig that controls output level. Default set to debug-level (output all).
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
image_crawler_utils.progress_bar module
- class image_crawler_utils.progress_bar.CountColumn(table_column=None, has_unit=False)[source]
Bases: ProgressColumn
A rich.progress.ProgressColumn class, which displays the current progress count as an integer.
- Parameters:
- class image_crawler_utils.progress_bar.CustomProgress(*columns, text_only=False, has_spinner=False, spinner_name='dots', has_total=True, is_file=False, time_format='%H:%M:%S', is_compact_time_format=True, is_sub_process=False, console=None, auto_refresh=True, refresh_per_second=20, speed_estimate_period=30, transient=False, redirect_stdout=True, redirect_stderr=True, get_time=None, disable=False, expand=False)[source]
Bases: Progress
A wrapped Progress class with specific format-controlling parameters.
- If you add ProgressColumns to CustomProgress like a normal rich.progress.Progress class, they will be placed between BarColumn & TaskProgressColumn (i.e. the progress bar and percentage) and TimeColumnLeft.
- These ProgressColumns will be omitted if text_only is set to True.
- Parameters:
text_only (bool) –
If set to True, Progress bars will only display descriptions. Default is False.
When set to True, all other columns except rich.progress.TextColumn("[progress.description]{task.description}[reset]") will be omitted!
has_spinner (bool) – If set to True, a spinner will be added to the left. Default is False.
spinner_name (str) – The type of the spinner, which can be referred from https://jsfiddle.net/sindresorhus/2eLtsbey/embedded/result/. Default is “dots”.
has_total (bool) – Set to True if involved tasks have total numbers. Default is True.
is_file (bool) – Set to True if involved tasks deal with files. Default is False.
time_format (str) –
A string that controls the time format. Default is “%H:%M:%S”.
'%H' will be replaced with hours.
'%M' will be replaced with minutes.
'%S' will be replaced with seconds.
'%L' will be replaced with milliseconds.
is_compact_time_format (bool) –
When set to True (default), the time_format will be truncated to start from '%M' when time is lower than 1 hour.
For example, “%H:%M:%S.%L” will be truncated to “%M:%S.%L” when time is lower than 1 hour.
is_sub_process (bool) – Set to True if it is a subprocess of a certain ProgressGroup. Default is False.
console (Console, None) – Optional Console instance. Defaults to an internal Console instance writing to stdout.
auto_refresh (bool, None) – Enable auto refresh. If disabled, you will need to call refresh().
refresh_per_second (Optional[float], None) – Number of times per second to refresh the progress information, or None to use the default (10). Defaults to 20.
speed_estimate_period (float, None) – Period (in seconds) used to calculate the speed estimate. Defaults to 30.
transient (bool, None) – Clear the progress on exit. Defaults to False.
redirect_stdout (bool, None) – Enable redirection of stdout, so print may be used. Defaults to True.
redirect_stderr (bool, None) – Enable redirection of stderr. Defaults to True.
get_time (Callable, None) – A callable that gets the current time, or None to use Console.get_time. Defaults to None.
disable (bool, None) – Disable progress display. Defaults to False.
expand (bool, None) – Expand tasks table to fit width. Defaults to False.
- finish_task(task, hide=True)[source]
Finish a task within the CustomProgress. Unless this CustomProgress is a preset attribute of ProgressGroup or its is_sub_process is set to True, running this function will stop the whole Progress; otherwise it will just stop the task.
- Parameters:
task (rich.progress.Task) – The Task class that is created under this CustomProgress.
hide (bool) – Set to True (default) to hide the progress bar of this task.
- class image_crawler_utils.progress_bar.ProgressGroup(progress_list=[], has_panel=True, panel_title=None, panel_subtitle=None, refresh_per_second=10)[source]
Bases: object
A Group of Progress, which can simplify building multiple Progress bars.
- Parameters:
progress_list (list[rich.progress.Progress]) – An iterable list of rich.progress.Progress classes which will be added to the ProgressGroup when created. Default is [] (an empty list).
has_panel (bool) – When set to True (default), a rich.panel.Panel will be wrapped around all of the progress bars.
panel_title (str) –
When set to a str, the title will be displayed at the top middle of the panel.
Works only if has_panel is set to True.
panel_subtitle (str) –
When set to a str, the subtitle will be displayed at the bottom middle of the panel.
Works only if has_panel is set to True.
refresh_per_second (int) – Refresh the progress bars refresh_per_second times per second. Default is 10.
- class image_crawler_utils.progress_bar.SpeedColumnRight(table_column=None, is_file=False)[source]
Bases: ProgressColumn
A rich.progress.ProgressColumn class, which displays speed in “1.23 MB/s]” format. It is suggested to put it to the right of TimeColumnLeft.
- Parameters:
- class image_crawler_utils.progress_bar.TimeColumnLeft(table_column=None, has_total=True, time_format='%H:%M:%S', is_compact_time_format=True)[source]
Bases: ProgressColumn
A rich.progress.ProgressColumn class, which displays elapsed time and remaining time in “[00:08<00:03,” format. It is suggested to put it to the left of SpeedColumnRight.
- Parameters:
table_column – Table column of the ProgressColumn.
has_total (bool) – Set to True if involved tasks have a total number. When set to False, remaining time will not be displayed. Default is True.
time_format (str) –
A string that controls the time format. Default is “%H:%M:%S”.
'%H' will be replaced with hours.
'%M' will be replaced with minutes.
'%S' will be replaced with seconds.
'%L' will be replaced with milliseconds.
is_compact_time_format (bool) –
When set to True (default), the time_format will be truncated to start from '%M' when time is lower than 1 hour.
For example, “%H:%M:%S.%L” will be truncated to “%M:%S.%L” when time is lower than 1 hour.
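The time_format placeholders and the compact truncation can be illustrated with a stand-alone formatter. format_time is a hypothetical helper, and the zero-padding widths (two digits, three for milliseconds) are assumptions based on the “[00:08<00:03,” example; the real columns may render differently.

```python
def format_time(seconds, time_format="%H:%M:%S", is_compact_time_format=True):
    """Render a duration in seconds using %H, %M, %S and %L placeholders."""
    total_ms = int(seconds * 1000)
    total_seconds, millis = divmod(total_ms, 1000)
    total_minutes, secs = divmod(total_seconds, 60)
    hours, minutes = divmod(total_minutes, 60)

    fmt = time_format
    if is_compact_time_format and total_seconds < 3600 and "%M" in fmt:
        # Below one hour, drop everything before the minutes field.
        fmt = fmt[fmt.index("%M"):]

    return (fmt.replace("%H", f"{hours:02d}")
               .replace("%M", f"{minutes:02d}")
               .replace("%S", f"{secs:02d}")
               .replace("%L", f"{millis:03d}"))


print(format_time(8))
print(format_time(8.5, "%H:%M:%S.%L"))
print(format_time(3725.25, "%H:%M:%S.%L"))
```

Durations of one hour or more keep the full format, so the hours field only appears when it carries information.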
image_crawler_utils.utils module
- class image_crawler_utils.utils.Empty[source]
Bases: object
An empty placeholder class, mainly for checking if a parameter is used.
- image_crawler_utils.utils.attempt_name_len()[source]
Try to calculate the length of names.
- Returns:
The length of shortened file name. If terminal size is available, the result will be \(\left\lfloor\frac{\text{terminal\_size} - 10}{5}\right\rfloor\). Otherwise, the result will be 10.
- Return type:
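The formula above can be written out as a short function. sketch_name_len is a hypothetical name, the optional columns argument exists only to make the arithmetic easy to try, and reading the width through os.get_terminal_size() is an assumption about how the terminal size is obtained.

```python
import os


def sketch_name_len(columns=None):
    """Return floor((terminal width - 10) / 5), or 10 without a terminal."""
    if columns is None:
        try:
            columns = os.get_terminal_size().columns
        except (OSError, ValueError):
            # No terminal attached (e.g. output is piped): use the fallback.
            return 10
    return (columns - 10) // 5


print(sketch_name_len(80))   # an 80-column terminal
print(sketch_name_len(120))  # a 120-column terminal
```

For an 80-column terminal this gives (80 - 10) // 5 = 14 characters.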
- image_crawler_utils.utils.check_dir(dir_path, log=<image_crawler_utils.log.Log object>)[source]
This function will check whether a directory exists, and try to create it if it does not exist. A logging message will be printed to the console on success, and a critical error will be thrown on failure.
- Parameters:
dir_path (str) – Directory path.
log (image_crawler_utils.log.Log, None) – Logging config.
- Return type:
None
- image_crawler_utils.utils.load_dataclass(dataclass_to_load, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]
Load the file containing a dataclass into the dataclass_to_load parameter.
The dataclass should be the same as the one you once saved.
- Parameters:
dataclass_to_load (A dataclass) – The dataclass to be loaded.
file_name (str) – Name of the file.
file_type (str, Optional) –
If no suffix is found in file_name, designate the file type manually.
Setting the file_type parameter to json or pkl will force the function to treat the file as this type, while leaving this parameter blank will cause the function to determine the file type according to file_name.
That is, load_dataclass(dataclass, 'foo.json') works the same as load_dataclass(dataclass, 'foo.json', 'json').
encoding (str) – Encoding of JSON file. Only works when loading from .json.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
Loaded dataclass_to_load, or None if failed.
- Return type:
- image_crawler_utils.utils.save_dataclass(dataclass, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]
Save the dataclass parameter into a file.
- Parameters:
dataclass (A dataclass) – The dataclass to be saved.
file_name (str) – Name of file. Suffix (.json / .pkl) is optional.
file_type (str, Optional) –
If no suffix is found in file_name, designate the file type manually.
Setting the file_type parameter to json or pkl will force the function to save the dataclass (config) as this type, while leaving this parameter blank will cause the function to determine the file type according to file_name.
That is, save_dataclass(dataclass, 'foo.json') works the same as save_dataclass(dataclass, 'foo', 'json').
.json is suggested when your dataclasses (configs) do not include objects that cannot be JSON-serialized (e.g. a function), while the serialized data file .pkl supports most data types but the saved file is not human-readable.
encoding (str) – Encoding of JSON file. Only works when saving as .json.
log (image_crawler_utils.log.Log, None) –
Logging config.
You can use log=crawler_settings.log to make it the same as the CrawlerSettings you set up.
- Returns:
(Saved file name, Absolute path of the saved file), or None if failed.
- Return type:
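The suffix and file_type rules shared by save_dataclass() and load_dataclass() can be sketched as follows. resolve_file_type, sketch_save and DemoConfig are hypothetical names used for illustration; returning None when the type cannot be determined, and appending the suffix when it is missing, are assumptions based on the description above rather than the library's code.

```python
import json
import os
import pickle
from dataclasses import asdict, dataclass


@dataclass
class DemoConfig:
    """Hypothetical dataclass used only for this example."""
    name: str = "demo"
    retries: int = 3


def resolve_file_type(file_name, file_type=None):
    """Return (final file name, 'json' or 'pkl'), or None if undecidable."""
    suffix = os.path.splitext(file_name)[1].lower().lstrip(".")
    if file_type is None:
        # No explicit type: rely on the suffix of the file name.
        return (file_name, suffix) if suffix in ("json", "pkl") else None
    file_type = file_type.lower().lstrip(".")
    if file_type not in ("json", "pkl"):
        return None
    if suffix != file_type:
        file_name = f"{file_name}.{file_type}"
    return file_name, file_type


def sketch_save(obj, file_name, file_type=None, encoding="UTF-8"):
    """Save a dataclass instance; return (file name, absolute path) or None."""
    resolved = resolve_file_type(file_name, file_type)
    if resolved is None:
        return None
    path, kind = resolved
    if kind == "json":
        with open(path, "w", encoding=encoding) as f:
            json.dump(asdict(obj), f, indent=4, ensure_ascii=False)
    else:
        with open(path, "wb") as f:
            pickle.dump(obj, f)
    return os.path.basename(path), os.path.abspath(path)


print(resolve_file_type("foo.json"))
print(resolve_file_type("foo", "json"))
```

Both calls resolve to the same target, which is the equivalence stated in the file_type description.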
- async image_crawler_utils.utils.set_up_nodriver_browser(proxies=None, headless=True, window_width=None, window_height=None, no_image_stylesheet=False)[source]
Set up a nodriver in a more convenient way.
WARNING: nodriver uses async functions. This function is async as well!
- Parameters:
proxies (dict, Callable, None) –
The proxies used in nodriver browser.
The pattern should be in a requests-acceptable form like:
HTTP type: {'http': '127.0.0.1:7890'}
HTTPS type: {'https': '127.0.0.1:7890'}, or {'https': '127.0.0.1:7890', 'http': '127.0.0.1:7890'}
SOCKS type: {'https': 'socks5://127.0.0.1:7890'}
headless (bool) – Do not display browser windows when a browser is started. Setting it to False will pop up browser windows.
window_width (int, None) –
Width of the browser window. Setting it to None will maximize the window.
Setting headless to True causes this parameter to be ignored.
window_height (int, None) –
Height of the browser window. Setting it to None will maximize the window.
Setting headless to True causes this parameter to be ignored.
no_image_stylesheet (bool) –
Do not load any images or stylesheet when loading webpages in this browser.
Setting this parameter to True can reduce the traffic when loading pages and accelerate loading speed.
- Returns:
The nodriver.
- Return type:
- image_crawler_utils.utils.shorten_file_name(file_name, name_len=10)[source]
Shorten file name for displaying on console.
- image_crawler_utils.utils.silent_deconstruct_browser(log=<image_crawler_utils.log.Log object>)[source]
I have had enough with nodriver’s annoying removing temp file messages. This function will do the same thing without those spamming messages.
- Parameters:
log (image_crawler_utils.log.Log) – Displaying those spamming messages.