Construct a Custom Parser
Constructing a custom Parser should follow 3 rules:
- Inherit the Parser or KeywordParser class, according to your task requirements.
- Use the parameters provided in the base class, together with any additional parameters your task needs, to complete the custom Parser class.
- Override the .run() method so that it returns a list of ImageInfo. An error will be raised if you do not override it!
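The three rules above can be sketched as follows. So that the snippet runs without the library installed, Parser and ImageInfo below are simplified stand-ins written for this illustration (their fields and constructors are assumptions, not the library's definitions), and GalleryParser with its page_count parameter is hypothetical. In real code you would inherit from image_crawler_utils.Parser instead.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


# Stand-in for the library's ImageInfo; the fields are assumptions.
@dataclass
class ImageInfo:
    url: str
    name: str
    info: dict = field(default_factory=dict)


# Stand-in for the library's Parser base class (an ABC with an abstract run()).
class Parser(ABC):
    def __init__(self, station_url: str):
        self.station_url = station_url

    @abstractmethod
    def run(self) -> list:
        ...


# Rule 1: inherit the base class.
class GalleryParser(Parser):
    # Rule 2: accept the base-class parameters plus task-specific ones.
    def __init__(self, station_url: str, page_count: int = 1):
        super().__init__(station_url=station_url)
        self.page_count = page_count

    # Rule 3: override run() and return a list of ImageInfo.
    def run(self) -> list:
        return [
            ImageInfo(
                url=f"{self.station_url}image/{page}.png",
                name=f"{page}.png",
                info={"page": page},
            )
            for page in range(1, self.page_count + 1)
        ]


# A subclass that forgets to override run() cannot be instantiated.
class IncompleteParser(Parser):
    pass


images = GalleryParser("https://example.com/", page_count=2).run()

try:
    IncompleteParser("https://example.com/")
    missing_run_error = None
except TypeError as error:
    missing_run_error = error
```

In this sketch the error for a missing run() comes from Python's abstract base class machinery, which refuses to instantiate a class that still has abstract methods.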
Inherit a Parser Class
Parser Class
For most tasks, you can just inherit the Parser class.
You can use the attribute functions provided to simplify your programs, especially these functions for fetching websites:
- image_crawler_utils.Parser.request_page_content() for fetching the content of a single page.
- image_crawler_utils.Parser.threading_request_page_content() for fetching the contents of multiple pages.
- Their nodriver counterparts, image_crawler_utils.Parser.nodriver_request_page_content() and image_crawler_utils.Parser.nodriver_threading_request_page_content(), for websites with stronger anti-crawling measures. Google Chrome is required to run these 2 functions.
All parameters and attribute functions are listed here:
- class image_crawler_utils.Parser(station_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''))[source]
Bases: ABC
A Parser includes several basic functions.
- Parameters:
station_url (str) –
The URL of the main page of a website.
This parameter works when several websites use the same structure. For example, https://yande.re/ and https://konachan.com/ both use Moebooru to build their websites, and this parameter must be filled to deal with these sites respectively.
For websites like https://www.pixiv.net/, as no other website uses its structure, this parameter has already been initialized and does not need to be filled.
crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.
cookies (image_crawler_utils.Cookies, list, dict, str, None) –
Cookies used in loading websites.
- classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)[source]
Load the parser from .pkl file.
ATTENTION: You should use the corresponding Parser class when loading. For example, loading DanbooruKeywordParser should use DanbooruKeywordParser.load_from_pkl().
- Parameters:
pkl_file (str, None) – Name of the pkl file.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
The Parser loaded from the .pkl file, or None if loading failed.
- display_all_configs()[source]
Display all config info. Dataclasses will be displayed in a neater way.
- get_cloudflare_cookies(url=None, headless=False, timeout=60, save_cookies_file=None, try_clicking=False)[source]
Bypass Cloudflare check and get its cookies.
- Parameters:
url (str) – Get Cloudflare cookies using this URL. Setting it to None (default) uses the station_url of this class.
headless (bool) – Whether to run the browser in headless mode, that is, without displaying a browser window. Keeping it False (default) is recommended in case you need to manually bypass Cloudflare.
save_cookies_file (str, None) – Path to save the new cookies. Default is None, meaning the cookies are not saved.
timeout (float) – Try to finish the Cloudflare test within timeout seconds.
try_clicking (bool) – Try to repeatedly click the verification box. MAY CAUSE THE WEBSITE TO GET STUCK IN THE VERIFICATION PAGE.
- nodriver_request_page_content(url, browser=None, headless=True, is_json=False, thread_delay=None, page_stay_time=None)[source]
Download webpage content with nodriver.
For those sites having strong anti-crawling measures, try using this function to bypass them.
- Parameters:
url (str) – The URL of the page to download.
browser (nodriver.Browser, None) – Whether to use an existing browser instance.
headless (bool) – Whether to run the browser in headless mode. Default is True. Only works when browser is None.
is_json (bool) – Whether the result is JSON text. Default is False.
thread_delay (float, Callable, None) – Delay before each thread runs. Default is None. Used to deal with websites like Pixiv that restrict the number of requests within a certain period of time.
page_stay_time (float, None) – Force the page to stay open for page_stay_time seconds so that it can be fully loaded. Default is None, meaning no time restriction.
- Returns:
The HTML content of the webpage.
- nodriver_threading_request_page_content(url_list, restriction_num=None, is_json=False, thread_delay=None, batch_num=None, batch_delay=0.0, headless=True, deconstruct_browser=False, page_stay_time=None)[source]
Download the content of multiple webpages using asynchronous coroutines (similar to threads) with nodriver.
For those sites having strong anti-crawling measures, try using this function to bypass them.
- Parameters:
url_list (list[str]) – The list of URLs of the pages to download.
restriction_num (int, None) – Only download the first restriction_num pages. Set to None (default) for no restriction.
is_json (bool or Iterable instance) – Whether the result is JSON text. Can be a bool or an iterable object with the same length as url_list. Default is False.
thread_delay (float, Callable, None) – Delay before each thread runs. Default is None. Used to deal with websites like Pixiv that restrict the number of requests within a certain period of time.
batch_num (int) – Number of pages in each batch; use it together with batch_delay to wait for a certain period of time after downloading each batch. Used to deal with websites like Pixiv that restrict the number of requests within a certain period of time.
batch_delay (float, Callable) – Delay (in seconds) after each batch is downloaded. Used to deal with websites like Pixiv that restrict the number of requests within a certain period of time.
headless (bool) – Whether to run the browser in headless mode (without a visible window). Default is True; setting it to False is helpful for debugging and for bypassing some anti-crawling measures.
deconstruct_browser (int) – Whether to deconstruct all instances and clear caches upon finishing. Can improve performance in restricted environments.
page_stay_time (float, None) – Force the page to stay open for page_stay_time seconds so that it can be fully loaded. Default is None, meaning no time restriction.
- Returns:
A list of the HTML contents of the webpages, in the same order as url_list.
- request_page_content(url, session=<requests.Session object>, headers=<image_crawler_utils.utils.Empty object>, thread_delay=None)[source]
Download webpage content.
- Parameters:
url (str) – The URL of the page to download.
session (the requests module, or requests.Session) – Can be the requests module itself (from import requests) or a requests.Session() instance.
headers (dict, Callable, None) – Headers for the current request. Set to None (default) to use the headers from self.crawler_settings.download_config.result_headers.
thread_delay (None | float | Callable) – Delay before each thread runs. Default is None. Used to deal with websites like Pixiv that restrict the number of requests within a certain period of time.
- Returns:
The HTML content of the webpage.
- abstractmethod run()[source]
MUST BE OVERRIDDEN. Generate a list of ImageInfo, containing image URLs, names and info.
- save_to_pkl(pkl_file)[source]
Save the parser in a .pkl file.
- threading_request_page_content(url_list, restriction_num=None, session=<requests.Session object>, headers=<image_crawler_utils.utils.Empty object>, thread_delay=None, batch_num=None, batch_delay=0.0)[source]
Download the content of multiple webpages using threading.
- Parameters:
url_list (list[str]) – The list of URLs of the pages to download.
restriction_num (int, None) – Only download the first restriction_num pages. Set to None (default) for no restriction.
session (the requests module, or requests.Session) – Can be the requests module itself (from import requests) or a requests.Session() instance.
headers (dict, list, Callable, None) – Headers for the current threading requests. Set to None (default) to use the headers from self.crawler_settings.download_config.result_headers. If it is a list, it should have the same length as url_list, and url_list[i] will use the headers in headers[i]. Each element of the list can be a dict or a function.
thread_delay (float, Callable, None) – Delay before each thread runs. Default is None. Used to deal with websites like Pixiv that restrict the number of requests within a certain period of time.
batch_num (int | None) – Number of pages in each batch; use it together with batch_delay to wait for a certain period of time after downloading each batch. Used to deal with websites like Pixiv that restrict the number of requests within a certain period of time.
batch_delay (float | Callable) – Delay (in seconds) after each batch is downloaded. Used to deal with websites like Pixiv that restrict the number of requests within a certain period of time.
- Returns:
A list of the HTML contents of the webpages, in the same order as url_list.
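How restriction_num, batch_num and batch_delay interact can be pictured with the following stdlib-only sketch. It is not the library's implementation: fetch_in_batches, resolve_delay and fake_fetch are hypothetical names, the network request is replaced by a stub, and treating a Callable delay as a function that returns a number of seconds is an assumption. URLs are processed batch_num at a time, with a pause between batches, and the results keep the order of url_list.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Union

Delay = Union[float, Callable[[], float]]


def resolve_delay(delay: Delay) -> float:
    # Assumption: a callable delay returns the number of seconds to wait.
    return float(delay()) if callable(delay) else float(delay)


def fake_fetch(url: str) -> str:
    # Stub standing in for a real page request.
    return f"<html>{url}</html>"


def fetch_in_batches(
    url_list: list,
    restriction_num: Optional[int] = None,
    batch_num: Optional[int] = None,
    batch_delay: Delay = 0.0,
    fetch: Callable[[str], str] = fake_fetch,
) -> list:
    # Keep only the first restriction_num URLs when a limit is given.
    urls = url_list if restriction_num is None else url_list[:restriction_num]
    # Without batch_num, everything is treated as a single batch.
    size = batch_num if batch_num else max(len(urls), 1)
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for start in range(0, len(urls), size):
            batch = urls[start:start + size]
            # pool.map yields results in input order, so url_list order is kept.
            results.extend(pool.map(fetch, batch))
            # Pause after each batch except the last one.
            if start + size < len(urls):
                time.sleep(resolve_delay(batch_delay))
    return results


pages = fetch_in_batches(
    [f"https://example.com/page/{i}" for i in range(5)],
    restriction_num=4,
    batch_num=2,
    batch_delay=lambda: 0.01,
)
```

Here five URLs are given, restriction_num keeps the first four, and they are fetched as two batches of two with a short pause in between.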
KeywordParser class
If your task is to download from the result of searching with a query string, it is recommended to inherit KeywordParser, which inherits the base Parser class with several parameters and attribute functions defined specifically for this purpose.
For parameters and attribute functions in the original Parser class, please read the documentation above.
- class image_crawler_utils.KeywordParser(station_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, standard_keyword_string=None, keyword_string=None, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''), accept_empty=False)[source]
Bases: Parser
A Parser for fetching results from keyword searching.
- Parameters:
station_url (str) –
The URL of the main page of a website.
This parameter works when several websites use the same structure. For example, https://yande.re/ and https://konachan.com/ both use Moebooru to build their websites, and this parameter must be filled to deal with these sites respectively.
For websites like https://www.pixiv.net/, as no other website uses its structure, this parameter has already been initialized and does not need to be filled.
crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.
standard_keyword_string (str) – Query keyword string using standard syntax. Refer to the documentation for detailed instructions.
keyword_string (str, None) –
If you want to directly specify the keywords used in searching, set keyword_string to a custom non-empty string. It will OVERWRITE standard_keyword_string.
For example, setting keyword_string to "kuon_(utawarerumono) rating:safe" in DanbooruKeywordParser means searching directly with this string in Danbooru, and its standard keyword string equivalent is "kuon_(utawarerumono) AND rating:safe".
cookies (image_crawler_utils.Cookies, list, dict, str, None) –
Cookies used in loading websites.
accept_empty (bool) – If set to False (default), a critical error will be thrown when both standard_keyword_string and keyword_string are empty strings (like '' or ' '). If set to True, no error will be thrown and the parameters are accepted.
- display_all_configs()[source]
Display all config info. Dataclasses will be displayed in a neater way.
- generate_standard_keyword_string(keyword_tree=None)[source]
Generate a standard keyword string.
The generated result may differ from the standard_keyword_string input.
- Parameters:
keyword_tree (KeywordLogicTree | None) –
The KeywordLogicTree that a standard keyword string will be built from. Setting it to None (default) uses the KeywordLogicTree generated from the standard_keyword_string parameter.
ATTENTION: When set to None, the standard keyword string may not be exactly the same as standard_keyword_string.
- Returns:
A standard keyword string.
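The interplay of standard_keyword_string, keyword_string and accept_empty described in the parameters above can be summarised by the sketch below. resolve_query and drop_and are hypothetical helpers, not part of the library; the sketch raises a plain ValueError where the library reports a critical error, and the convert argument stands for whatever site-specific conversion your Parser implements.

```python
from typing import Callable, Optional


def resolve_query(
    standard_keyword_string: Optional[str],
    keyword_string: Optional[str],
    convert: Callable[[str], str],
    accept_empty: bool = False,
) -> str:
    # Treat None and whitespace-only strings (like '' or ' ') as empty.
    standard = (standard_keyword_string or "").strip()
    custom = (keyword_string or "").strip()

    # A non-empty keyword_string overrides standard_keyword_string.
    if custom:
        return custom
    if standard:
        return convert(standard)
    # Both are empty: only allowed when accept_empty is True.
    if accept_empty:
        return ""
    raise ValueError("standard_keyword_string and keyword_string are both empty")


def drop_and(text: str) -> str:
    # Toy conversion for illustration only: removes the " AND " operator.
    return text.replace(" AND ", " ")


from_standard = resolve_query("kuon_(utawarerumono) AND rating:safe", None, drop_and)
from_custom = resolve_query("a AND b", "custom query", drop_and)
empty_ok = resolve_query(None, " ", drop_and, accept_empty=True)

try:
    resolve_query("", " ", drop_and)
    empty_error = None
except ValueError as error:
    empty_error = error
```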
Completing the Parser Structure
About the Usage of CrawlerSettings Class
The parameters passed into the CrawlerSettings class are arranged as follows:
- image_num, capacity and page_num are stored in CrawlerSettings().capacity_count_config, which is a CapacityCountConfig class.
  To use one of these parameters in a Parser that received the crawler_settings parameter, write code like
      self.crawler_settings.capacity_count_config.image_num
- headers, proxies, thread_delay, fail_delay, randomize_delay, thread_num, timeout, max_download_time, retry_time and overwrite_images are stored in CrawlerSettings().download_config, which is a DownloadConfig class.
  To use one of these parameters in a Parser that received the crawler_settings parameter, write code like
      self.crawler_settings.download_config.headers
- debug_config and detailed_console_log are used to set up CrawlerSettings().log, which is a Log class that controls logging information.
  If you use .set_logging_file() to set the logging file, CrawlerSettings().log will change accordingly.
  To log information in your custom Parser, write code like
      self.crawler_settings.log.info("LOGGING INFO")
  For detailed information, check out the documentation of the Log class.
- extra_configs is stored in CrawlerSettings().extra_configs.
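To make the grouping concrete, the sketch below mirrors the layout described above with small stand-in classes. They are not the library's classes: only a few of the listed parameters are included, all default values are placeholders chosen for this illustration, and DemoParser with its describe_limits() method is hypothetical. The point is the access path a Parser uses to reach each value.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-ins for CapacityCountConfig and DownloadConfig (defaults are placeholders).
@dataclass
class CapacityCountConfig:
    image_num: Optional[int] = None
    capacity: Optional[float] = None
    page_num: Optional[int] = None


@dataclass
class DownloadConfig:
    headers: Optional[dict] = None
    thread_num: int = 5
    timeout: Optional[float] = 10.0


# Stand-in for CrawlerSettings: flat keyword arguments are sorted into groups.
class CrawlerSettings:
    def __init__(self, image_num=None, capacity=None, page_num=None,
                 headers=None, thread_num=5, timeout=10.0, extra_configs=None):
        self.capacity_count_config = CapacityCountConfig(image_num, capacity, page_num)
        self.download_config = DownloadConfig(headers, thread_num, timeout)
        self.extra_configs = extra_configs if extra_configs is not None else {}


class DemoParser:
    def __init__(self, crawler_settings: CrawlerSettings):
        self.crawler_settings = crawler_settings

    def describe_limits(self) -> str:
        # Read each value through the config group it was stored in.
        count = self.crawler_settings.capacity_count_config.image_num
        threads = self.crawler_settings.download_config.thread_num
        return f"at most {count} images using {threads} threads"


summary = DemoParser(CrawlerSettings(image_num=20, thread_num=3)).describe_limits()
```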
KeywordParser Tips
If you inherit KeywordParser, the first thing suggested is to write a function (like .generate_keyword_string()) that converts the Standard Keyword String (stored in .standard_keyword_string) into the query string for your task and stores it in .keyword_string.
- Run super().__init__() in the __init__() function of the inherited class to generate the self.keyword_tree attribute, which is an image_crawler_utils.keyword.KeywordLogicTree. It is suggested to read the documentation of image_crawler_utils.keyword.KeywordLogicTree first.
- Write the function that generates the .keyword_string attribute from self.keyword_tree.
Also, to be consistent with the preset KeywordParsers, it is suggested to use .keyword_string in preference to the converted .standard_keyword_string whenever .keyword_string is not empty, and to raise an error only if both parameters are empty.
An example (from DanbooruKeywordParser) looks like this:
from typing import Optional, Union

from image_crawler_utils import Cookies, CrawlerSettings, KeywordParser
from image_crawler_utils.keyword import KeywordLogicTree


class DanbooruKeywordParser(KeywordParser):
    # Excerpt: only the keyword-related parts of the class are shown.

    def __init__(
        self,
        station_url: str="https://danbooru.donmai.us/",
        crawler_settings: CrawlerSettings=CrawlerSettings(),
        standard_keyword_string: Optional[str]=None,
        keyword_string: Optional[str]=None,
        cookies: Optional[Union[Cookies, list, dict, str]]=Cookies(),
        replace_url_with_source_level: str="None",
        use_keyword_include: bool=False,
    ):
        # The base class builds self.keyword_tree from the keyword parameters.
        super().__init__(
            station_url=station_url,
            crawler_settings=crawler_settings,
            standard_keyword_string=standard_keyword_string,
            keyword_string=keyword_string,
            cookies=cookies,
        )
        self.replace_url_with_source_level = replace_url_with_source_level.lower()
        self.use_keyword_include = use_keyword_include

    # Generate keyword string from keyword tree
    def __build_keyword_str(self, tree: KeywordLogicTree) -> str:
        # A child is either a plain keyword (str) or a subtree to convert recursively
        if isinstance(tree.lchild, str):
            res1 = tree.lchild
        else:
            res1 = self.__build_keyword_str(tree.lchild)
        if isinstance(tree.rchild, str):
            res2 = tree.rchild
        else:
            res2 = self.__build_keyword_str(tree.rchild)

        # Combine both sides using the Danbooru query syntax
        if tree.logic_operator == "AND":
            return f'({res1} {res2})'
        elif tree.logic_operator == "OR":
            return f'({res1} or {res2})'
        elif tree.logic_operator == "NOT":
            return f'(-{res2})'
        elif tree.logic_operator == "SINGLE":
            return f'{res2}'

    # Basic keyword string
    def generate_keyword_string(self) -> str:
        self.keyword_string = self.__build_keyword_str(self.keyword_tree)
        return self.keyword_string
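To see what the conversion above produces, the same recursion can be run on a small stand-in tree. KeywordTree below is a hypothetical substitute for image_crawler_utils.keyword.KeywordLogicTree that only carries the three attributes the example reads (lchild, rchild, logic_operator). Using an empty string as the unused left child of NOT and SINGLE nodes, and raising ValueError for an unknown operator, are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical stand-in for KeywordLogicTree with only the attributes used here.
@dataclass
class KeywordTree:
    lchild: Union["KeywordTree", str]
    rchild: Union["KeywordTree", str]
    logic_operator: str  # "AND", "OR", "NOT" or "SINGLE"


def build_keyword_str(tree: KeywordTree) -> str:
    # Resolve each child: a plain string is a keyword, anything else is a subtree.
    left = tree.lchild if isinstance(tree.lchild, str) else build_keyword_str(tree.lchild)
    right = tree.rchild if isinstance(tree.rchild, str) else build_keyword_str(tree.rchild)

    if tree.logic_operator == "AND":
        return f"({left} {right})"
    if tree.logic_operator == "OR":
        return f"({left} or {right})"
    if tree.logic_operator == "NOT":
        return f"(-{right})"
    if tree.logic_operator == "SINGLE":
        return f"{right}"
    raise ValueError(f"Unknown logic operator: {tree.logic_operator}")


# Represents: kuon_(utawarerumono) AND (rating:safe OR NOT comic)
tree = KeywordTree(
    lchild="kuon_(utawarerumono)",
    rchild=KeywordTree(
        lchild="rating:safe",
        rchild=KeywordTree(lchild="", rchild="comic", logic_operator="NOT"),
        logic_operator="OR",
    ),
    logic_operator="AND",
)

query = build_keyword_str(tree)
# query == "(kuon_(utawarerumono) (rating:safe or (-comic)))"
```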