Parser Class (DanbooruKeywordParser)
Basic Usage
A Parser will parse image information from websites.
Parsers for a certain website are provided in image_crawler_utils.stations.certain_website; for example, to import the keyword Parser for Danbooru, use from image_crawler_utils.stations.booru import DanbooruKeywordParser.
Parsers should be configured when created, and once you set up a Parser, use image_info_list = Parser.run() to get a list of image information, which can be passed on to Downloader.
DanbooruKeywordParser Class
DanbooruKeywordParser is a typical example of how a Parser works.
The most commonly used parameters and methods of DanbooruKeywordParser are:
- class image_crawler_utils.stations.booru.DanbooruKeywordParser(station_url='https://danbooru.donmai.us/', crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, standard_keyword_string=None, keyword_string=None, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''), replace_url_with_source_level='None', use_keyword_include=False)
Bases: KeywordParser
- Parameters:
crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.
station_url (str) –
The URL of the main page of a website.
This parameter works when several websites use the same structure. For example, https://yande.re/ and https://konachan.com/ both use Moebooru to build their websites, and this parameter must be filled to deal with these sites respectively.
For websites like https://www.pixiv.net/, as no other website uses its structure, this parameter has already been initialized and does not need to be filled.
standard_keyword_string (str) – Query keyword string using standard syntax. Refer to the documentation for detailed instructions.
cookies (image_crawler_utils.Cookies, list, dict, str, None) –
Cookies used in loading websites.
keyword_string (str, None) –
If you want to directly specify the keywords used in searching, set keyword_string to a custom non-empty string. It will OVERWRITE standard_keyword_string.
For example, setting keyword_string to "kuon_(utawarerumono) rating:safe" in DanbooruKeywordParser means searching directly with this string in Danbooru, and its standard keyword string equivalent is "kuon_(utawarerumono) AND rating:safe".
replace_url_with_source_level (str, must be one of "All", "File", and "None") –
A level controlling whether the Parser will try to download from the source URL of images instead of from the current website.
- It has 3 available levels, and the default is "None":
"All" or "all" (NOT SUGGESTED): As long as the image has a source URL, try to download from this URL first.
"File" or "file": If the source URL looks like a file (e.g. https://foo.bar/image.png) or it is one of several special websites (e.g. Pixiv or Twitter / X status), try to download from this URL first.
"None" or "none": Do not try to download from any source URL first.
Both source URLs and Danbooru URLs are stored in the ImageInfo class and will be used when downloading. This parameter only controls the priority of the URLs.
Setting a level other than "None" / "none" will reduce the pressure on the Danbooru server but take longer (as source URLs may not be directly accessible, or may be entirely unavailable).
use_keyword_include (bool) –
If this parameter is set to True, KeywordParser will try to find the keyword / tag subgroups with the lowest number of keywords / tags (or subgroups whose number of keywords / tags does not exceed a limit, such as 2 on Danbooru for users without an account) that contain all search results, and choose the one with the fewest result pages.
Only works when standard_keyword_string is used. When keyword_string is specified, this parameter is ignored.
For example, if standard_keyword_string is set to "kuon_(utawarerumono) AND rating:safe OR utawarerumono", then the Parser will check "kuon_(utawarerumono) OR utawarerumono" and "rating:safe OR utawarerumono" and select the group with the fewest result pages as the keyword string in later queries.
If no subgroup with 2 or fewer keywords / tags exists (e.g. "kuon_(utawarerumono) OR rating:safe OR utawarerumono"), the Parser will try to find the keyword / tag subgroups with the fewest keywords / tags. This may often CAUSE ERRORS, so make a quick check of your keywords before setting this parameter to True.
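The subgroup search described for use_keyword_include can be pictured with a short sketch. This is not the library's implementation: covering_subgroups, pick_subgroup, the page_count callable and the max_keywords limit are all hypothetical names, and the query is assumed to be already split into a list of AND-groups joined by OR.

```python
from itertools import product


def covering_subgroups(dnf):
    """Expand an OR-of-ANDs query into OR-subgroups that cover all its results.

    `dnf` is a list of AND-groups, e.g. [["a", "b"], ["c"]] for "a AND b OR c".
    Taking one keyword from every AND-group gives an OR-subgroup whose results
    are a superset of the results of the original query.
    """
    groups = []
    for combo in product(*dnf):
        group = tuple(dict.fromkeys(combo))  # drop duplicates, keep order
        if group not in groups:
            groups.append(group)
    return groups


def pick_subgroup(dnf, page_count, max_keywords=2):
    """Pick the allowed subgroup with the fewest result pages.

    `page_count` is a hypothetical callable mapping a subgroup to its number
    of result pages (a real Parser would have to query the website for it).
    """
    groups = covering_subgroups(dnf)
    allowed = [g for g in groups if len(g) <= max_keywords]
    if not allowed:
        # Fall back to the subgroups with the fewest keywords.
        smallest = min(len(g) for g in groups)
        allowed = [g for g in groups if len(g) == smallest]
    return min(allowed, key=page_count)


query = [["kuon_(utawarerumono)", "rating:safe"], ["utawarerumono"]]
print(covering_subgroups(query))
```

For the example query above, the two subgroups are the same ones named in the parameter description; which one is chosen depends entirely on the page counts supplied.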
- classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)
Load the parser from .pkl file.
ATTENTION: You should use the corresponding Parser class when loading. For example, loading DanbooruKeywordParser should use DanbooruKeywordParser.load_from_pkl().
- Parameters:
pkl_file (str, None) – Name of the pkl file.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
The Parser loaded from the pkl file, or None if failed.
- Return type:
- display_all_configs()
Display all config info. Dataclasses will be displayed in a neater way.
- run()
The main function that runs the Parser and returns a list of image_crawler_utils.ImageInfo.
- save_to_pkl(pkl_file)
Save the parser in a .pkl file.
Examples of DanbooruKeywordParser
The following example parses information of images with the keywords kuon_(utawarerumono) and rating:safe from Danbooru:
from image_crawler_utils.stations.booru import DanbooruKeywordParser

parser = DanbooruKeywordParser(
    crawler_settings=crawler_settings,  # Must be defined in advance
    standard_keyword_string="kuon_(utawarerumono) AND rating:safe",
)
image_info_list = parser.run()
A Parser class can be saved by .save_to_pkl(), or loaded with .load_from_pkl() from its corresponding class (e.g. DanbooruKeywordParser.load_from_pkl()), like:
from image_crawler_utils.stations.booru import DanbooruKeywordParser

parser = DanbooruKeywordParser(
    crawler_settings=crawler_settings,  # Must be defined in advance
    standard_keyword_string="kuon_(utawarerumono) AND rating:safe",
)

# Save a DanbooruKeywordParser
parser.save_to_pkl('parser.pkl')

# Load a DanbooruKeywordParser
new_parser = DanbooruKeywordParser.load_from_pkl('parser.pkl')
Use .display_all_configs() to check all parameters of the current Parser.
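As a further illustration, the URL priority controlled by replace_url_with_source_level can be sketched in plain Python. This is only an assumption about the behavior described in the parameter list, not the library's code: order_urls, looks_like_file and FILE_EXTENSIONS are hypothetical names, and the special handling of sites such as Pixiv or Twitter / X is left out.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical list of extensions treated as "looks like a file".
FILE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def looks_like_file(url: str) -> bool:
    """Return True if the URL path ends with a known image extension."""
    return urlparse(url).path.lower().endswith(FILE_EXTENSIONS)


def order_urls(
    site_url: str, source_url: Optional[str], level: str = "None"
) -> Tuple[str, List[str]]:
    """Return (url, backup_urls) for the given priority level."""
    level = level.lower()
    if level not in ("all", "file", "none"):
        raise ValueError(f"Unknown level: {level!r}")
    if not source_url:
        return site_url, []
    prefer_source = level == "all" or (
        level == "file" and looks_like_file(source_url)
    )
    if prefer_source:
        return source_url, [site_url]
    return site_url, [source_url]


site = "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png"
source = "https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png"
print(order_urls(site, source, "None"))
print(order_urls(site, source, "File"))
```

In both calls the two URLs are kept; only their order changes, which matches the statement that the level controls priority rather than availability.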
Standard Keyword String
As different stations may have different syntaxes for keyword searching, Image Crawler Utils uses a standard syntax to parse the keyword string. It can be used in most preset Parsers through the standard_keyword_string parameter.
The grammar is like:
Logic symbols:
- AND / & means searching images with both keywords / tags.
- OR / | means searching images with either of the keywords / tags.
- NOT / ! means searching images without this keyword / tag.
- [ and ] work like brackets in normal expressions, increasing the priority of the keyword / tag string included.
It is STRONGLY recommended to use [ and ] in order to avoid ambiguity.
The priority of logic symbols is the same as in the C language, which is: OR < AND < NOT < [ = ]
Important
( and ) are considered part of the keywords / tags instead of a logic symbol.
Escape characters: Add \ before any of the characters above except ( and ) to represent the character itself (like \&), while \\ represents \.
Tip
\[ and \] are not escape characters in Python.
If two keywords / tags have no logic symbols in between, they will be considered one keyword / tag connected by _. For example, kuon (utawarerumono) works the same as kuon_(utawarerumono).
Keyword wildcards: * can be replaced with any string (including the empty string).
- *key means all keywords / tags that end with key. For example, *dress can match dress and chinadress.
- key* means all keywords / tags that start with key. For example, dress* can match dress and dress_shirt.
- *key* means all keywords / tags that contain key. For example, *dress* can match dress, chinadress and dress_shirt.
- ke*y means all keywords / tags that start with ke and end with y. For example, satono*(umamusume) can match satono_diamond_(umamusume) and satono_crown_(umamusume).
These wildcards can be combined, like *ke*y.
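The wildcard rules above can be checked with a small sketch. The helpers wildcard_to_regex and matches are hypothetical and only illustrate the matching rule; how each Parser actually resolves wildcards against a website is not shown here.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Turn a keyword with * wildcards into a compiled regular expression.

    Every literal piece is escaped, so characters such as ( and ) in tags
    are matched as themselves; each * becomes "any string, possibly empty".
    """
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def matches(pattern: str, tag: str) -> bool:
    """Return True if the whole tag matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(tag) is not None


for tag in ("dress", "chinadress", "dress_shirt"):
    print(tag, matches("*dress", tag), matches("dress*", tag), matches("*dress*", tag))
```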
Example: *dress AND NOT [kuon (utawarerumono) OR chinadress] means searching for images that have a keyword ending with dress, while excluding those that have the keyword kuon_(utawarerumono) or chinadress.
Important
Some sites may not support all of the syntaxes above, or have restrictions on keyword searching. Refer to the corresponding Parser class documentation for more details.
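To make the grammar more concrete, here is a minimal recursive-descent sketch of the standard syntax. It is an independent illustration, not the parser shipped with Image Crawler Utils: tokenize, parse and the nested-tuple output format are assumptions made for this example, and wildcards are simply kept as part of the keyword text.

```python
SYMBOLS = {"&": "AND", "|": "OR", "!": "NOT", "[": "[", "]": "]"}
WORDS = {"AND", "OR", "NOT"}


def tokenize(text):
    """Split a keyword string into ("OP", ...) and ("KW", ...) tokens."""
    tokens, word, escaped = [], "", False

    def flush():
        nonlocal word, escaped
        if word:
            if word in WORDS and not escaped:
                tokens.append(("OP", word))
            elif tokens and tokens[-1][0] == "KW":
                # Adjacent keywords are joined with "_".
                tokens[-1] = ("KW", tokens[-1][1] + "_" + word)
            else:
                tokens.append(("KW", word))
        word, escaped = "", False

    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            word += text[i + 1]  # escaped character is taken literally
            escaped = True
            i += 2
            continue
        if ch in SYMBOLS:
            flush()
            tokens.append(("OP", SYMBOLS[ch]))
        elif ch.isspace():
            flush()
        else:
            word += ch
        i += 1
    flush()
    return tokens


def parse(text):
    """Parse with priority OR < AND < NOT < [ ] into nested tuples."""
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("Unexpected end of keyword string")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        node = parse_and()
        while peek() == ("OP", "OR"):
            take()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        while peek() == ("OP", "AND"):
            take()
            node = ("AND", node, parse_not())
        return node

    def parse_not():
        if peek() == ("OP", "NOT"):
            take()
            return ("NOT", parse_not())
        kind, value = take()
        if kind == "KW":
            return value
        if value != "[":
            raise ValueError(f"Unexpected symbol {value!r}")
        node = parse_or()
        if take() != ("OP", "]"):
            raise ValueError("Missing ]")
        return node

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"Unexpected token {peek()[1]!r}")
    return tree


print(parse("*dress AND NOT [kuon (utawarerumono) OR chinadress]"))
```

Each binary operator is its own function, and the lower-priority function calls the higher-priority one, which is what produces the OR < AND < NOT < [ ] ordering.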
ImageInfo class
The result of a Parser is (and must be) a list of ImageInfo objects. The ImageInfo class is structured as follows:
- class image_crawler_utils.ImageInfo(url, name, info=<factory>, backup_urls=<factory>)
Bases: object
A class consisting of image URL, name, info and backup URLs.
Can be used to download images and write result to files.
- backup_urls: Iterable[str]
When downloading from .url fails, try downloading from the URLs in the .backup_urls list.
- info: dict
A dict containing information of the image.
info will not affect the Downloader directly. It only works if you set the image_info_filter parameter in the Downloader class.
Different sites may have different info structures, which are defined respectively by their Parsers.
ATTENTION: If you define your own info structure, please ENSURE it can be JSON-serialized (e.g. the values of the dict should be int, float, str, list, dict, etc.) in order to make it compatible with save_image_infos() and load_image_infos().
- name: str
Name of the image when saved.
- url: str
The URL used AT FIRST in downloading the image.
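The four fields can be pictured with a stand-in dataclass. ImageInfoSketch below is not the library's class; it only mirrors the documented field names so that the download order (.url first, then .backup_urls) and the JSON-serializability requirement on info can be tried without the library. candidate_urls and info_is_json_serializable are hypothetical helpers.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ImageInfoSketch:
    """Stand-in with the documented fields; not the real ImageInfo."""
    url: str
    name: str
    info: dict = field(default_factory=dict)
    backup_urls: list = field(default_factory=list)


def candidate_urls(image_info):
    """Yield .url first, then every backup URL in order."""
    yield image_info.url
    yield from image_info.backup_urls


def info_is_json_serializable(image_info) -> bool:
    """Check whether .info could be written to a JSON file."""
    try:
        json.dumps(image_info.info)
        return True
    except (TypeError, ValueError):
        return False


good = ImageInfoSketch(
    url="https://example.com/a.png",
    name="a.png",
    info={"tags": ["solo"], "score": 10},
    backup_urls=["https://example.org/a.png"],
)
bad = ImageInfoSketch(url="https://example.com/b.png", name="b.png", info={"tags": {"solo"}})

print(list(candidate_urls(good)))
print(info_is_json_serializable(good), info_is_json_serializable(bad))
```

The second object fails the check because a set is not a JSON type; replacing it with a list would fix that.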
Save and Load the List of ImageInfo Class
The list of ImageInfo class can be saved with image_crawler_utils.save_image_infos() and loaded with image_crawler_utils.load_image_infos():
- image_crawler_utils.save_image_infos(image_info_list, json_file, encoding='UTF-8', display_progress=True, log=<image_crawler_utils.log.Log object>)
Save the ImageInfo list into a JSON file.
ONLY WORKS IF the info can be JSON serialized.
- Parameters:
image_info_list (Iterable[image_crawler_utils.ImageInfo]) – An iterable (e.g. list or tuple) of image_crawler_utils.ImageInfo.
json_file (str) – Name / Path of the JSON file. Suffix (.json) is optional.
encoding (str) – Encoding of the JSON file.
display_progress (bool) – Display a rich progress bar when running. Progress bar will be hidden after finishing.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
(Saved file name, Absolute path of the saved file), or None if failed.
- Return type:
- image_crawler_utils.load_image_infos(json_file, encoding='UTF-8', display_progress=True, log=<image_crawler_utils.log.Log object>)
Load the ImageInfo list from a JSON file.
ONLY WORKS IF the info can be JSON serialized.
- Parameters:
json_file (str) – Name / Path of the JSON file.
encoding (str) – Encoding of the JSON file.
display_progress (bool) – Display a rich progress bar when running. Progress bar will be hidden after finishing.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
List of ImageInfo, or None if failed.
- Return type:
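The save / load round trip can be approximated with the standard json module. The sketch below assumes a stand-in dataclass (ImageInfoSketch) and two hypothetical helpers, save_infos_sketch and load_infos_sketch; it does not reproduce the progress bar, logging or error handling of the real functions, and the on-disk layout it writes is an assumption.

```python
import json
import os
from dataclasses import asdict, dataclass, field


@dataclass
class ImageInfoSketch:
    """Stand-in with the documented fields; not the real ImageInfo."""
    url: str
    name: str
    info: dict = field(default_factory=dict)
    backup_urls: list = field(default_factory=list)


def save_infos_sketch(image_info_list, json_file, encoding="UTF-8"):
    """Write the list as JSON and return (file name, absolute path)."""
    if not json_file.endswith(".json"):
        json_file += ".json"  # the suffix is optional for the caller
    with open(json_file, "w", encoding=encoding) as f:
        json.dump([asdict(item) for item in image_info_list], f,
                  ensure_ascii=False, indent=4)
    return os.path.basename(json_file), os.path.abspath(json_file)


def load_infos_sketch(json_file, encoding="UTF-8"):
    """Read the JSON file back into a list of stand-in objects."""
    with open(json_file, "r", encoding=encoding) as f:
        return [ImageInfoSketch(**item) for item in json.load(f)]
```

A round trip only reproduces the original list when every value in info is a JSON type, which is the reason for the "ONLY WORKS IF" notes above.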
Examples of ImageInfo
A JSON-converted example of ImageInfo generated by DanbooruKeywordParser from image ID 4994142 is like:
{
"url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
"name": "Danbooru 4994142 cd91f0000b9574bf142d125a1e886e5c.png",
"info": {
"info": {
"id": 4994142,
"created_at": "2021-12-21T08:02:13.706-05:00",
"uploader_id": 772564,
"score": 10,
"source": "https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png",
"md5": "cd91f0000b9574bf142d125a1e886e5c",
"last_comment_bumped_at": null,
"rating": "s",
"image_width": 2000,
"image_height": 2828,
"tag_string": "1girl absurdres animal_ears black_eyes black_hair coat grabbing_own_breast hair_ornament hairband highres holding holding_mask japanese_clothes kuon_(utawarerumono) long_hair looking_at_viewer mask ponytail shirokuro_neko_(ouma_haruka) smile solo utawarerumono utawarerumono:_itsuwari_no_kamen",
"fav_count": 10,
"file_ext": "png",
"last_noted_at": null,
"parent_id": null,
"has_children": false,
"approver_id": null,
"tag_count_general": 17,
"tag_count_artist": 1,
"tag_count_character": 1,
"tag_count_copyright": 2,
"file_size": 4527472,
"up_score": 10,
"down_score": 0,
"is_pending": false,
"is_flagged": false,
"is_deleted": false,
"tag_count": 23,
"updated_at": "2024-07-10T12:21:31.782-04:00",
"is_banned": false,
"pixiv_id": 83599609,
"last_commented_at": null,
"has_active_children": false,
"bit_flags": 0,
"tag_count_meta": 2,
"has_large": true,
"has_visible_children": false,
"media_asset": {
"id": 5056745,
"created_at": "2021-12-21T08:02:04.132-05:00",
"updated_at": "2023-03-02T04:43:15.608-05:00",
"md5": "cd91f0000b9574bf142d125a1e886e5c",
"file_ext": "png",
"file_size": 4527472,
"image_width": 2000,
"image_height": 2828,
"duration": null,
"status": "active",
"file_key": "nxj2jBet8",
"is_public": true,
"pixel_hash": "5d34bcf53ddde76fd723f29aae5ebc53",
"variants": [
{
"type": "180x180",
"url": "https://cdn.donmai.us/180x180/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg",
"width": 127,
"height": 180,
"file_ext": "jpg"
},
{
"type": "360x360",
"url": "https://cdn.donmai.us/360x360/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg",
"width": 255,
"height": 360,
"file_ext": "jpg"
},
{
"type": "720x720",
"url": "https://cdn.donmai.us/720x720/cd/91/cd91f0000b9574bf142d125a1e886e5c.webp",
"width": 509,
"height": 720,
"file_ext": "webp"
},
{
"type": "sample",
"url": "https://cdn.donmai.us/sample/cd/91/sample-cd91f0000b9574bf142d125a1e886e5c.jpg",
"width": 850,
"height": 1202,
"file_ext": "jpg"
},
{
"type": "original",
"url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
"width": 2000,
"height": 2828,
"file_ext": "png"
}
]
},
"tag_string_general": "1girl animal_ears black_eyes black_hair coat grabbing_own_breast hair_ornament hairband holding holding_mask japanese_clothes long_hair looking_at_viewer mask ponytail smile solo",
"tag_string_character": "kuon_(utawarerumono)",
"tag_string_copyright": "utawarerumono utawarerumono:_itsuwari_no_kamen",
"tag_string_artist": "shirokuro_neko_(ouma_haruka)",
"tag_string_meta": "absurdres highres",
"file_url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
"large_file_url": "https://cdn.donmai.us/sample/cd/91/sample-cd91f0000b9574bf142d125a1e886e5c.jpg",
"preview_file_url": "https://cdn.donmai.us/180x180/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg"
},
"family_group": null,
"tags": [
"1girl",
"absurdres",
"animal_ears",
"black_eyes",
"black_hair",
"coat",
"grabbing_own_breast",
"hair_ornament",
"hairband",
"highres",
"holding",
"holding_mask",
"japanese_clothes",
"kuon_(utawarerumono)",
"long_hair",
"looking_at_viewer",
"mask",
"ponytail",
"shirokuro_neko_(ouma_haruka)",
"smile",
"solo",
"utawarerumono",
"utawarerumono:_itsuwari_no_kamen"
],
"tags_class": {
"1girl": "general",
"animal_ears": "general",
"black_eyes": "general",
"black_hair": "general",
"coat": "general",
"grabbing_own_breast": "general",
"hair_ornament": "general",
"hairband": "general",
"holding": "general",
"holding_mask": "general",
"japanese_clothes": "general",
"long_hair": "general",
"looking_at_viewer": "general",
"mask": "general",
"ponytail": "general",
"smile": "general",
"solo": "general",
"kuon_(utawarerumono)": "character",
"utawarerumono": "copyright",
"utawarerumono:_itsuwari_no_kamen": "copyright",
"shirokuro_neko_(ouma_haruka)": "artist",
"absurdres": "meta",
"highres": "meta"
}
},
"backup_urls": [
"https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png"
]
}
If you want to get the tags of this image (assuming image_info is an ImageInfo instance), you should use image_info.info["tags"] instead of image_info["info"]["tags"] or image_info.info.tags.
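The difference between the three spellings can be seen with a stand-in object. ImageInfoSketch is a hypothetical dataclass with the same field names as ImageInfo, and try_access is a helper written only for this illustration; the point is that info is an attribute holding a plain dict.

```python
from dataclasses import dataclass, field


@dataclass
class ImageInfoSketch:
    """Stand-in with the documented fields; not the real ImageInfo."""
    url: str
    name: str
    info: dict = field(default_factory=dict)
    backup_urls: list = field(default_factory=list)


def try_access(func):
    """Run func and return its result, or the name of the error raised."""
    try:
        return func()
    except (TypeError, AttributeError) as exc:
        return type(exc).__name__


image_info = ImageInfoSketch(
    url="https://example.com/a.png",
    name="a.png",
    info={"tags": ["1girl", "solo"]},
)

# Attribute access for the field, then key access for the dict.
print(try_access(lambda: image_info.info["tags"]))
# The object itself is not a dict, so subscripting it fails.
print(try_access(lambda: image_info["info"]["tags"]))
# The dict has keys, not attributes, so .tags fails as well.
print(try_access(lambda: image_info.info.tags))
```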