Parser Class (DanbooruKeywordParser)

Basic Usage

A Parser collects image information from a website.

Parsers for a given website are provided in image_crawler_utils.stations.certain_website. For example, to import the keyword Parser for Danbooru, use from image_crawler_utils.stations.booru import DanbooruKeywordParser.

Configure a Parser when you create it. Once it is set up, use image_info_list = Parser.run() to get a list of image information, which can then be passed on to a Downloader.

DanbooruKeywordParser Class

DanbooruKeywordParser is a typical example of how a Parser works.

The most commonly used parameters of DanbooruKeywordParser are:

class image_crawler_utils.stations.booru.DanbooruKeywordParser(station_url='https://danbooru.donmai.us/', crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, standard_keyword_string=None, keyword_string=None, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''), replace_url_with_source_level='None', use_keyword_include=False)[source]

Bases: KeywordParser

Parameters:
  • crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.

  • station_url (str) –

    The URL of the main page of a website.

    • This parameter works when several websites use the same structure. For example, https://yande.re/ and https://konachan.com/ both use Moebooru to build their websites, and this parameter must be filled to deal with these sites respectively.

    • For websites like https://www.pixiv.net/, as no other website uses its structure, this parameter has already been initialized and does not need to be filled.

  • standard_keyword_string (str) – Query keyword string using standard syntax. Refer to the documentation for detailed instructions.

  • cookies (image_crawler_utils.Cookies, list, dict, str, None) –

    Cookies used in loading websites.

  • keyword_string (str, None) –

    If you want to directly specify the keywords used in searching, set keyword_string to a custom non-empty string. It will OVERRIDE standard_keyword_string.

    • For example, setting keyword_string to "kuon_(utawarerumono) rating:safe" in DanbooruKeywordParser searches Danbooru directly with this string; its standard keyword string equivalent is "kuon_(utawarerumono) AND rating:safe".

  • replace_url_with_source_level (str, must be one of "All", "File", and "None") –

    A level controlling whether the Parser will try to download from the source URL of images instead of from the current website.

    • It has 3 available levels; the default is "None":
      • "All" or "all" (NOT RECOMMENDED): As long as the image has a source URL, try to download from this URL first.

      • "File" or "file": If the source URL looks like a file (e.g. https://foo.bar/image.png) or it belongs to one of several special websites (e.g. Pixiv or Twitter / X status), try to download from this URL first.

      • "None" or "none": Do not try to download from any source URL first.

    • Both source URLs and Danbooru URLs are stored in the ImageInfo class and will be used when downloading. This parameter only controls the priority of the URLs.

    • Setting a level other than "None" / "none" reduces the load on the Danbooru server but takes longer (as source URLs may not be directly accessible, or may be unavailable altogether).

  • use_keyword_include (bool) –

    If this parameter is set to True, KeywordParser will try to find the keyword / tag subgroup with the fewest keywords / tags (or a subgroup whose number of keywords / tags stays within a threshold, like 2 in Danbooru for those without an account) that contains all search results and spans the fewest result pages.

    • Only works when standard_keyword_string is used. When keyword_string is specified, this parameter is ignored.

    • For example, if standard_keyword_string is set to "kuon_(utawarerumono) AND rating:safe OR utawarerumono", then the Parser will check "kuon_(utawarerumono) OR utawarerumono" and "rating:safe OR utawarerumono" and select the group whose results span the fewest pages as the keyword string for later queries.

    • If no subgroup within this threshold exists (e.g. "kuon_(utawarerumono) OR rating:safe OR utawarerumono"), the Parser will try to find the keyword / tag subgroup with the fewest keywords / tags. This may often CAUSE ERRORS, so check your keywords before setting this parameter to True.
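The subgroups in the example above can be pictured as follows: if the query is written as AND groups joined by OR, picking one keyword from each AND group gives an OR-only query whose results contain every result of the original query. The sketch below only illustrates that idea; covering_subgroups is a hypothetical helper, not part of Image Crawler Utils, and the library's actual selection logic may differ.

```python
from itertools import product


def covering_subgroups(or_of_and_groups):
    """Return OR-only keyword tuples that cover all results of the query.

    `or_of_and_groups` is a list of AND groups joined by OR, e.g.
    [["a", "b"], ["c"]] stands for "a AND b OR c".
    """
    # Picking one keyword per AND group yields a superset of the original results.
    return list(product(*or_of_and_groups))


# "kuon_(utawarerumono) AND rating:safe OR utawarerumono"
query = [["kuon_(utawarerumono)", "rating:safe"], ["utawarerumono"]]
subgroups = covering_subgroups(query)
for group in subgroups:
    print(" OR ".join(group))
```

A Parser with use_keyword_include=True would then compare such candidates by their number of result pages, which this sketch does not attempt to reproduce.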

classmethod load_from_pkl(pkl_file, log=<image_crawler_utils.log.Log object>)

Load the parser from a .pkl file.

ATTENTION: You should use the corresponding Parser class when loading. For example, loading a DanbooruKeywordParser should use DanbooruKeywordParser.load_from_pkl().

Parameters:
Returns:

A CrawlerSettings class loaded from pkl file, or None if failed.

Return type:

CrawlerSettings

display_all_configs()

Display all config info. Dataclasses will be displayed in a neater way.

run()[source]

The main function that runs the Parser and returns a list of image_crawler_utils.ImageInfo.

Return type:

list[ImageInfo]

save_to_pkl(pkl_file)

Save the parser in a .pkl file.

Parameters:
  • path (str) – Path to save the pkl file. Default is saving to the current path.

  • pkl_file (str, None) – Name of the pkl file. (Suffix is optional.)

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

Examples of DanbooruKeywordParser

The following example parses information for images with the keywords kuon_(utawarerumono) and rating:safe from Danbooru:

from image_crawler_utils.stations.booru import DanbooruKeywordParser

parser = DanbooruKeywordParser(
    crawler_settings=crawler_settings,  # Must be defined in advance
    standard_keyword_string="kuon_(utawarerumono) AND rating:safe",
)
image_info_list = parser.run()

A Parser can be saved with .save_to_pkl() and loaded with .load_from_pkl() from its corresponding class (e.g. DanbooruKeywordParser.load_from_pkl()):

from image_crawler_utils.stations.booru import DanbooruKeywordParser

parser = DanbooruKeywordParser(
    crawler_settings=crawler_settings,  # Must be defined in advance
    standard_keyword_string="kuon_(utawarerumono) AND rating:safe",
)
# Save a DanbooruKeywordParser
parser.save_to_pkl('parser.pkl')

# Load a DanbooruKeywordParser
new_parser = DanbooruKeywordParser.load_from_pkl('parser.pkl')

Use .display_all_configs() to check all parameters of the current Parser.

Standard Keyword String

As different stations may have different syntaxes for keyword searching, Image Crawler Utils uses a standard syntax to parse the keyword string. It can be used in most preset Parsers through the standard_keyword_string parameter.

The grammar is as follows:

  • Logic symbols:

    • AND / & means searching images with both keywords / tags.

    • OR / | means searching images with either of the keywords / tags.

    • NOT / ! means searching images without this keyword / tag.

    • [ and ] work like brackets in normal expressions, increasing the priority of the keyword / tag string they enclose.

      • It is STRONGLY recommended to use [ and ] in order to avoid ambiguity.

    • The priority of logic symbols is the same as in the C language: OR < AND < NOT < [ = ]

Important

( and ) are considered part of the keywords / tags instead of a logic symbol.

  • Escape characters: Add \ before any of the characters above except ( and ) to represent itself (like \&), while \\ represents \.

Tip

\[ and \] are not escape characters in Python.

  • If two keywords / tags have no logic symbols in between, they will be considered one keyword / tag connected by _. For example, kuon (utawarerumono) works the same as kuon_(utawarerumono).

  • Keyword wildcards: * can be replaced with any string (including the empty string).

    • *key means all keywords / tags that end with key. For example, *dress can match dress and chinadress.

    • key* means all keywords / tags that start with key. For example, dress* can match dress and dress_shirt.

    • *key* means all keywords / tags that contain key. For example, *dress* can match dress, chinadress and dress_shirt.

    • ke*y means all keywords / tags that start with ke and end with y. For example, satono*(umamusume) can match satono_diamond_(umamusume) and satono_crown_(umamusume).

    • These wildcards can be combined, like *ke*y.

Example: *dress AND NOT [kuon (utawarerumono) OR chinadress] means searching for images that have a keyword ending with dress, while excluding those that have the keyword kuon_(utawarerumono) or chinadress.
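To make the grammar concrete, here is a minimal sketch that checks a standard keyword string against the tags of a single image. It is not the parser used by Image Crawler Utils: tokenize, keyword_matches and matches are hypothetical helpers, malformed input is not handled, and * is always treated as a wildcard.

```python
import re

SYMBOLS = {"&": "AND", "|": "OR", "!": "NOT", "[": "[", "]": "]"}
WORD_OPERATORS = {"AND", "OR", "NOT"}


def tokenize(text):
    """Split a keyword string into ("op", x) and ("kw", x) tokens."""
    raw, word, i = [], "", 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            word += text[i + 1]  # an escaped character is kept literally
            i += 2
            continue
        if ch.isspace() or ch in SYMBOLS:
            if word:
                raw.append(("op" if word in WORD_OPERATORS else "kw", word))
                word = ""
            if ch in SYMBOLS:
                raw.append(("op", SYMBOLS[ch]))
        else:
            word += ch
        i += 1
    if word:
        raw.append(("op" if word in WORD_OPERATORS else "kw", word))
    # Keywords with no logic symbol in between are joined with "_".
    tokens = []
    for kind, value in raw:
        if kind == "kw" and tokens and tokens[-1][0] == "kw":
            tokens[-1] = ("kw", tokens[-1][1] + "_" + value)
        else:
            tokens.append((kind, value))
    return tokens


def keyword_matches(pattern, tag):
    """Match one keyword against one tag; "*" stands for any string."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, tag) is not None


def matches(query, tags):
    """Evaluate the query with priority OR < AND < NOT < [ ]."""
    tokens, pos = tokenize(query), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        result = parse_and()
        while peek() == ("op", "OR"):
            take()
            right = parse_and()
            result = result or right
        return result

    def parse_and():
        result = parse_not()
        while peek() == ("op", "AND"):
            take()
            right = parse_not()
            result = result and right
        return result

    def parse_not():
        if peek() == ("op", "NOT"):
            take()
            return not parse_not()
        kind, value = take()
        if (kind, value) == ("op", "["):
            result = parse_or()
            take()  # consume the closing "]"
            return result
        return any(keyword_matches(value, tag) for tag in tags)

    return parse_or()


query = "*dress AND NOT [kuon (utawarerumono) OR chinadress]"
print(matches(query, {"dress", "solo"}))       # True
print(matches(query, {"chinadress", "solo"}))  # False
```

A real Parser translates the expression into the search syntax of the target website rather than filtering tags locally; the sketch only shows how the symbols, priorities and wildcards combine.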

Important

Some sites may not support all of the syntaxes above, or have restrictions on keyword searching. Refer to the corresponding Parser class documentation for more details.

Cookies Class

Cookies are frequently used in Parsers and (sometimes) the Downloader class to obtain information and images. Image Crawler Utils provides the Cookies class as a unified way of loading and using cookies.

A Cookies class can be passed directly as a parameter to Parsers and Downloader classes, saved (image_crawler_utils.Cookies.save_to_json()) and loaded (image_crawler_utils.Cookies.load_from_json()) for later use, and accessed in its different forms (attributes such as .cookies_string) for other purposes.

Important

Once a Cookies class is created, its attributes cannot be changed.

Some functions are provided to fetch cookies from certain websites (this usually requires manual operation due to protections like Cloudflare), such as get_pixiv_cookies and get_twitter_cookies. Their return value is a Cookies class. Please check out their documentation and Notes for Tasks for more details.

class image_crawler_utils.Cookies(cookies=None)[source]

Bases: object

Convert format of cookies between selenium, requests and string.

Use Cookies(cookies_from_certain_source) or Cookies.load_from_json() to create a Cookies class.

Use .cookies_nodriver / .cookies_selenium / .cookies_dict / .cookies_string to get the cookies of suitable format.

Parameters:

cookies (list, dict, str, None) –

Cookies generated from string, dict (requests), list (selenium or nodriver).

  • Leaving this blank (like Cookies()) creates an empty Cookies, whose .is_none() returns True.
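As an illustration of what converting between the string and dict forms involves, the sketch below turns a "name=value; name2=value2" cookie string into a dict and back using only the standard library. Both helper functions are hypothetical and are not the conversion code used inside the Cookies class.

```python
def cookie_string_to_dict(cookie_string):
    """Parse a "name=value; name2=value2" string into a dict."""
    cookies = {}
    for pair in cookie_string.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def cookie_dict_to_string(cookie_dict):
    """Join a dict of cookies back into a single cookie string."""
    return "; ".join(f"{name}={value}" for name, value in cookie_dict.items())


cookies_dict = cookie_string_to_dict("session_id=abc123; theme=dark")
print(cookies_dict)
print(cookie_dict_to_string(cookies_dict))
```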

classmethod load_from_json(json_file, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Load the Cookies from a json file.

ONLY WORKS IF the info can be JSON serialized.

Parameters:
  • json_file (str) – Name / path of json file. Suffix (.json) must be included.

  • encoding (str) – Encoding of JSON file.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

The Cookies, or None if failed.

Return type:

Cookies | None

is_none()[source]

Check whether Cookies is empty (created by None, “”, etc.).

Returns:

A bool, telling whether Cookies is empty.

Return type:

bool

save_to_json(json_file, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Save the Cookies into a json file.

Parameters:
  • json_file (str) – Name / path of json file. (Suffix is optional.)

  • encoding (str) – Encoding of JSON file.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

update_nodriver_cookies(old_nodriver_cookies)[source]

Update nodriver-form cookies. NOT SUGGESTED TO BE USED DIRECTLY.

For every cookie in the input that has the same name as one in the Cookies class, replace its values with those from the Cookies class.

Also add cookies from the Cookies class that do not exist in the input cookies.

Parameters:

old_nodriver_cookies (list[nodriver.cdp.network.Cookie]) – Cookies from nodriver.

Returns:

New nodriver cookies (a list[nodriver.cdp.network.Cookie]).

update_selenium_cookies(old_selenium_cookies)[source]

Update selenium-form cookies.

For every cookie in the input that has the same name as one in the Cookies class, replace its values with those from the Cookies class.

Also add cookies from the Cookies class that do not exist in the input cookies.

Parameters:

old_selenium_cookies (list[dict]) – Cookies from selenium.

Returns:

New selenium cookies (a list[dict]).
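The update rule described above can be sketched on plain selenium-style cookie dicts. merge_selenium_cookies is a hypothetical stand-in, not the library method, and it assumes that "replacing the values" means overwriting the "value" field of a cookie with the same name while keeping its other fields.

```python
def merge_selenium_cookies(old_cookies, own_cookies):
    """Return old_cookies updated with the cookies in own_cookies."""
    own_by_name = {cookie["name"]: cookie for cookie in own_cookies}
    merged = []
    for cookie in old_cookies:
        replacement = own_by_name.pop(cookie["name"], None)
        if replacement is None:
            merged.append(dict(cookie))  # no cookie of this name: keep as-is
        else:
            merged.append({**cookie, "value": replacement["value"]})
    # Cookies that did not exist in the input are appended at the end.
    merged.extend(dict(cookie) for cookie in own_by_name.values())
    return merged


old = [{"name": "a", "value": "1", "domain": "foo.bar.com"}, {"name": "b", "value": "2"}]
own = [{"name": "b", "value": "9"}, {"name": "c", "value": "3"}]
merged = merge_selenium_cookies(old, own)
print(merged)
```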

cookies_dict: dict | None

Cookies in dict form. Mostly for requests module usage.

This form of cookies can be generated by requests-related functions and classes, or by other cookie functions that generate a dict, etc.

An example of generating it:

import requests
from image_crawler_utils import Cookies

session = requests.Session()
# Some process that adds cookies to session
requests_cookies = session.cookies.get_dict()  # A dict
cookies = Cookies(requests_cookies)

cookies_nodriver: list[Cookie] | None

Cookies in nodriver form.

This form of cookies can be generated by nodriver-related functions and classes, etc.

An example of generating it:

import nodriver
from image_crawler_utils.utils import set_up_nodriver_browser
from image_crawler_utils import Cookies

async def nodriver_func():
    browser = await set_up_nodriver_browser()
    tab = await browser.get('https://foo.bar.com')
    # Some other process
    nodriver_cookies = await browser.cookies.get_all()
    return nodriver_cookies

nodriver_cookies = nodriver.loop().run_until_complete(nodriver_func())
cookies = Cookies(nodriver_cookies)

cookies_selenium: list[dict] | None

Cookies in selenium form.

This form of cookies can be generated by selenium-related functions and classes, etc.

An example of generating it:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from image_crawler_utils import Cookies

chrome_driver_path = '/path/to/chromedriver'
# Selenium 4 takes the driver path through a Service object
chrome_browser = webdriver.Chrome(service=Service(executable_path=chrome_driver_path))
chrome_browser.get('https://foo.bar.com')
# Some other process
selenium_cookies = chrome_browser.get_cookies()  # A list of dicts
cookies = Cookies(selenium_cookies)

cookies_string: str | None

Cookies in string form.

This form of cookies can be acquired by using Developer Mode (F12) in some browsers, etc.

An example of generating it:

from image_crawler_utils import Cookies

cookies = Cookies("your_cookies_string")

ImageInfo class

The result of a Parser is (and must be) a list of ImageInfo. The structure of the ImageInfo class is:

class image_crawler_utils.ImageInfo(url, name, info=<factory>, backup_urls=<factory>)[source]

Bases: object

A class consisting of an image URL, name, info and backup URLs.

Can be used to download images and write result to files.

Parameters:
backup_urls: Iterable[str]

When downloading from .url fails, try downloading from the URLs in the list of .backup_urls.

info: dict

A dict, containing information of the image.

  • info will not affect Downloader directly. It only works if you set the image_info_filter parameter in the Downloader class.

  • Different sites may have different info structures which are defined respectively by their Parsers.

  • ATTENTION: If you define your own info structure, please ENSURE it can be JSON-serialized (e.g. the values of the dict should be int, float, str, list, dict, etc.) in order to make it compatible with save_image_infos() and load_image_infos().

name: str

Name of the image when saved.

url: str

The URL used AT FIRST in downloading the image.
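The relationship between .url and .backup_urls can be sketched with a small stand-in dataclass. ImageInfoSketch and candidate_urls are hypothetical names used only for illustration; they mirror the fields listed above but are not the library's ImageInfo or its download logic.

```python
from dataclasses import dataclass, field


@dataclass
class ImageInfoSketch:
    """Stand-in with the same four fields as ImageInfo."""
    url: str
    name: str
    info: dict = field(default_factory=dict)
    backup_urls: list = field(default_factory=list)


def candidate_urls(image_info):
    """Yield URLs in the order a downloader would try them."""
    yield image_info.url  # tried first
    yield from image_info.backup_urls  # tried only if the first one fails


image_info = ImageInfoSketch(
    url="https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
    name="Danbooru 4994142 cd91f0000b9574bf142d125a1e886e5c.png",
    backup_urls=["https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png"],
)
for url in candidate_urls(image_info):
    print(url)
```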

Save and Load the List of ImageInfo Class

The list of ImageInfo class can be saved with image_crawler_utils.save_image_infos() and loaded with image_crawler_utils.load_image_infos():

image_crawler_utils.save_image_infos(image_info_list, json_file, encoding='UTF-8', display_progress=True, log=<image_crawler_utils.log.Log object>)[source]

Save the ImageInfo list into a JSON file.

ONLY WORKS IF the info can be JSON serialized.

Parameters:
Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None

image_crawler_utils.load_image_infos(json_file, encoding='UTF-8', display_progress=True, log=<image_crawler_utils.log.Log object>)[source]

Load the ImageInfo list from a JSON file.

ONLY WORKS IF the info can be JSON serialized.

Parameters:
  • json_file (str) – Name / Path of the JSON file.

  • encoding (str) – Encoding of the JSON file.

  • display_progress (bool) – Display a rich progress bar when running. Progress bar will be hidden after finishing.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

List of ImageInfo, or None if failed.

Return type:

list[ImageInfo] | None
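Because saving and loading only work when info can be JSON-serialized, it can help to check a custom info dict before building ImageInfo objects from it. The helper below is a hypothetical sketch using the standard json module; it does not call save_image_infos() and makes no assumption about the file layout that function writes.

```python
import json


def is_json_serializable(value):
    """Return True if json.dumps() accepts the value."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


good_info = {"id": 4994142, "tags": ["1girl", "solo"], "parent_id": None}
bad_info = {"id": 4994142, "tags": {"1girl", "solo"}}  # a set cannot be serialized

print(is_json_serializable(good_info))
print(is_json_serializable(bad_info))

# A serializable dict survives a round trip through JSON unchanged.
restored = json.loads(json.dumps(good_info))
```

Converting sets or custom objects to lists, strings or nested dicts before storing them in info avoids the failure shown for bad_info.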

Examples of ImageInfo

A JSON-converted example of an ImageInfo generated by DanbooruKeywordParser from image ID 4994142:

{
    "url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
    "name": "Danbooru 4994142 cd91f0000b9574bf142d125a1e886e5c.png",
    "info": {
        "info": {
            "id": 4994142,
            "created_at": "2021-12-21T08:02:13.706-05:00",
            "uploader_id": 772564,
            "score": 10,
            "source": "https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png",
            "md5": "cd91f0000b9574bf142d125a1e886e5c",
            "last_comment_bumped_at": null,
            "rating": "s",
            "image_width": 2000,
            "image_height": 2828,
            "tag_string": "1girl absurdres animal_ears black_eyes black_hair coat grabbing_own_breast hair_ornament hairband highres holding holding_mask japanese_clothes kuon_(utawarerumono) long_hair looking_at_viewer mask ponytail shirokuro_neko_(ouma_haruka) smile solo utawarerumono utawarerumono:_itsuwari_no_kamen",
            "fav_count": 10,
            "file_ext": "png",
            "last_noted_at": null,
            "parent_id": null,
            "has_children": false,
            "approver_id": null,
            "tag_count_general": 17,
            "tag_count_artist": 1,
            "tag_count_character": 1,
            "tag_count_copyright": 2,
            "file_size": 4527472,
            "up_score": 10,
            "down_score": 0,
            "is_pending": false,
            "is_flagged": false,
            "is_deleted": false,
            "tag_count": 23,
            "updated_at": "2024-07-10T12:21:31.782-04:00",
            "is_banned": false,
            "pixiv_id": 83599609,
            "last_commented_at": null,
            "has_active_children": false,
            "bit_flags": 0,
            "tag_count_meta": 2,
            "has_large": true,
            "has_visible_children": false,
            "media_asset": {
                "id": 5056745,
                "created_at": "2021-12-21T08:02:04.132-05:00",
                "updated_at": "2023-03-02T04:43:15.608-05:00",
                "md5": "cd91f0000b9574bf142d125a1e886e5c",
                "file_ext": "png",
                "file_size": 4527472,
                "image_width": 2000,
                "image_height": 2828,
                "duration": null,
                "status": "active",
                "file_key": "nxj2jBet8",
                "is_public": true,
                "pixel_hash": "5d34bcf53ddde76fd723f29aae5ebc53",
                "variants": [
                    {
                        "type": "180x180",
                        "url": "https://cdn.donmai.us/180x180/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg",
                        "width": 127,
                        "height": 180,
                        "file_ext": "jpg"
                    },
                    {
                        "type": "360x360",
                        "url": "https://cdn.donmai.us/360x360/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg",
                        "width": 255,
                        "height": 360,
                        "file_ext": "jpg"
                    },
                    {
                        "type": "720x720",
                        "url": "https://cdn.donmai.us/720x720/cd/91/cd91f0000b9574bf142d125a1e886e5c.webp",
                        "width": 509,
                        "height": 720,
                        "file_ext": "webp"
                    },
                    {
                        "type": "sample",
                        "url": "https://cdn.donmai.us/sample/cd/91/sample-cd91f0000b9574bf142d125a1e886e5c.jpg",
                        "width": 850,
                        "height": 1202,
                        "file_ext": "jpg"
                    },
                    {
                        "type": "original",
                        "url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
                        "width": 2000,
                        "height": 2828,
                        "file_ext": "png"
                    }
                ]
            },
            "tag_string_general": "1girl animal_ears black_eyes black_hair coat grabbing_own_breast hair_ornament hairband holding holding_mask japanese_clothes long_hair looking_at_viewer mask ponytail smile solo",
            "tag_string_character": "kuon_(utawarerumono)",
            "tag_string_copyright": "utawarerumono utawarerumono:_itsuwari_no_kamen",
            "tag_string_artist": "shirokuro_neko_(ouma_haruka)",
            "tag_string_meta": "absurdres highres",
            "file_url": "https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
            "large_file_url": "https://cdn.donmai.us/sample/cd/91/sample-cd91f0000b9574bf142d125a1e886e5c.jpg",
            "preview_file_url": "https://cdn.donmai.us/180x180/cd/91/cd91f0000b9574bf142d125a1e886e5c.jpg"
        },
        "family_group": null,
        "tags": [
            "1girl",
            "absurdres",
            "animal_ears",
            "black_eyes",
            "black_hair",
            "coat",
            "grabbing_own_breast",
            "hair_ornament",
            "hairband",
            "highres",
            "holding",
            "holding_mask",
            "japanese_clothes",
            "kuon_(utawarerumono)",
            "long_hair",
            "looking_at_viewer",
            "mask",
            "ponytail",
            "shirokuro_neko_(ouma_haruka)",
            "smile",
            "solo",
            "utawarerumono",
            "utawarerumono:_itsuwari_no_kamen"
        ],
        "tags_class": {
            "1girl": "general",
            "animal_ears": "general",
            "black_eyes": "general",
            "black_hair": "general",
            "coat": "general",
            "grabbing_own_breast": "general",
            "hair_ornament": "general",
            "hairband": "general",
            "holding": "general",
            "holding_mask": "general",
            "japanese_clothes": "general",
            "long_hair": "general",
            "looking_at_viewer": "general",
            "mask": "general",
            "ponytail": "general",
            "smile": "general",
            "solo": "general",
            "kuon_(utawarerumono)": "character",
            "utawarerumono": "copyright",
            "utawarerumono:_itsuwari_no_kamen": "copyright",
            "shirokuro_neko_(ouma_haruka)": "artist",
            "absurdres": "meta",
            "highres": "meta"
        }
    },
    "backup_urls": [
        "https://i.pximg.net/img-original/img/2020/08/11/12/41/43/83599609_p0.png"
    ]
}

If you want to get the tags of this image (assuming image_info is its ImageInfo class), use image_info.info["tags"], not image_info["info"]["tags"] or image_info.info.tags.
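The difference between these access patterns can be tried out on a stand-in object. ImageInfoSketch below is a hypothetical dataclass with the same fields as ImageInfo, used here so the snippet runs without the library; the errors shown are those raised by this stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class ImageInfoSketch:
    """Stand-in with the same four fields as ImageInfo."""
    url: str
    name: str
    info: dict = field(default_factory=dict)
    backup_urls: list = field(default_factory=list)


image_info = ImageInfoSketch(
    url="https://cdn.donmai.us/original/cd/91/cd91f0000b9574bf142d125a1e886e5c.png",
    name="Danbooru 4994142 cd91f0000b9574bf142d125a1e886e5c.png",
    info={"tags": ["1girl", "solo"]},
)

# .info is an attribute of the object, and its value is a plain dict.
print(image_info.info["tags"])

try:
    image_info["info"]["tags"]  # the object itself is not a dict
except TypeError as error:
    print("TypeError:", error)

try:
    image_info.info.tags  # a dict has keys, not attributes
except AttributeError as error:
    print("AttributeError:", error)
```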