image_crawler_utils.stations.twitter package

class image_crawler_utils.stations.twitter.TwitterKeywordMediaParser(station_url='https://x.com/', crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, standard_keyword_string=None, keyword_string=None, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''), twitter_search_settings=TwitterSearchSettings(from_users=None, to_users=None, mentioned_users=None, including_replies=True, only_replies=False, including_links=True, only_links=False, including_media=True, only_media=False, min_reply_num=None, min_favorite_num=None, min_retweet_num=None, starting_date='', ending_date=''), reload_times=1, error_retry_delay=200, headless=True)[source]

Bases: KeywordParser

Keyword Parser for Twitter. Fetches all media images from the search results for the given keywords.

Parameters:

crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.
station_url (str) –
The URL of the main page of a website.
- This parameter matters when several websites share the same structure. For example, https://yande.re/ and https://konachan.com/ are both built with Moebooru, so this parameter must be set to tell these sites apart.
- For websites like https://www.pixiv.net/, whose structure no other website uses, this parameter is already initialized and does not need to be set.
standard_keyword_string (str) – Query keyword string using standard syntax. Refer to the documentation for detailed instructions.
keyword_string (str, None) –
If you want to directly specify the keywords used in searching, set keyword_string to a custom non-empty string. It will OVERRIDE standard_keyword_string.
- For example, setting keyword_string to "kuon_(utawarerumono) rating:safe" in DanbooruKeywordParser searches Danbooru directly with this string; its standard keyword string equivalent is "kuon_(utawarerumono) AND rating:safe".
cookies (image_crawler_utils.Cookies, str, dict, list, None) – Cookies containing login information.
twitter_search_settings (image_crawler_utils.stations.twitter.TwitterSearchSettings) – A TwitterSearchSettings instance that contains extra search options.
reload_times (int) – Number of times to reload the page. May help when some statuses (tweets) are not detected.
error_retry_delay (float) – When Twitter / X returns an error, the Parser retries after error_retry_delay seconds.
headless (bool) – Do not display a browser window when a browser is started. Setting it to False shows the browser windows.
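
The override rule between keyword_string and standard_keyword_string can be illustrated with a minimal sketch. The helper resolve_keyword_string is hypothetical and is not part of the package; it assumes that only the AND operator appears in the standard keyword string, whereas the real standard syntax is described in the package documentation.

```python
def resolve_keyword_string(standard_keyword_string=None, keyword_string=None):
    """Hypothetical helper: choose the string that is sent to the site."""
    # A non-empty keyword_string takes priority over standard_keyword_string.
    if keyword_string:
        return keyword_string
    if not standard_keyword_string:
        raise ValueError("Either keyword string must be provided.")
    # Assumption: only the AND operator is handled, and it maps to a space.
    terms = [term.strip() for term in standard_keyword_string.split(" AND ")]
    return " ".join(term for term in terms if term)
```

With these assumptions, the Danbooru example above resolves to the same string whether it is given in standard form or passed directly as keyword_string.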

generate_keyword_string()[source]

Return type:: str

get_status()[source]

Return type:: list[TwitterStatus]

parse_images_from_status()[source]

Return type:: list[ImageInfo]

run()[source]

The main function that runs the Parser and returns a list of image_crawler_utils.ImageInfo.

Return type:: list[ImageInfo]

class image_crawler_utils.stations.twitter.TwitterSearchSettings(from_users=None, to_users=None, mentioned_users=None, including_replies=True, only_replies=False, including_links=True, only_links=False, including_media=True, only_media=False, min_reply_num=None, min_favorite_num=None, min_retweet_num=None, starting_date='', ending_date='')[source]

Bases: object

TwitterSearchSettings controls advanced search settings. It appends a string to the keyword string according to the settings in this class.

Parameters:

from_users (list[str] | str | None)
to_users (list[str] | str | None)
mentioned_users (list[str] | str | None)
including_replies (bool)
only_replies (bool)
including_links (bool)
only_links (bool)
including_media (bool)
only_media (bool)
min_reply_num (int | None)
min_favorite_num (int | None)
min_retweet_num (int | None)
starting_date (str)
ending_date (str)

build_search_appending_str(keyword_string)[source]

Builds the search suffix that is appended to the keyword string.

Parameters:: keyword_string (str) – the constructed keyword string for Twitter.
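
As a rough illustration of what such a suffix might look like, the sketch below maps a few of the settings onto Twitter / X advanced-search operators (from:, filter:, min_faves:, since:, until:). The class SearchOptions and the function build_suffix are hypothetical stand-ins, and the exact operators emitted by build_search_appending_str are an assumption, not taken from the package source.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class SearchOptions:
    """Hypothetical stand-in for a subset of TwitterSearchSettings."""
    from_users: Union[List[str], str, None] = None
    including_replies: bool = True
    only_media: bool = False
    min_favorite_num: Optional[int] = None
    starting_date: str = ""
    ending_date: str = ""


def build_suffix(options: SearchOptions) -> str:
    """Assemble advanced-search operators from the options (assumed mapping)."""
    parts = []
    if options.from_users:
        users = options.from_users
        if isinstance(users, str):
            users = [users]
        # Several users are OR-ed together inside parentheses.
        parts.append("(" + " OR ".join(f"from:{user}" for user in users) + ")")
    if not options.including_replies:
        parts.append("-filter:replies")
    if options.only_media:
        parts.append("filter:media")
    if options.min_favorite_num is not None:
        parts.append(f"min_faves:{options.min_favorite_num}")
    if options.starting_date:
        parts.append(f"since:{options.starting_date}")
    if options.ending_date:
        parts.append(f"until:{options.ending_date}")
    return " ".join(parts)
```

The resulting string would be placed after the keywords, separated by a space, so that the search is restricted by every operator at once.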

ending_date: str = '': Tweets before this date. Must be “YYYY-MM-DD”, “YYYY.MM.DD” or “YYYY/MM/DD” format.

from_users: list[str] | str | None = None: Select tweets sent by a certain user / a certain list of users.

including_links: bool = True: Include tweets that contain at least one link.

including_media: bool = True: Include tweets that contain at least one media item.

including_replies: bool = True: Include reply tweets.

mentioned_users: list[str] | str | None = None: Select tweets that mention a certain user / a certain list of users.

min_favorite_num: int | None = None: Include only tweets with more than min_favorite_num favorites.

min_reply_num: int | None = None: Include only tweets with more than min_reply_num replies.

min_retweet_num: int | None = None: Include only tweets with more than min_retweet_num retweets.

only_links: bool = False: Include only tweets that contain at least one link. Works only if including_links is set to True (default).

only_media: bool = False: Include only tweets that contain at least one media item. Works only if including_media is set to True (default).

only_replies: bool = False: Include only reply tweets. Works only if including_replies is set to True (default).

starting_date: str = '': Tweets after this date. Must be “YYYY-MM-DD”, “YYYY.MM.DD” or “YYYY/MM/DD” format.

to_users: list[str] | str | None = None: Select tweets replying to a certain user / a certain list of users.
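
The three date formats accepted by starting_date and ending_date can be checked before a settings object is built. The function parse_search_date below is a hypothetical helper written for illustration; it is not part of the package and only shows one way to validate such strings with the standard library.

```python
from datetime import date, datetime


def parse_search_date(text: str) -> date:
    """Hypothetical helper: accept YYYY-MM-DD, YYYY.MM.DD or YYYY/MM/DD."""
    for fmt in ("%Y-%m-%d", "%Y.%m.%d", "%Y/%m/%d"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            # Try the next supported format.
            continue
    raise ValueError(f"Unsupported date format: {text!r}")
```

Calling isoformat() on the result gives a single canonical form, which can be convenient when comparing a starting date with an ending date.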

class image_crawler_utils.stations.twitter.TwitterStatus(status_url=None, status_id=None, user_id=None, user_name=None, time=None, reply_num=0, retweet_num=0, like_num=0, view_num=None, text=None, hashtags=<factory>, links=<factory>, media_list=<factory>)[source]

Bases: object

Holds the data of a tweet (Twitter / X status).

Parameters:

status_url (str | None)
status_id (str | None)
user_id (str | None)
user_name (str | None)
time (str | None)
reply_num (int)
retweet_num (int)
like_num (int)
view_num (int | None)
text (str | None)
hashtags (Iterable[str])
links (Iterable[str])
media_list (Iterable[TwitterStatusMedia])

hashtags: Iterable[str]

like_num: int = 0

links: Iterable[str]

media_list: Iterable[TwitterStatusMedia]

reply_num: int = 0

retweet_num: int = 0

status_id: str | None = None

status_url: str | None = None

text: str | None = None

time: str | None = None

user_id: str | None = None

user_name: str | None = None

view_num: int | None = None
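
The <factory> marker in the class signature is how a dataclass field declared with a default factory is usually rendered, which suggests that hashtags, links and media_list each get a fresh container per instance. The sketch below shows the mechanism with a hypothetical class StatusSketch; using list as the factory is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StatusSketch:
    """Hypothetical, reduced stand-in for TwitterStatus."""
    status_id: Optional[str] = None
    like_num: int = 0
    # default_factory builds a new list for every instance.
    hashtags: List[str] = field(default_factory=list)


first = StatusSketch(status_id="1")
second = StatusSketch(status_id="2")
first.hashtags.append("art")  # does not affect second.hashtags
```

Because each instance owns its own list, appending to one status never leaks into another, which a shared mutable default would not guarantee.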

class image_crawler_utils.stations.twitter.TwitterStatusMedia(link: str | None = None, image_source: str | None = None, image_id: str | None = None, image_name: str | None = None)[source]

Bases: object

Parameters:

link (str | None)
image_source (str | None)
image_id (str | None)
image_name (str | None)

image_id: str | None = None

image_name: str | None = None

image_source: str | None = None

link: str | None = None

class image_crawler_utils.stations.twitter.TwitterUserMediaParser(user_id, station_url='https://x.com/', crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, cookies=Cookies(cookies_nodriver=None, cookies_selenium=[], cookies_dict={}, cookies_string=''), reload_times=1, error_retry_delay=200, interval_days=180, starting_date=None, ending_date=None, exit_when_empty=False, headless=True)[source]

Bases: Parser

Parameters:

user_id (str)
station_url (str)
crawler_settings (CrawlerSettings)
cookies (Cookies | list | dict | str | None)
reload_times (int)
error_retry_delay (float)
interval_days (int)
starting_date (str | None)
ending_date (str | None)
exit_when_empty (bool)
headless (bool)

generate_search_settings()[source]

Return type:: list[TwitterSearchSettings]
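
Since generate_search_settings returns a list of settings and the parser takes interval_days, starting_date and ending_date, one plausible reading is that the date range is split into windows of interval_days each. This is an assumption, not something the source states. The function split_date_range is a hypothetical sketch of that idea.

```python
from datetime import date, timedelta
from typing import List, Tuple


def split_date_range(start: date, end: date, interval_days: int) -> List[Tuple[date, date]]:
    """Hypothetical helper: cut [start, end) into consecutive windows."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive.")
    windows = []
    cursor = start
    while cursor < end:
        # The last window is shortened so that it never passes the end date.
        upper = min(cursor + timedelta(days=interval_days), end)
        windows.append((cursor, upper))
        cursor = upper
    return windows
```

Each (start, end) pair could then be turned into one search restricted to that period.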

get_status_from_urls()[source]

Return type:: list[TwitterStatus]

parse_images_from_status()[source]

Return type:: list[ImageInfo]

run()[source]

The main function that runs the Parser and returns a list of image_crawler_utils.ImageInfo.

Return type:: list[ImageInfo]

async image_crawler_utils.stations.twitter.find_twitter_status(tab, log=<image_crawler_utils.log.Log object>)[source]

Finds all Twitter / X statuses on the current search result page.

Parameters:

tab (nodriver.Tab) – Nodriver tab with the search result page loaded.
log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

A list of image_crawler_utils.stations.twitter.TwitterStatus objects.

Return type:

list[TwitterStatus]

image_crawler_utils.stations.twitter.get_twitter_cookies(twitter_account=None, user_id=None, password=None, proxies=None, timeout=30.0, headless=False, waiting_seconds=60.0, log=<image_crawler_utils.log.Log object>)[source]

Manually get cookies by logging in to Twitter / X.

Parameters:

twitter_account (str, None) – Your Twitter / X mail address. Leave it as None to enter it manually.
user_id (str, None) – Your Twitter / X user ID (@user_id). Twitter / X sometimes requires it to confirm your login. Leave it as None to enter it manually.
password (str, None) – Your Twitter / X password. Leave it as None to enter it manually.
proxies (dict, None) –
The proxies used in nodriver browser.
- The pattern should be in a requests-acceptable form like:
  - HTTP type: {'http': '127.0.0.1:7890'}
  - HTTPS type: {'https': '127.0.0.1:7890'}, or {'https': '127.0.0.1:7890', 'http': '127.0.0.1:7890'}
  - SOCKS type: {'https': 'socks5://127.0.0.1:7890'}
timeout (float, None) – Timeout in seconds when waiting for elements. Default is 30.
headless (bool, None) – Use headless mode. Default is False.
waiting_seconds (float, None) – In headless mode, an error is raised if the next step cannot be loaded within waiting_seconds seconds. Default is 60.
log (image_crawler_utils.log.Log, None) – Logging config.
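
A requests-style proxies dictionary has to be translated before a Chromium-based browser can use it. The sketch below shows one possible translation into the Chromium --proxy-server command-line switch. The function proxy_server_argument is hypothetical, and whether the package performs the conversion this way is an assumption.

```python
from typing import Dict, Optional


def proxy_server_argument(proxies: Optional[Dict[str, str]]) -> Optional[str]:
    """Hypothetical helper: turn a requests-style dict into a browser switch."""
    if not proxies:
        return None
    # Prefer the HTTPS entry, because the site is served over HTTPS.
    address = proxies.get("https") or proxies.get("http")
    if address is None:
        return None
    # Entries without a scheme are treated as plain HTTP proxies.
    if "://" not in address:
        address = "http://" + address
    return f"--proxy-server={address}"
```

A SOCKS entry such as {'https': 'socks5://127.0.0.1:7890'} keeps its scheme, while a bare host and port is prefixed with http://.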

Returns:

An image_crawler_utils.Cookies object.

Return type:

Cookies | None

image_crawler_utils.stations.twitter.parse_twitter_status_element(status_html, log=<image_crawler_utils.log.Log object>)[source]

Parse Twitter / X status element from search result page: “<article …></article>”.

Parameters:

status_html (str) – HTML string of status element “<article …></article>”.
log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

An image_crawler_utils.stations.twitter.TwitterStatus object.

Return type:

TwitterStatus | None
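
To give an idea of what parsing an <article> element involves, the sketch below pulls a status path and image URLs out of a simplified HTML string using only the standard library. The class ArticleSketchParser and the sample markup are hypothetical; real Twitter / X pages are far more complex, and the package's own parser may work differently.

```python
import re
from html.parser import HTMLParser


class ArticleSketchParser(HTMLParser):
    """Hypothetical parser for a simplified status <article> element."""

    def __init__(self):
        super().__init__()
        self.status_path = None
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and self.status_path is None:
            href = attributes.get("href") or ""
            # Status links have the form /<user>/status/<numeric id>.
            if re.fullmatch(r"/[^/]+/status/\d+", href):
                self.status_path = href
        elif tag == "img":
            src = attributes.get("src") or ""
            # Assumption: tweet images are served from pbs.twimg.com/media/.
            if "pbs.twimg.com/media/" in src:
                self.image_urls.append(src)


sample_html = (
    '<article><a href="/alice/status/123"><time></time></a>'
    '<img src="https://pbs.twimg.com/media/abc.jpg"></article>'
)
parser = ArticleSketchParser()
parser.feed(sample_html)
```

After feeding the sample, parser.status_path holds the relative status link and parser.image_urls holds the media URLs found inside the element.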

async image_crawler_utils.stations.twitter.scrolling_to_find_status(tab, tab_url, crawler_settings=<image_crawler_utils.classes.crawler_settings.CrawlerSettings object>, reload_times=1, error_retry_delay=200, image_num_restriction=None, progress_group=None, transient=False)[source]

Scrolls through the current search result page to find all Twitter / X statuses.

Parameters:

crawler_settings (image_crawler_utils.CrawlerSettings) – The CrawlerSettings used in this Parser.
tab (nodriver.Tab) – nodriver.Tab with the search result page loaded.
reload_times (int) – To deal with (possibly) missing statuses, reload the page reload_times times to collect results.
error_retry_delay (float) – When an error happens (especially when Twitter / X returns an error), sleep for error_retry_delay seconds before reloading.
progress_group (image_crawler_utils.progress_bar.ProgressGroup) – The Group in which the progress bars are displayed.
transient (bool) – Hide progress bars after finishing.
tab_url (str)
image_num_restriction (int | None)

Returns:

A list of image_crawler_utils.stations.twitter.TwitterStatus objects, sorted by status in descending order.

Return type:

list[TwitterStatus]
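
Reloading the page several times means the same status can be seen more than once, so the results of each pass need to be merged. The sketch below shows one way to do this with plain dictionaries. The function merge_status_passes is hypothetical, and it assumes that the descending order refers to the numeric status ID.

```python
from typing import Dict, Iterable, List


def merge_status_passes(passes: Iterable[List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Hypothetical helper: merge statuses from several page loads."""
    merged = {}
    for statuses in passes:
        for status in statuses:
            # Keep the first copy of each status and ignore later duplicates.
            merged.setdefault(status["status_id"], status)
    # Assumption: larger numeric IDs come first.
    return sorted(merged.values(), key=lambda s: int(s["status_id"]), reverse=True)
```

A status missed in one pass but found in another still ends up in the merged list exactly once.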

async image_crawler_utils.stations.twitter.twitter_empty_check(tab)[source]

Checks whether the search result is empty.

Parameters:

tab (nodriver.Tab) – Nodriver tab with the search result page loaded.

Returns:

True if an empty-result element is found, otherwise False.

Return type:

str | None

async image_crawler_utils.stations.twitter.twitter_error_check(tab)[source]

Checks whether an error occurred while loading the Twitter / X page.

Parameters:: tab (nodriver.Tab) – Nodriver tab with the search result page loaded.
Returns:: True if an error element is found, otherwise False.
Return type:: str | None
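
An error check like this is typically paired with a delayed retry, as the error_retry_delay parameter of the parsers suggests. The sketch below shows such a loop with asyncio. The function load_with_retry, its callables and the max_attempts limit are hypothetical and are not part of the package.

```python
import asyncio


async def load_with_retry(load_page, has_error, error_retry_delay=200.0, max_attempts=3):
    """Hypothetical helper: reload until the page no longer reports an error."""
    for attempt in range(1, max_attempts + 1):
        await load_page()
        if not await has_error():
            return attempt
        if attempt < max_attempts:
            # Wait before the next attempt, as error_retry_delay describes.
            await asyncio.sleep(error_retry_delay)
    raise RuntimeError("Page still reports an error after all attempts.")


async def _demo():
    outcomes = iter([True, False])  # first load fails, second succeeds

    async def load_page():
        return None

    async def has_error():
        return next(outcomes)

    return await load_with_retry(load_page, has_error, error_retry_delay=0.0)


attempts_used = asyncio.run(_demo())
```

In the demo the delay is set to zero so that it finishes immediately; with the default of 200 seconds, each failed load would be followed by a long pause.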