Useful Classes and Functions
Most tools for building your custom image crawlers are provided in the submodules of image_crawler_utils. Check out the API Reference for detailed information.
Logging
Logging in Image Crawler Utils is handled by the Log class, which uses rich-style output for the console and the logging library for log files.
Logging parameters are mostly set when creating CrawlerSettings. Refer to the CrawlerSettings Class for more details. To log information in most Parsers and Downloaders, you can write:
self.crawler_settings.log.info("LOGGING INFO")
To determine which logging level should be used, check out the documentation of Log class:
- class image_crawler_utils.log.Log(log_file=None, debug_config=DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True), logging_level=10, detailed_console_log=False)[source]
Bases: object
Class provided for logging messages onto the console and into a file.
- Parameters:
log_file (str) – Output name for the logging file. NO SUFFIX APPENDED. Set to None (default) to output no file.
debug_config (image_crawler_utils.configs.DebugConfig) – Set the OUTPUT MESSAGE TO CONSOLE level. Default is to output all messages except debug-level ones.
logging_level (str, int) – Set the logging level of the LOGGING FILE. Select from: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL. Default is 10 (logging.DEBUG).
detailed_console_log (bool) – When logging info to the console, always log msg (the messages logged into files) even if output_msg exists.
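The logging_level values listed above are the standard numeric levels of Python's logging module, so the default of 10 corresponds to logging.DEBUG. The following is a minimal standard-library sketch of how such a threshold filters records. It does not use the Log class itself, and the logger name and in-memory buffer are made up for illustration.

```python
import io
import logging

# Standard numeric values behind the level names listed above.
levels = {name: getattr(logging, name)
          for name in ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")}

# Hypothetical logger writing to an in-memory buffer instead of a file.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setLevel(logging.WARNING)  # threshold: WARNING and above only

logger = logging.getLogger("level_sketch")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(handler)

logger.info("dropped: below the handler threshold")
logger.warning("kept: at the threshold")
logger.error("kept: above the threshold")

# Only the records at or above the threshold reach the buffer.
written = buffer.getvalue().splitlines()
```

With a threshold of logging.DEBUG (10), every record would be written, which is why the default produces the most detailed log file.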
- critical(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output critical messages of errors that interrupt the crawler. Usually a Python error will be raised when critical errors happen.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
- debug(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output debug messages containing detailed information about running the crawler, especially connections with websites.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
- error(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output error messages of errors that may affect the final results but do not interrupt the crawler.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
- info(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output info messages of basic information indicating the progress of the crawler.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
- logging_file_handler()[source]
Return the file handler if logging into file, or None if not.
- logging_file_path()[source]
Return the absolute path of the logging file if it exists, or None if not.
- warning(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]
Output warning messages of errors that basically do not affect the final results, mostly connection failures with the websites.
- Parameters:
msg (str) – Logging message.
output_msg (str, None) – Message to be output to console. Setting it to None outputs the string in the msg parameter.
exc_info – Please refer to the logging and rich.logging documentation.
stack_info (bool) – Please refer to the logging and rich.logging documentation.
stacklevel (int) – Please refer to the logging and rich.logging documentation.
extra (Mapping[str, object] | None) – Please refer to the logging and rich.logging documentation.
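The relationship between msg, output_msg and detailed_console_log described above can be summarized in a small stand-alone sketch. console_text is a hypothetical helper written for this illustration and is not part of image_crawler_utils; it assumes the behavior is exactly what the parameter descriptions state, and the example messages are invented.

```python
from typing import Optional


def console_text(msg: str,
                 output_msg: Optional[str] = None,
                 detailed_console_log: bool = False) -> str:
    """Return the text that would reach the console (hypothetical helper).

    The log file always receives ``msg``; only the console text varies.
    """
    if detailed_console_log or output_msg is None:
        return msg
    return output_msg


# Invented example messages: a detailed one for the file, a short one for the console.
file_msg = "Downloading https://example.com/a.png failed (HTTP 503), retry 2/5"
short_msg = "Download failed, retrying"

default_console = console_text(file_msg, short_msg)
detailed_console = console_text(file_msg, short_msg, detailed_console_log=True)
fallback_console = console_text(file_msg)
```

In this reading, output_msg is a way to keep the console terse while the log file keeps the full detail.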
Progress Bars
Image Crawler Utils currently uses the Progress Display of rich to display progress bars, with several classes such as CustomProgress designed for simple implementation.
CustomProgress Class
If you want to display only one progress bar, check out the documentation of CustomProgress:
- class image_crawler_utils.progress_bar.CustomProgress(*columns, text_only=False, has_spinner=False, spinner_name='dots', has_total=True, is_file=False, time_format='%H:%M:%S', is_compact_time_format=True, is_sub_process=False, console=None, auto_refresh=True, refresh_per_second=20, speed_estimate_period=30, transient=False, redirect_stdout=True, redirect_stderr=True, get_time=None, disable=False, expand=False)[source]
Bases: Progress
A wrapped Progress class with specific format-controlling parameters.
- If you add ProgressColumns to CustomProgress as with the normal rich.progress.Progress class, they will be placed between BarColumn & TaskProgressColumn (i.e. the progress bar and percentage) and TimeColumnLeft. These ProgressColumn classes will be omitted if text_only is set to True.
- Parameters:
text_only (bool) – If set to True, progress bars will only display descriptions. Default is False. When set to True, all columns except rich.progress.TextColumn("[progress.description]{task.description}[reset]") will be omitted!
has_spinner (bool) – If set to True, a spinner will be added to the left. Default is False.
spinner_name (str) – The type of the spinner; available names are listed at https://jsfiddle.net/sindresorhus/2eLtsbey/embedded/result/. Default is "dots".
has_total (bool) – Set to True if involved tasks have total numbers. Default is True.
is_file (bool) – Set to True if involved tasks deal with files. Default is False.
time_format (str) – A string that controls the time format. Default is "%H:%M:%S".
'%H' will be replaced with hours.
'%M' will be replaced with minutes.
'%S' will be replaced with seconds.
'%L' will be replaced with milliseconds.
is_compact_time_format (bool) – When set to True (default), the time_format will be truncated to start from '%M' when the time is less than 1 hour. For example, "%H:%M:%S.%L" will be truncated to "%M:%S.%L" when the time is less than 1 hour.
is_sub_process (bool) – Set to True if it is a subprocess of a certain ProgressGroup. Default is False.
console (Console, None) – Optional Console instance. Defaults to an internal Console instance writing to stdout.
auto_refresh (bool, None) – Enable auto refresh. If disabled, you will need to call refresh().
refresh_per_second (Optional[float], None) – Number of times per second to refresh the progress information. Defaults to 20.
speed_estimate_period (float, None) – Period (in seconds) used to calculate the speed estimate. Defaults to 30.
transient (bool, None) – Clear the progress on exit. Defaults to False.
redirect_stdout (bool, None) – Enable redirection of stdout, so print may be used. Defaults to True.
redirect_stderr (bool, None) – Enable redirection of stderr. Defaults to True.
get_time (Callable, None) – A callable that gets the current time, or None to use Console.get_time. Defaults to None.
disable (bool, None) – Disable progress display. Defaults to False.
expand (bool, None) – Expand tasks table to fit width. Defaults to False.
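To make the time_format and is_compact_time_format parameters more concrete, here is a rough stand-alone sketch of the substitution and truncation rules they describe. format_elapsed is a hypothetical helper, not the library's implementation; the zero-padding widths (two digits, three for milliseconds) are an assumption made for this illustration.

```python
def format_elapsed(seconds: float,
                   time_format: str = "%H:%M:%S",
                   is_compact_time_format: bool = True) -> str:
    """Render elapsed seconds using %H, %M, %S and %L (hypothetical helper)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)

    # Compact mode: below one hour, drop everything before '%M'.
    if is_compact_time_format and hours == 0 and "%M" in time_format:
        time_format = time_format[time_format.index("%M"):]

    # Assumed padding: two digits for %H/%M/%S, three digits for %L.
    return (time_format
            .replace("%H", f"{hours:02d}")
            .replace("%M", f"{minutes:02d}")
            .replace("%S", f"{secs:02d}")
            .replace("%L", f"{millis:03d}"))


short_run = format_elapsed(75.5, "%H:%M:%S.%L")        # under one hour, compact
long_run = format_elapsed(3725, "%H:%M:%S.%L")         # over one hour
not_compact = format_elapsed(75.5, "%H:%M:%S.%L", is_compact_time_format=False)
```

Under these assumptions the hour field only appears once the elapsed time reaches one hour, unless compact formatting is switched off.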
- classmethod get_default_columns()
- Get the default columns used for a new Progress instance:
a text column for the description (TextColumn)
the bar itself (BarColumn)
a text column showing completion percentage (TextColumn)
an estimated-time-remaining column (TimeRemainingColumn)
If the Progress instance is created without passing a columns argument, the default columns defined here will be used.
You can also create a Progress instance using custom columns before and/or after the defaults, as in this example:
progress = Progress(
    SpinnerColumn(),
    *Progress.get_default_columns(),
    "Elapsed:",
    TimeElapsedColumn(),
)
This code shows the creation of a Progress display, containing a spinner to the left, the default columns, and a labeled elapsed time column.
- Return type:
Tuple[ProgressColumn, …]
- add_task(description, start=True, total=100.0, completed=0, visible=True, **fields)
Add a new ‘task’ to the Progress display.
- Parameters:
description (str) – A description of the task.
start (bool, optional) – Start the task immediately (to calculate elapsed time). If set to False, you will need to call start manually. Defaults to True.
total (float, optional) – Number of total steps in the progress if known. Set to None to render a pulsing animation. Defaults to 100.
completed (int, optional) – Number of steps completed so far. Defaults to 0.
visible (bool, optional) – Enable display of the task. Defaults to True.
**fields (str) – Additional data fields required for rendering.
- Returns:
An ID you can use when calling update.
- Return type:
TaskID
- advance(task_id, advance=1)
Advance task by a number of steps.
- Parameters:
task_id (TaskID) – ID of task.
advance (float) – Number of steps to advance. Default is 1.
- Return type:
None
- finish_task(task, hide=True)[source]
Finish a task within the CustomProgress. Unless this CustomProgress is a preset attribute of ProgressGroup or its is_sub_process is set to True, running this function will stop the whole Progress; otherwise it will just stop the task.
- Parameters:
task (rich.progress.Task) – The Task class that is created under this CustomProgress.
hide (bool) – Set to True (default) to hide the progress bar of this task.
- get_renderable()
Get a renderable for the progress display.
- Return type:
- get_renderables()
Get a number of renderables for the progress display.
- Return type:
- make_tasks_table(tasks)
Get a table to render the Progress display.
- Parameters:
tasks (Iterable[Task]) – An iterable of Task instances, one per row of the table.
- Returns:
A table instance.
- Return type:
Table
- open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, *, total=None, task_id=None, description='Reading...')
Track progress while reading from a binary file.
- Parameters:
path (Union[str, PathLike[str]]) – The path to the file to read.
mode (str) – The mode to use to open the file. Only supports “r”, “rb” or “rt”.
buffering (int) – The buffering strategy to use, see io.open().
encoding (str, optional) – The encoding to use when reading in text mode, see io.open().
errors (str, optional) – The error handling strategy for decoding errors, see io.open().
newline (str, optional) – The strategy for handling newlines in text mode, see io.open().
total (int, optional) – Total number of bytes to read. If none given, os.stat(path).st_size is used.
task_id (TaskID) – Task to track. Default is new task.
description (str, optional) – Description of task, if new task is created.
- Returns:
A readable file-like object in binary mode.
- Return type:
BinaryIO
- Raises:
ValueError – When an invalid mode is given.
- refresh()
Refresh (render) the progress information.
- Return type:
None
- remove_task(task_id)
Delete a task if it exists.
- Parameters:
task_id (TaskID) – A task ID.
- Return type:
None
- reset(task_id, *, start=True, total=None, completed=0, visible=None, description=None, **fields)
Reset a task so completed is 0 and the clock is reset.
- Parameters:
task_id (TaskID) – ID of task.
start (bool, optional) – Start the task after reset. Defaults to True.
total (float, optional) – New total steps in task, or None to use current total. Defaults to None.
completed (int, optional) – Number of steps completed. Defaults to 0.
visible (bool, optional) – Enable display of the task. Defaults to True.
description (str, optional) – Change task description if not None. Defaults to None.
**fields (str) – Additional data fields required for rendering.
- Return type:
None
- start()
Start the progress display.
- Return type:
None
- start_task(task_id)
Start a task.
Starts a task (used when calculating elapsed time). You may need to call this manually, if you called add_task with start=False.
- Parameters:
task_id (TaskID) – ID of task.
- Return type:
None
- stop()
Stop the progress display.
- Return type:
None
- stop_task(task_id)
Stop a task.
This will freeze the elapsed time on the task.
- Parameters:
task_id (TaskID) – ID of task.
- Return type:
None
- track(sequence, total=None, completed=0, task_id=None, description='Working...', update_period=0.1)
Track progress by iterating over a sequence.
You can also track progress of an iterable, which might require that you additionally specify total.
- Parameters:
sequence (Iterable[ProgressType]) – Values you want to iterate over and track progress.
total (float, optional) – Total number of steps. Default is len(sequence).
completed (int, optional) – Number of steps completed so far. Defaults to 0.
task_id (TaskID, optional) – Task to track. Default is new task.
description (str, optional) – Description of task, if new task is created.
update_period (float, optional) – Minimum time (in seconds) between calls to update(). Defaults to 0.1.
- Returns:
An iterable of values taken from the provided sequence.
- Return type:
Iterable[ProgressType]
- update(task_id, *, total=None, completed=None, advance=None, description=None, visible=None, refresh=False, **fields)
Update information associated with a task.
- Parameters:
task_id (TaskID) – Task id (returned by add_task).
total (float, optional) – Updates task.total if not None.
completed (float, optional) – Updates task.completed if not None.
advance (float, optional) – Add a value to task.completed if not None.
description (str, optional) – Change task description if not None.
visible (bool, optional) – Set visible flag if not None.
refresh (bool) – Force a refresh of progress information. Default is False.
**fields (Any) – Additional data fields required for rendering.
- Return type:
None
- wrap_file(file, total=None, *, task_id=None, description='Reading...')
Track progress while reading from a binary file.
- Parameters:
file (BinaryIO) – A file-like object opened in binary mode.
total (int, optional) – Total number of bytes to read. This must be provided unless a task with a total is also given.
task_id (TaskID) – Task to track. Default is new task.
description (str, optional) – Description of task, if new task is created.
- Returns:
A readable file-like object in binary mode.
- Return type:
BinaryIO
- Raises:
ValueError – When no total value can be extracted from the arguments or the task.
- property finished: bool
Check if all tasks have been completed.
- property task_ids: List[TaskID]
A list of task IDs.
Examples of CustomProgress
An example of using CustomProgress:
from image_crawler_utils.progress_bar import CustomProgress

# Create a Progress instance with a spinner; it vanishes after finishing
with CustomProgress(has_spinner=True, transient=True) as progress:
    # Set the task for the current progress bar
    task_1 = progress.add_task(description="foo", total=100)
    # Multiple tasks can be created for one Progress instance;
    # each is displayed as a separate progress bar
    task_2 = progress.add_task(description="bar", total=50)

    try:
        for i in range(50):
            '''Do something'''
            # Advance task_1 by 2 steps and change its description
            progress.update(task_1, advance=2, description="foo_new")
            # Advance task_2 by 1 step (the default of advance())
            progress.advance(task_2)
    except Exception:
        # If an error happens and you need to finish the progress immediately,
        # use CustomProgress.finish_task()
        progress.finish_task(task_1)
        progress.finish_task(task_2)

# If you do not wish to use the ``with`` structure,
# you can switch to the code below:
progress = CustomProgress()
progress.start()
'''Do something like above'''
progress.stop()
ProgressGroup Class
If you want to display multiple progress bars at once, it is recommended to use ProgressGroup.
Warning
DO NOT try to start another Progress instance (like CustomProgress) while a progress bar instance is already running! An error will be raised.
- class image_crawler_utils.progress_bar.ProgressGroup(progress_list=[], has_panel=True, panel_title=None, panel_subtitle=None, refresh_per_second=10)[source]
Bases: object
A group of Progress instances, which simplifies building multiple progress bars.
- Parameters:
progress_list (list[rich.progress.Progress]) – An iterable list of rich.progress.Progress instances which will be added to the ProgressGroup when created. Default is [] (an empty list).
has_panel (bool) – When set to True (default), a rich.panel.Panel will be wrapped around all of the progress bars.
panel_title (str) – When set to a str, the title will be displayed at the top middle of the panel. Works only if has_panel is set to True.
panel_subtitle (str) – When set to a str, the subtitle will be displayed at the bottom middle of the panel. Works only if has_panel is set to True.
refresh_per_second (int) – Refresh the progress bars refresh_per_second times per second. Default is 10.
- start()[source]
Start the ProgressGroup. That is, start the ProgressGroup().live attribute.
- Return type:
None
- stop()[source]
Stop the ProgressGroup. That is, stop the ProgressGroup().live attribute.
- Return type:
None
The ProgressGroup provides several preset Progress instances, which will be displayed from top to bottom in the order below:
| Attribute | transient | has_total | is_file | text_only |
|---|---|---|---|---|
|  | False | True | True | False |
|  | False | True | False | False |
|  | False | False | True | False |
|  | False | False | False | False |
|  | False | / | / | True |
|  | True | True | True | False |
|  | True | True | False | False |
|  | True | False | True | False |
|  | True | False | False | False |
|  | True | / | / | True |
You can add your own Progress instances in the progress_list, which will be displayed below all of the preset Progress instances.
As one Progress instance can have multiple tasks (i.e. multiple progress bars), it is suggested to use the preset Progress instances first to create your progress bars.
Examples of ProgressGroup
An example of using ProgressGroup:
from image_crawler_utils.progress_bar import ProgressGroup, CustomProgress

# Use a custom progress bar which vanishes after finishing
custom_progress = CustomProgress(transient=True)

# Create a progress group
with ProgressGroup(
    # Add the custom Progress instance into the progress group
    progress_list=[custom_progress],
    # Set the panel title
    panel_title="Downloading",
) as progress_group:
    # Use the preset .main_count_bar
    main_progress = progress_group.main_count_bar
    task_main = main_progress.add_task(total=10, description="Foo")
    for i in range(10):
        # The task_sub progress bar will be displayed
        # below the task_main progress bar
        task_sub = progress_group.progress_list[0].add_task(total=5, description="Bar")
        for j in range(5):
            '''Do something'''
            # Advance the task_sub progress bar by 1 step
            progress_group.progress_list[0].advance(task_sub)
        # Advance the task_main progress bar by 1 step
        main_progress.advance(task_main)

# If you do not wish to use the ``with`` structure,
# you can switch to the code below:
progress_group = ProgressGroup()
progress_group.start()
'''Do something like above'''
progress_group.stop()
User Agents
Currently, Image Crawler Utils uses ua-generator for generating random user agents. Check out its documentation for more details.
Other Tools
The image_crawler_utils.utils module provides a wide variety of functions and classes for different uses.
- class image_crawler_utils.utils.Empty[source]
Bases: object
An empty placeholder class, mainly for checking if a parameter is used.
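A placeholder class of this kind is commonly used as a sentinel default, so that a function can tell "argument omitted" apart from an explicit None. The sketch below defines its own local stand-in named Empty rather than importing the library class, and describe_timeout is a hypothetical function; how image_crawler_utils itself uses Empty is not shown here.

```python
class Empty:
    """Local stand-in for a placeholder class (illustration only)."""


def describe_timeout(timeout=Empty):
    """Distinguish 'argument omitted' from an explicit ``None``."""
    if timeout is Empty:
        return "not provided, fall back to a default"
    if timeout is None:
        return "explicitly disabled"
    return f"set to {timeout} seconds"


omitted = describe_timeout()       # no argument passed at all
disabled = describe_timeout(None)  # None passed on purpose
custom = describe_timeout(5)       # a real value
```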
- image_crawler_utils.utils.attempt_name_len()[source]
Try to calculate the length of names.
- Returns:
The length of the shortened file name. If the terminal size is available, the result will be \(\left\lfloor\frac{\text{terminal\_size} - 10}{5}\right\rfloor\). Otherwise, the result will be 10.
- Return type:
int
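The formula for attempt_name_len() can be worked through with a few concrete widths. name_len_for_width is a hypothetical helper that only mirrors the documented formula; it is not the library function, and it assumes that "terminal size" means the width in columns.

```python
import shutil
from typing import Optional


def name_len_for_width(terminal_width: Optional[int]) -> int:
    """Apply the documented formula; ``None`` means the width is unavailable."""
    if terminal_width is None:
        return 10
    return (terminal_width - 10) // 5


# Width of the current terminal as reported by the standard library.
current_width = shutil.get_terminal_size().columns
current_len = name_len_for_width(current_width)

standard = name_len_for_width(80)    # (80 - 10) // 5
wide = name_len_for_width(120)       # (120 - 10) // 5
unknown = name_len_for_width(None)   # documented fallback value
```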
- image_crawler_utils.utils.check_dir(dir_path, log=<image_crawler_utils.log.Log object>)[source]
This function checks whether a directory exists and tries to create it if it does not. A logging message will be printed to the console on success, and a critical error will be raised on failure.
- Parameters:
dir_path (str) – Directory path.
log (image_crawler_utils.log.Log, None) – Logging config.
- Return type:
None
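The check-then-create behavior described for check_dir can be approximated with the standard library alone. The following is a sketch under stated assumptions, not the library's code: ensure_dir and the logger name are hypothetical, and a plain logging.Logger stands in for the Log class.

```python
import logging
import os
import tempfile


def ensure_dir(dir_path: str, logger: logging.Logger) -> None:
    """Create ``dir_path`` if missing (stdlib sketch, not the library code)."""
    if os.path.isdir(dir_path):
        return
    try:
        os.makedirs(dir_path, exist_ok=True)
        logger.info("Created directory: %s", dir_path)
    except OSError as error:
        # Report the failure at critical level, then re-raise it.
        logger.critical("Cannot create directory %s: %s", dir_path, error)
        raise


sketch_logger = logging.getLogger("check_dir_sketch")
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "images", "2024")
    ensure_dir(target, sketch_logger)
    created = os.path.isdir(target)
    ensure_dir(target, sketch_logger)  # second call finds the directory and returns
```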
- image_crawler_utils.utils.load_dataclass(dataclass_to_load, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]
Load the file containing a dataclass into the dataclass_to_load parameter. The dataclass should be the same as the one you once saved.
- Parameters:
dataclass_to_load (A dataclass) – The dataclass to be loaded.
file_name (str) – Name of the file.
file_type (str, Optional) – If no suffix is found in file_name, designate the file type manually. Setting the file_type parameter to json or pkl forces the function to treat the file as this type, while leaving this parameter blank causes the function to determine the file type according to file_name. That is, load_dataclass(dataclass, 'foo.json') works the same as load_dataclass(dataclass, 'foo.json', 'json').
encoding (str) – Encoding of JSON file. Only works when loading from .json.
log (image_crawler_utils.log.Log, None) – Logging config.
- Returns:
Loaded dataclass_to_load, or None if failed.
- Return type:
- image_crawler_utils.utils.save_dataclass(dataclass, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]
Save the dataclass parameter into a file.
- Parameters:
dataclass (A dataclass) – The dataclass to be saved.
file_name (str) – Name of file. Suffix (.json / .pkl) is optional.
file_type (str, Optional) – If no suffix is found in file_name, designate the file type manually. Setting the file_type parameter to json or pkl forces the function to save the dataclass (config) as this type, while leaving this parameter blank causes the function to determine the file type according to file_name. That is, save_dataclass(dataclass, 'foo.json') works the same as save_dataclass(dataclass, 'foo', 'json'). .json is suggested when your dataclasses (configs) do not include objects that cannot be JSON-serialized (e.g. a function), while the serialized data file .pkl supports most data types but the saved file is not human-readable.
encoding (str) – Encoding of JSON file. Only works when saving as .json.
log (image_crawler_utils.log.Log, None) – Logging config. You can use log=crawler_settings.log to make it the same as the CrawlerSettings you set up.
- Returns:
(Saved file name, Absolute path of the saved file), or None if failed.
- Return type:
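The file-type rule shared by save_dataclass and load_dataclass (an explicit file_type wins, otherwise the suffix of file_name decides) can be sketched with the standard library. resolve_file_type and DemoConfig are hypothetical names used only for this illustration, and the JSON round trip below uses json and dataclasses directly rather than the library functions.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional


def resolve_file_type(file_name: str, file_type: Optional[str] = None) -> str:
    """Pick 'json' or 'pkl' (hypothetical helper for this sketch)."""
    if file_type is not None:
        return file_type.lower().lstrip(".")
    suffix = os.path.splitext(file_name)[1].lower().lstrip(".")
    if suffix in ("json", "pkl"):
        return suffix
    raise ValueError(f"Cannot determine file type of {file_name!r}")


@dataclass
class DemoConfig:  # hypothetical dataclass, not a library config
    thread_num: int = 4
    timeout: float = 10.0


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "demo_config.json")
    # Save: dataclass -> dict -> JSON text.
    with open(path, "w", encoding="UTF-8") as f:
        json.dump(asdict(DemoConfig(thread_num=8)), f, indent=4)
    # Load: JSON text -> dict -> the same dataclass type.
    with open(path, "r", encoding="UTF-8") as f:
        restored = DemoConfig(**json.load(f))
```

The round trip works here because every field is JSON-serializable; a dataclass holding a function or similar object would need the .pkl route instead.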
- async image_crawler_utils.utils.set_up_nodriver_browser(proxies=None, headless=True, window_width=None, window_height=None, no_image_stylesheet=False)[source]
Set up a nodriver browser in a more convenient way.
WARNING: nodriver uses async functions. This function is async as well!
- Parameters:
proxies (dict, Callable, None) – The proxies used in the nodriver browser. The pattern should be in a requests-acceptable form like:
HTTP type: {'http': '127.0.0.1:7890'}
HTTPS type: {'https': '127.0.0.1:7890'}, or {'https': '127.0.0.1:7890', 'http': '127.0.0.1:7890'}
SOCKS type: {'https': 'socks5://127.0.0.1:7890'}
headless (bool) – Do not display the browser window when a browser is started. Setting it to False will pop up browser windows.
window_width (int, None) – Width of the browser window. Setting it to None will maximize the window. Setting headless to True causes this parameter to be ignored.
window_height (int, None) – Height of the browser window when displayed. Setting it to None will maximize the window. Setting headless to True causes this parameter to be ignored.
no_image_stylesheet (bool) – Do not load any images or stylesheets when loading webpages in this browser. Setting this parameter to True can reduce traffic and speed up page loading.
- Returns:
The nodriver.
- Return type:
- image_crawler_utils.utils.shorten_file_name(file_name, name_len=10)[source]
Shorten file name for displaying on console.
- image_crawler_utils.utils.silent_deconstruct_browser(log=<image_crawler_utils.log.Log object>)[source]
I have had enough with nodriver’s annoying removing temp file messages. This function will do the same thing without those spamming messages.
- Parameters:
log (image_crawler_utils.log.Log) – Displaying those spamming messages.
- image_crawler_utils.utils.suppress_print()[source]
Suppress the built-in print so that it does not output anything.
An example:
with suppress_print():
    print("This message will not be displayed")
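For comparison, a similar effect can be obtained with the standard library's contextlib.redirect_stdout. The sketch below is not how suppress_print is implemented (its source is not shown here); quiet_stdout is a hypothetical name, and the outer buffer exists only so the result can be inspected.

```python
import contextlib
import io
import os


@contextlib.contextmanager
def quiet_stdout():
    """Discard everything written to stdout inside the block (stdlib sketch)."""
    with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
        yield


# Capture stdout in a buffer to observe which lines get through.
outer = io.StringIO()
with contextlib.redirect_stdout(outer):
    print("before")
    with quiet_stdout():
        print("this line is discarded")
    print("after")

captured = outer.getvalue()
```

Output resumes as soon as the inner block exits, because redirect_stdout restores the previous stream.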