Useful Classes and Functions

Most tools for building your custom image crawlers are provided in the submodules of image_crawler_utils. Check out the API Reference for detailed information.

Logging

Logging in Image Crawler Utils is handled by the Log class, which uses rich-style output for console messages and the logging library for writing to files.

Logging parameters are mostly set when creating CrawlerSettings; refer to the CrawlerSettings Class for more details. To log information in most Parsers and Downloaders, you can write:

self.crawler_settings.log.info("LOGGING INFO")

To determine which logging level should be used, check out the documentation of the Log class:

class image_crawler_utils.log.Log(log_file=None, debug_config=DebugConfig(show_debug=False, show_info=True, show_warning=True, show_error=True, show_critical=True), logging_level=10, detailed_console_log=False)[source]

Bases: object

Class provided for logging messages onto the console and into the file.

Parameters:
  • log_file (str) – Output name for the logging file. NO SUFFIX APPENDED. Set to None (default) to output no file.

  • debug_config (image_crawler_utils.configs.DebugConfig) – Controls which message levels are OUTPUT TO CONSOLE. With the default shown in the signature above, every level except debug is displayed.

  • logging_level (str, int) – Set the logging level of the LOGGING FILE. Select from: logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR and logging.CRITICAL. Default is 10 (logging.DEBUG).

  • detailed_console_log (bool) – When logging to the console, always log msg (the message logged into files) even if output_msg exists.

critical(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output critical messages for errors that interrupt the crawler. A Python error is usually raised when a critical error happens.

debug(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output debug messages containing detailed information about the running crawler, especially its connections with websites.

error(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output error messages for errors that may affect the final results but do not interrupt the crawler.

info(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output info messages with basic information indicating the progress of the crawler.

logging_file_handler()[source]

Return the file handler if logging into file, or None if not.

logging_file_path()[source]

Return the absolute path of the logging file if it exists, or None if not.

warning(msg, output_msg=None, exc_info=None, stack_info=False, stacklevel=1, extra=None)[source]

Output warning messages for errors that generally do not affect the final results, mostly connection failures with websites.
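
The two thresholds described above (debug_config for the console, logging_level for the logging file) can be pictured with a small standard-library sketch. It does not use the Log class itself: ConsoleFlags is a hypothetical stand-in that mirrors the default DebugConfig values shown in the signature, and goes_to_console / goes_to_file are illustrative helper names that are not part of image_crawler_utils.

```python
import logging
from dataclasses import dataclass


@dataclass
class ConsoleFlags:
    """Hypothetical stand-in mirroring the DebugConfig defaults in the signature."""
    show_debug: bool = False
    show_info: bool = True
    show_warning: bool = True
    show_error: bool = True
    show_critical: bool = True


def goes_to_console(level_name: str, flags: ConsoleFlags) -> bool:
    # Console output is decided per level by an on/off flag.
    return getattr(flags, f"show_{level_name}")


def goes_to_file(level_name: str, logging_level: int) -> bool:
    # File output uses the numeric threshold of the logging library:
    # DEBUG=10, INFO=20, WARNING=30, ERROR=40, CRITICAL=50.
    return getattr(logging, level_name.upper()) >= logging_level


for name in ("debug", "info", "warning", "error", "critical"):
    print(
        name,
        goes_to_console(name, ConsoleFlags()),
        goes_to_file(name, logging.WARNING),
    )
```

Under these assumptions, a debug message is hidden on the console by default but still written to the file when logging_level is 10.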


Progress Bars

Image Crawler Utils currently uses the Progress Display of rich to display progress bars, with several classes like CustomProgress designed for simple implementation.

CustomProgress Class

If you want to display only one progress bar, check out the documentation of CustomProgress:

class image_crawler_utils.progress_bar.CustomProgress(*columns, text_only=False, has_spinner=False, spinner_name='dots', has_total=True, is_file=False, time_format='%H:%M:%S', is_compact_time_format=True, is_sub_process=False, console=None, auto_refresh=True, refresh_per_second=20, speed_estimate_period=30, transient=False, redirect_stdout=True, redirect_stderr=True, get_time=None, disable=False, expand=False)[source]

Bases: Progress

A wrapped Progress class with specific format-controlling parameters.

If you add ProgressColumns to CustomProgress as with a normal rich.progress.Progress class, they will be placed between BarColumn & TaskProgressColumn (i.e. the progress bar and percentage) and TimeColumnLeft.
  • These ProgressColumns will be omitted if text_only is set to True.

Parameters:
  • text_only (bool) –

    If set to True, Progress bars will only display descriptions. Default is False.

    • When set to True, all other columns except rich.progress.TextColumn("[progress.description]{task.description}[reset]") will be omitted!

  • has_spinner (bool) – If set to True, a spinner will be added to the left. Default is False.

  • spinner_name (str) – The type of the spinner; the available names are shown at https://jsfiddle.net/sindresorhus/2eLtsbey/embedded/result/. Default is "dots".

  • has_total (bool) – Set to True if involved tasks have total numbers. Default is True, as shown in the signature above.

  • is_file (bool) – Set to True if involved tasks deal with files. Default is False.

  • time_format (str) –

    A string that controls the time format. Default is "%H:%M:%S".

    • '%H' will be replaced with hours.

    • '%M' will be replaced with minutes.

    • '%S' will be replaced with seconds.

    • '%L' will be replaced with milliseconds.

  • is_compact_time_format (bool) –

    When set to True (default), the time_format will be truncated to start from ‘%M’ when time is lower than 1 hour.

    • For example, “%H:%M:%S.%L” will be truncated to “%M:%S.%L” when time is lower than 1 hour.

  • is_sub_process (bool) – Set to True if it is a subprocess of a certain ProgressGroup. Default is False.

  • console (Console, None) – Optional Console instance. Defaults to an internal Console instance writing to stdout.

  • auto_refresh (bool, None) – Enable auto refresh. If disabled, you will need to call refresh().

  • refresh_per_second (float, None) – Number of times per second to refresh the progress information. Defaults to 20, as shown in the signature above.

  • speed_estimate_period (float, None) – Period (in seconds) used to calculate the speed estimate. Defaults to 30.

  • transient (bool, None) – Clear the progress on exit. Defaults to False.

  • redirect_stdout (bool, None) – Enable redirection of stdout, so print may be used. Defaults to True.

  • redirect_stderr (bool, None) – Enable redirection of stderr. Defaults to True.

  • get_time (Callable, None) – A callable that gets the current time, or None to use Console.get_time. Defaults to None.

  • disable (bool, None) – Disable progress display. Defaults to False.

  • expand (bool, None) – Expand tasks table to fit width. Defaults to False.
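
The time_format and is_compact_time_format parameters above can be illustrated with a minimal sketch. This is not the code used by CustomProgress: format_elapsed is a hypothetical helper, and the zero-padding widths (two digits, three for milliseconds) are an assumption made for the illustration.

```python
def format_elapsed(seconds: float,
                   time_format: str = "%H:%M:%S",
                   compact: bool = True) -> str:
    """Render a duration with %H, %M, %S and %L placeholders (illustrative only)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)

    # Compact mode: below one hour, drop everything before '%M'.
    if compact and hours == 0 and "%M" in time_format:
        time_format = time_format[time_format.index("%M"):]

    return (
        time_format.replace("%H", f"{hours:02d}")
        .replace("%M", f"{minutes:02d}")
        .replace("%S", f"{secs:02d}")
        .replace("%L", f"{millis:03d}")
    )


print(format_elapsed(3725.5, "%H:%M:%S.%L"))   # one hour or more: full format
print(format_elapsed(125.25, "%H:%M:%S.%L"))   # under one hour: starts from minutes
```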

classmethod get_default_columns()
Get the default columns used for a new Progress instance:
  • a text column for the description (TextColumn)

  • the bar itself (BarColumn)

  • a text column showing completion percentage (TextColumn)

  • an estimated-time-remaining column (TimeRemainingColumn)

If the Progress instance is created without passing a columns argument, the default columns defined here will be used.

You can also create a Progress instance using custom columns before and/or after the defaults, as in this example:

from rich.progress import Progress, SpinnerColumn, TimeElapsedColumn

progress = Progress(
    SpinnerColumn(), *Progress.get_default_columns(), "Elapsed:", TimeElapsedColumn(),
)

This code shows the creation of a Progress display, containing a spinner to the left, the default columns, and a labeled elapsed time column.

Return type:

Tuple[ProgressColumn, …]

add_task(description, start=True, total=100.0, completed=0, visible=True, **fields)

Add a new ‘task’ to the Progress display.

Parameters:
  • description (str) – A description of the task.

  • start (bool, optional) – Start the task immediately (to calculate elapsed time). If set to False, you will need to call start manually. Defaults to True.

  • total (float, optional) – Number of total steps in the progress if known. Set to None to render a pulsing animation. Defaults to 100.

  • completed (int, optional) – Number of steps completed so far. Defaults to 0.

  • visible (bool, optional) – Enable display of the task. Defaults to True.

  • **fields (str) – Additional data fields required for rendering.

Returns:

An ID you can use when calling update.

Return type:

TaskID

advance(task_id, advance=1)

Advance task by a number of steps.

Parameters:
  • task_id (TaskID) – ID of task.

  • advance (float) – Number of steps to advance. Default is 1.

Return type:

None

finish_task(task, hide=True)[source]

Finish a task within the CustomProgress. Unless this CustomProgress is a preset attribute of ProgressGroup or its is_sub_process is set to True, running this function will stop the whole Progress; otherwise it will just stop the task.

Parameters:
  • task (rich.progress.Task) – The Task class that is created under this CustomProgress.

  • hide (bool) – Set to True (default) to hide the progress bar of this task.

get_renderable()

Get a renderable for the progress display.

Return type:

ConsoleRenderable | RichCast | str

get_renderables()

Get a number of renderables for the progress display.

Return type:

Iterable[ConsoleRenderable | RichCast | str]

make_tasks_table(tasks)

Get a table to render the Progress display.

Parameters:

tasks (Iterable[Task]) – An iterable of Task instances, one per row of the table.

Returns:

A table instance.

Return type:

Table

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, *, total=None, task_id=None, description='Reading...')

Track progress while reading from a binary file.

Parameters:
  • file (Union[str, PathLike[str], bytes]) – The path to the file to read.

  • mode (str) – The mode to use to open the file. Only supports "r", "rb" or "rt".

  • buffering (int) – The buffering strategy to use, see io.open().

  • encoding (str, optional) – The encoding to use when reading in text mode, see io.open().

  • errors (str, optional) – The error handling strategy for decoding errors, see io.open().

  • newline (str, optional) – The strategy for handling newlines in text mode, see io.open().

  • total (int, optional) – Total number of bytes to read. If none given, os.stat(file).st_size is used.

  • task_id (TaskID) – Task to track. Default is new task.

  • description (str, optional) – Description of task, if new task is created.

Returns:

A readable file-like object in binary mode.

Return type:

BinaryIO

Raises:

ValueError – When an invalid mode is given.

refresh()

Refresh (render) the progress information.

Return type:

None

remove_task(task_id)

Delete a task if it exists.

Parameters:

task_id (TaskID) – A task ID.

Return type:

None

reset(task_id, *, start=True, total=None, completed=0, visible=None, description=None, **fields)

Reset a task so completed is 0 and the clock is reset.

Parameters:
  • task_id (TaskID) – ID of task.

  • start (bool, optional) – Start the task after reset. Defaults to True.

  • total (float, optional) – New total steps in task, or None to use current total. Defaults to None.

  • completed (int, optional) – Number of steps completed. Defaults to 0.

  • visible (bool, optional) – Enable display of the task. Defaults to True.

  • description (str, optional) – Change task description if not None. Defaults to None.

  • **fields (str) – Additional data fields required for rendering.

Return type:

None

start()

Start the progress display.

Return type:

None

start_task(task_id)

Start a task.

Starts a task (used when calculating elapsed time). You may need to call this manually, if you called add_task with start=False.

Parameters:

task_id (TaskID) – ID of task.

Return type:

None

stop()

Stop the progress display.

Return type:

None

stop_task(task_id)

Stop a task.

This will freeze the elapsed time on the task.

Parameters:

task_id (TaskID) – ID of task.

Return type:

None

track(sequence, total=None, completed=0, task_id=None, description='Working...', update_period=0.1)

Track progress by iterating over a sequence.

You can also track progress of an iterable, which might require that you additionally specify total.

Parameters:
  • sequence (Iterable[ProgressType]) – Values you want to iterate over and track progress.

  • total (float, optional) – Total number of steps. Default is len(sequence).

  • completed (int, optional) – Number of steps completed so far. Defaults to 0.

  • task_id (TaskID, optional) – Task to track. Default is new task.

  • description (str, optional) – Description of task, if new task is created.

  • update_period (float, optional) – Minimum time (in seconds) between calls to update(). Defaults to 0.1.

Returns:

An iterable of values taken from the provided sequence.

Return type:

Iterable[ProgressType]

update(task_id, *, total=None, completed=None, advance=None, description=None, visible=None, refresh=False, **fields)

Update information associated with a task.

Parameters:
  • task_id (TaskID) – Task id (returned by add_task).

  • total (float, optional) – Updates task.total if not None.

  • completed (float, optional) – Updates task.completed if not None.

  • advance (float, optional) – Add a value to task.completed if not None.

  • description (str, optional) – Change task description if not None.

  • visible (bool, optional) – Set visible flag if not None.

  • refresh (bool) – Force a refresh of progress information. Default is False.

  • **fields (Any) – Additional data fields required for rendering.

Return type:

None

wrap_file(file, total=None, *, task_id=None, description='Reading...')

Track progress while reading from a binary file.

Parameters:
  • file (BinaryIO) – A file-like object opened in binary mode.

  • total (int, optional) – Total number of bytes to read. This must be provided unless a task with a total is also given.

  • task_id (TaskID) – Task to track. Default is new task.

  • description (str, optional) – Description of task, if new task is created.

Returns:

A readable file-like object in binary mode.

Return type:

BinaryIO

Raises:

ValueError – When no total value can be extracted from the arguments or the task.

property finished: bool

Check if all tasks have been completed.

property task_ids: List[TaskID]

A list of task IDs.

property tasks: List[Task]

Get a list of Task instances.

Examples of CustomProgress

An example of using CustomProgress:

from image_crawler_utils.progress_bar import CustomProgress

# Create a Progress instance with spinner which vanishes after finishing
with CustomProgress(has_spinner=True, transient=True) as progress:
    # Set the task for current progress bar
    task_1 = progress.add_task(description="foo", total=100)
    # Multiple tasks can be created for one Progress instance,
    # which will be displayed as a different progress bar
    task_2 = progress.add_task(description="bar", total=50)

    try:
        for i in range(50):
            '''Do something'''
            # Advance task_1 by 2 steps and change its description
            progress.update(task_1, advance=2, description="foo_new")
            progress.update(task_2, advance=1)  # Advance task_2 by 1 step

    except Exception:
        # If an error happens and you need to finish the progress immediately,
        # use CustomProgress.finish_task()
        progress.finish_task(task_1)
        progress.finish_task(task_2)

# If you do not wish to use ``with`` structure,
# you can switch to the code below:
progress = CustomProgress()
progress.start()
'''Do something like above'''
progress.stop()

ProgressGroup Class

If you want to display multiple progress bars at once, it is recommended to use ProgressGroup.

Warning

DO NOT try to start another Progress instance (like CustomProgress) while a progress bar instance is already running! An error will be raised.

class image_crawler_utils.progress_bar.ProgressGroup(progress_list=[], has_panel=True, panel_title=None, panel_subtitle=None, refresh_per_second=10)[source]

Bases: object

A Group of Progress, which can simplify building multiple Progress bars.

Parameters:
  • progress_list (list[rich.progress.Progress]) – A list of rich.progress.Progress instances which will be added to the ProgressGroup when created. Default is [] (an empty list).

  • has_panel (bool) – When set to True (default), a rich.panel.Panel will be wrapped around all of the progress bars.

  • panel_title (str) –

    When set to a str, the title will be displayed at the top middle of the panel.

    • Works only if has_panel is set to True.

  • panel_subtitle (str) –

    When set to a str, the subtitle will be displayed at the bottom middle of the panel.

    • Works only if has_panel is set to True.

  • refresh_per_second (int) – Number of times the progress bars are refreshed per second. Default is 10.

start()[source]

Start the ProgressGroup. That is, start the ProgressGroup().live attribute.

Return type:

None

stop()[source]

Stop the ProgressGroup. That is, stop the ProgressGroup().live attribute.

Return type:

None

The ProgressGroup provides several preset Progress instances, which will be displayed from top to bottom in the order below:

Attribute                  transient  has_total  is_file  text_only
.main_file_bar             False      True       True     False
.main_count_bar            False      True       False    False
.main_no_total_file_bar    False      False      True     False
.main_no_total_count_bar   False      False      False    False
.main_text_only_bar        False      /          /        True
.sub_file_bar              True       True       True     False
.sub_count_bar             True       True       False    False
.sub_no_total_file_bar     True       False      True     False
.sub_no_total_count_bar    True       False      False    False
.sub_text_only_bar         True       /          /        True

You can add your own Progress instances in the progress_list, which will be displayed below all of the preset Progress instances.

As one Progress instance can have multiple tasks (i.e. multiple progress bars), it is suggested to use the preset Progress instances first to create your progress bars.
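
The naming pattern of the preset attributes listed above is regular enough to be expressed as a small lookup. The helper below is a hypothetical convenience (preset_bar_name is not part of image_crawler_utils); it only builds the attribute name as a string, which could then be passed to getattr() on a ProgressGroup instance.

```python
def preset_bar_name(sub: bool = False,
                    has_total: bool = True,
                    is_file: bool = False,
                    text_only: bool = False) -> str:
    """Build the name of a preset ProgressGroup attribute (illustrative helper)."""
    prefix = "sub" if sub else "main"
    if text_only:
        # has_total and is_file do not apply to text-only bars.
        return f"{prefix}_text_only_bar"
    middle = "" if has_total else "no_total_"
    kind = "file" if is_file else "count"
    return f"{prefix}_{middle}{kind}_bar"


print(preset_bar_name())                                        # main_count_bar
print(preset_bar_name(sub=True, has_total=False, is_file=True)) # sub_no_total_file_bar
```

The "sub" presets are the ones whose transient value is True in the list above, so they are the natural choice for short-lived inner tasks.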

Examples of ProgressGroup

An example of using ProgressGroup:

from image_crawler_utils.progress_bar import ProgressGroup, CustomProgress

# Use a custom progress bar which vanishes after finishing
custom_progress = CustomProgress(transient=True)

# Create a progress group
with ProgressGroup(
    # Add the custom Progress instance into the progress group
    progress_list=[custom_progress],
    # Set the panel title
    panel_title="Downloading",
) as progress_group:
    # Use the preset .main_count_bar
    main_progress = progress_group.main_count_bar
    task_main = main_progress.add_task(total=10, description="Foo")

    for i in range(10):
        # The task_sub progress bar will be displayed
        # below the task_main progress bar
        task_sub = progress_group.progress_list[0].add_task(total=5, description="Bar")

        for j in range(5):
            '''Do something'''
            # Advance the task_sub progress bar by 1
            progress_group.progress_list[0].update(task_sub, advance=1)

        # Advance the task_main progress bar by 1
        progress_group.main_count_bar.update(task_main, advance=1)

# If you do not wish to use ``with`` structure,
# you can switch to the code below:
progress_group = ProgressGroup()
progress_group.start()
'''Do something like above'''
progress_group.stop()

User Agents

Currently, Image Crawler Utils uses ua-generator for generating random user agents. Check out its documentation for more details.

Other Tools

The image_crawler_utils.utils module provides a wide variety of functions and classes for different uses.

class image_crawler_utils.utils.Empty[source]

Bases: object

An empty placeholder class, mainly for checking if a parameter is used.

image_crawler_utils.utils.attempt_name_len()[source]

Try to calculate the length of names.

Returns:

The length of the shortened file name. If the terminal size is available, the result will be \(\left\lfloor\frac{\text{terminal_size} - 10}{5}\right\rfloor\). Otherwise, the result will be 10.

Return type:

int
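
The formula above can be sketched with the standard library alone. This is only an illustration of the arithmetic: name_len_for and current_name_len are hypothetical names, and it is an assumption that terminal_size refers to the number of columns of the terminal.

```python
import os
from typing import Optional


def name_len_for(columns: Optional[int]) -> int:
    """Apply floor((terminal_size - 10) / 5), or fall back to 10 (illustrative)."""
    if columns is None:
        return 10
    return (columns - 10) // 5


def current_name_len() -> int:
    # os.get_terminal_size() raises OSError when stdout is not a terminal,
    # which is treated here as "terminal size not available".
    try:
        columns = os.get_terminal_size().columns
    except OSError:
        columns = None
    return name_len_for(columns)


print(name_len_for(80))    # 80-column terminal
print(name_len_for(None))  # no terminal available
```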

image_crawler_utils.utils.check_dir(dir_path, log=<image_crawler_utils.log.Log object>)[source]

This function checks whether a directory exists and tries to create it if it does not. A logging message is printed to the console on success, and a critical error is raised on failure.

Return type:

None
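
The check-then-create behaviour described above can be approximated with the standard library. The sketch below is not the implementation of check_dir: ensure_dir is a hypothetical name, it returns a flag instead of logging, and it lets os.makedirs raise OSError on failure rather than emitting a critical log message.

```python
import os
import tempfile


def ensure_dir(dir_path: str) -> bool:
    """Create dir_path if missing; return True if it was created (illustrative)."""
    if os.path.isdir(dir_path):
        return False
    # Creates intermediate directories as needed; raises OSError on failure.
    os.makedirs(dir_path)
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "images", "2024")
    print(ensure_dir(target))  # directory did not exist yet
    print(ensure_dir(target))  # directory already exists
```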

image_crawler_utils.utils.load_dataclass(dataclass_to_load, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Load the file containing a dataclass into the dataclass_to_load parameter.

The dataclass should be the same as the one you once saved.

Parameters:
  • dataclass_to_load (A dataclass) – The dataclass to be loaded.

  • file_name (str) – Name of the file.

  • file_type (str, Optional) –

    If suffix not found in file_name, designate file type manually.

    • Setting the file_type parameter to json or pkl forces the function to treat the file as that type; leaving it blank makes the function determine the file type from file_name.

      • That is, load_dataclass(dataclass, 'foo.json') works the same as load_dataclass(dataclass, 'foo.json', 'json').

  • encoding (str) – Encoding of JSON file. Only works when loading from .json.

  • log (image_crawler_utils.log.Log, None) – Logging config.

Returns:

Loaded dataclass_to_load, or None if failed.

Return type:

Any

image_crawler_utils.utils.save_dataclass(dataclass, file_name, file_type=None, encoding='UTF-8', log=<image_crawler_utils.log.Log object>)[source]

Save the dataclass parameter into a file.

Parameters:
  • dataclass (A dataclass) – The dataclass to be saved.

  • file_name (str) – Name of file. Suffix (.json / .pkl) is optional.

  • file_type (str, Optional) –

    If suffix not found in file_name, designate file type manually.

    • Setting the file_type parameter to json or pkl forces the function to save the dataclass (config) as that type; leaving it blank makes the function determine the file type from file_name.

      • That is, save_dataclass(dataclass, 'foo.json') works the same as save_dataclass(dataclass, 'foo', 'json').

      • .json is suggested when your dataclasses (configs) contain only JSON-serializable objects (a function, for example, is not). The serialized .pkl format supports most data types, but the saved file is not human-readable.

  • encoding (str) – Encoding of JSON file. Only works when saving as .json.

  • log (image_crawler_utils.log.Log, None) –

    Logging config.

    • You can use log=crawler_settings.log to make it the same as the CrawlerSettings you set up.

Returns:

(Saved file name, Absolute path of the saved file), or None if failed.

Return type:

tuple[str, str] | None
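
The general idea behind saving and loading a dataclass as JSON, and behind choosing the file type from either file_type or the suffix of file_name, can be sketched as follows. This is a standard-library illustration under stated assumptions, not the code of save_dataclass or load_dataclass: DemoConfig, resolve_file_type, save_as_json and load_from_json are all hypothetical names, and nested or non-serializable fields are not handled.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DemoConfig:
    """Hypothetical config used only for this illustration."""
    thread_num: int = 4
    image_dir: str = "images"


def resolve_file_type(file_name: str, file_type: Optional[str] = None) -> str:
    # An explicit file_type wins; otherwise fall back to the file suffix.
    if file_type is not None:
        return file_type.lower().lstrip(".")
    suffix = os.path.splitext(file_name)[1].lower().lstrip(".")
    if suffix not in ("json", "pkl"):
        raise ValueError(f"Cannot determine file type of {file_name!r}")
    return suffix


def save_as_json(obj, path: str, encoding: str = "UTF-8") -> None:
    with open(path, "w", encoding=encoding) as f:
        json.dump(asdict(obj), f, indent=4)


def load_from_json(cls, path: str, encoding: str = "UTF-8"):
    with open(path, "r", encoding=encoding) as f:
        return cls(**json.load(f))


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_config.json")
    save_as_json(DemoConfig(thread_num=8), path)
    restored = load_from_json(DemoConfig, path)
    print(restored)
```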

async image_crawler_utils.utils.set_up_nodriver_browser(proxies=None, headless=True, window_width=None, window_height=None, no_image_stylesheet=False)[source]

Set up a nodriver browser in a more convenient way.

WARNING: nodriver uses async functions. This function is async as well!

Parameters:
  • proxies (dict, Callable, None) –

    The proxies used in nodriver browser.

    • The pattern should be in a requests-acceptable form like:

      • HTTP type: {'http': '127.0.0.1:7890'}

      • HTTPS type: {'https': '127.0.0.1:7890'}, or {'https': '127.0.0.1:7890', 'http': '127.0.0.1:7890'}

      • SOCKS type: {'https': 'socks5://127.0.0.1:7890'}

  • headless (bool) – Do not display the browser window when a browser is started. Setting it to False will pop up browser windows.

  • window_width (int, None) –

    Width of the browser window. Setting it to None maximizes the window.

    • This parameter is ignored when headless is set to True.

  • window_height (int, None) –

    Height of the browser window when displayed. Setting it to None maximizes the window.

    • This parameter is ignored when headless is set to True.

  • no_image_stylesheet (bool) –

    Do not load any images or stylesheet when loading webpages in this browser.

    • Setting this parameter to True can reduce the traffic when loading pages and speed up loading.

Returns:

The nodriver Browser instance.

Return type:

Browser

image_crawler_utils.utils.shorten_file_name(file_name, name_len=10)[source]

Shorten file name for displaying on console.

Parameters:
  • file_name (str) – Name of the file.

  • name_len (int, None) – Maximum length allowed. The signature above shows a default of 10; setting it to None uses IMAGE_NAME_LEN.

Returns:

Shortened file name.

Return type:

str

image_crawler_utils.utils.silent_deconstruct_browser(log=<image_crawler_utils.log.Log object>)[source]

I have had enough with nodriver’s annoying removing temp file messages. This function will do the same thing without those spamming messages.

Parameters:

log (image_crawler_utils.log.Log) – Displaying those spamming messages.

image_crawler_utils.utils.suppress_print()[source]

Suppress the built-in print so that it does not output anything.

An example:

from image_crawler_utils.utils import suppress_print

with suppress_print():
    print("This will not be displayed")  # suppressed print()
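
One common way to obtain this kind of behaviour with the standard library is to redirect stdout to the null device inside a context manager. The sketch below shows that technique only; quiet_print is a hypothetical name, and it is not claimed to be how suppress_print is implemented.

```python
import contextlib
import io
import os


@contextlib.contextmanager
def quiet_print():
    """Discard everything written to stdout inside the block (illustrative)."""
    with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
        yield


# Capture stdout in a buffer to demonstrate which lines get through.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    with quiet_print():
        print("hidden")
    print("shown")

captured = buffer.getvalue()
print(repr(captured))
```

Note that this approach only affects output written through sys.stdout; text sent to stderr is left untouched.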