Resize
The resizing feature relies on three subcomponents:
- Speaker diarization with Pyannote
- Scene change detection with PySceneDetect
- Face detection with MTCNN and MediaPipe
Together, these libraries are used to dynamically resize (crop) a video so that it stays focused on whoever is speaking at any given moment. For a detailed explanation of the algorithm, read here.
Usage
The following call returns the information needed to resize the video.
from clipsai import resize
crops = resize(
    video_file_path="/abs/path/to/video.mp4",
    pyannote_auth_token="pyannote_token",
    aspect_ratio=(9, 16),
)
print("Crops: ", crops.segments)
To resize the video using the returned crops, run the following code.
import clipsai  # needed because the names below are accessed as clipsai.<name>

media_editor = clipsai.MediaEditor()
# choose ONE of the two media_file assignments below
# use this if the file contains a video stream only
media_file = clipsai.VideoFile("/abs/path/to/video_only_file.mp4")
# use this if the file contains both audio and video streams
media_file = clipsai.AudioVideoFile("/abs/path/to/video.mp4")
resized_video_file = media_editor.resize_video(
    original_video_file=media_file,
    resized_video_file_path="/abs/path/to/resized/video.mp4",  # doesn't exist yet
    width=crops.crop_width,
    height=crops.crop_height,
    segments=crops.to_dict()["segments"],
)
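Before handing the segments to resize_video, it can help to sanity-check them. The sketch below is not part of clipsai. It assumes that the dictionary returned by crops.to_dict() uses keys named after the documented properties (original_width, original_height, crop_width, crop_height, segments, x, y, start_time, end_time), and the sample values are made up for illustration.

```python
def check_segments(crops_dict):
    """Return a list of problems found in a crops dictionary (empty if none)."""
    problems = []
    # Largest top-left coordinates that keep the crop rectangle inside the frame.
    max_x = crops_dict["original_width"] - crops_dict["crop_width"]
    max_y = crops_dict["original_height"] - crops_dict["crop_height"]
    previous_end = None
    for i, seg in enumerate(crops_dict["segments"]):
        if seg["end_time"] <= seg["start_time"]:
            problems.append(f"segment {i}: end_time is not after start_time")
        if not 0 <= seg["x"] <= max_x or not 0 <= seg["y"] <= max_y:
            problems.append(f"segment {i}: crop rectangle leaves the frame")
        if previous_end is not None and seg["start_time"] < previous_end:
            problems.append(f"segment {i}: overlaps the previous segment")
        previous_end = seg["end_time"]
    return problems


# Stand-in data shaped like the assumed output of crops.to_dict().
example = {
    "original_width": 1920,
    "original_height": 1080,
    "crop_width": 607,
    "crop_height": 1080,
    "segments": [
        {"speakers": [0], "start_time": 0.0, "end_time": 4.2, "x": 300, "y": 0},
        {"speakers": [1], "start_time": 4.2, "end_time": 9.0, "x": 1200, "y": 0},
    ],
}

print(check_segments(example))
```

With real data, the same function could be called as check_segments(crops.to_dict()), provided the key names match the assumption above.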
Resize Function
- Name
resize
- Type
- -> Crops
- Description
Dynamically resizes a video to a specified aspect ratio (default 9:16) to focus on the current speaker.
Required Parameters
- Name
- video_file_path: string
- Description
Absolute path to the video file to resize.
- Name
- pyannote_auth_token: string
- Description
Authentication token for Pyannote, obtained from HuggingFace.
Optional Parameters
- Name
- aspect_ratio: tuple[int, int] = (9, 16)
- Description
The target aspect ratio for resizing the video (width, height). Default is (9, 16).
- Name
- min_segment_duration: float = 1.5
- Description
The minimum duration in seconds for a diarized speaker segment to be considered. Default is 1.5.
- Name
- samples_per_segment: int = 13
- Description
The number of samples to take per speaker segment for face detection. Default is 13. Reduce this for faster performance (at the cost of accuracy).
- Name
- face_detect_width: int = 960
- Description
The width in pixels to which the video will be downscaled for face detection. Smaller widths speed up detection but may reduce accuracy. Default is 960.
- Name
- face_detect_margin: int = 20
- Description
Margin around detected faces, used in the MTCNN face detector. Default is 20.
- Name
- face_detect_post_process: bool = False
- Description
If set to True, post-processing is applied to the face detection output to make it appear more natural. Default is False.
- Name
- n_face_detect_batches: int = 8
- Description
Number of batches into which face detection is split when running on a GPU. Batching is vital for proper GPU memory allocation. Default is 8.
- Name
- min_scene_duration: float = 0.25
- Description
Minimum duration in seconds for a scene to be considered during scene detection. Default is 0.25.
- Name
- scene_merge_threshold: float = 0.25
- Description
Threshold in seconds for merging scene changes with speaker segments. Default is 0.25.
- Name
- time_precision: int = 6
- Description
Precision (number of decimal places) for the start and end times of the segments. Default is 6. Using fewer than 4 decimal places may cause rounding errors when cropping the video with ffmpeg.
- Name
- device: string (cuda | cpu) = None
- Description
PyTorch device to perform computations on. Default is None, which auto-detects the correct device.
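To get a feel for why time_precision matters, the arithmetic below estimates the worst-case error introduced by rounding a timestamp to a given number of decimal places, expressed in video frames. This is a generic illustration, not a description of how clipsai or ffmpeg handle timestamps internally; the function name and the 60 fps frame rate are assumptions chosen for the example.

```python
def max_rounding_error_frames(precision, fps):
    """Worst-case timestamp rounding error, in frames.

    Rounding to `precision` decimal places can shift a time by up to
    half a unit in the last kept digit, i.e. 0.5 * 10**-precision seconds.
    """
    max_error_seconds = 0.5 * 10 ** -precision
    return max_error_seconds * fps


for precision in (1, 2, 4, 6):
    error = max_rounding_error_frames(precision, fps=60)
    print(f"precision={precision}: up to {error:.6f} frames of error")
```

At one decimal place the error can span several frames, whereas at four or more it is a small fraction of a frame.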
Crops Class
Represents the resizing information for an entire video, including the video's original width and height, the video's resized width and height, and the segments of the video for focusing on the current speaker. Segments are defined over an interval of time, providing the x-y coordinate of the top left corner of a rectangle with pixel dimensions crop_width by crop_height to focus on the current speaker.
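The relationship between the original dimensions, the crop dimensions, and a segment's top-left corner can be sketched with a little arithmetic. The helpers below are hypothetical and are not clipsai functions: they assume the crop is the largest rectangle of the target aspect ratio that fits inside the frame, that sizes are rounded down to whole pixels, and that the rectangle is centered on a point of interest (such as a face center) and then clamped to stay inside the frame. The library's actual rounding and placement may differ.

```python
def crop_size(original_width, original_height, aspect_ratio=(9, 16)):
    """Largest (width, height) with the given aspect ratio that fits the frame."""
    ratio_w, ratio_h = aspect_ratio
    if original_width * ratio_h > original_height * ratio_w:
        # Source is wider than the target ratio: keep the full height.
        height = original_height
        width = original_height * ratio_w // ratio_h
    else:
        # Source is narrower (or equal): keep the full width.
        width = original_width
        height = original_width * ratio_h // ratio_w
    return width, height


def top_left(center_x, center_y, crop_w, crop_h, original_width, original_height):
    """Top-left corner of a crop centered on a point, clamped to the frame."""
    x = min(max(center_x - crop_w // 2, 0), original_width - crop_w)
    y = min(max(center_y - crop_h // 2, 0), original_height - crop_h)
    return x, y


crop_w, crop_h = crop_size(1920, 1080, aspect_ratio=(9, 16))
x, y = top_left(1700, 540, crop_w, crop_h, 1920, 1080)
print(crop_w, crop_h, x, y)
```

In this example the point of interest sits near the right edge, so the clamp pulls the rectangle back until it lies fully inside the 1920 by 1080 frame.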
Properties
- Name
crop_width
- Type
- int
- Description
The width of the resized video in number of pixels.
- Name
crop_height
- Type
- int
- Description
The height of the resized video in number of pixels.
- Name
original_width
- Type
- int
- Description
The width of the original video in number of pixels.
- Name
original_height
- Type
- int
- Description
The height of the original video in number of pixels.
- Name
segments
- Type
- List[Segment]
- Description
The list of Segments providing the crop coordinates and times.
Methods
- Name
copy
- Type
- -> Crops
- Description
Returns a copy of the Crops instance.
- Name
to_dict
- Type
- -> dict
- Description
Returns a dictionary representation of the Crops instance.
Segment Class
Segments are defined over an interval of time in the video, providing the x-y coordinate of the top left corner of a rectangle with pixel dimensions crop_width by crop_height to focus on the current speaker.
Properties
- Name
x
- Type
- int
- Description
The x coordinate of the top left corner of the crop from the original video.
- Name
y
- Type
- int
- Description
The y coordinate of the top left corner of the crop from the original video.
- Name
start_time
- Type
- float
- Description
The start time of the segment in seconds.
- Name
end_time
- Type
- float
- Description
The end time of the segment in seconds.
- Name
speakers
- Type
- List[int]
- Description
The list of speaker identifiers in this segment. Each identifier uniquely represents a speaker in the entire video.
Methods
- Name
copy
- Type
- -> Segment
- Description
Returns a copy of the Segment instance.
- Name
to_dict
- Type
- -> dict
- Description
Returns a dictionary representation of the Segment properties.
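To make the meaning of a segment concrete, the sketch below turns one segment into an ffmpeg command line that cuts the segment's time interval and crops it to the given rectangle. It relies on ffmpeg's crop filter syntax (crop=w:h:x:y) and the -ss / -to options. The helper names are hypothetical, the segment is a plain dictionary with keys assumed to match the documented properties, and this is not a description of how MediaEditor.resize_video works internally.

```python
def crop_filter(segment, crop_width, crop_height):
    """Build an ffmpeg crop filter string: crop=width:height:x:y."""
    return f"crop={crop_width}:{crop_height}:{segment['x']}:{segment['y']}"


def ffmpeg_args(input_path, output_path, segment, crop_width, crop_height):
    """Argument list that would cut and crop a single segment with ffmpeg."""
    return [
        "ffmpeg",
        "-i", input_path,
        "-ss", str(segment["start_time"]),  # start of the segment, in seconds
        "-to", str(segment["end_time"]),    # end of the segment, in seconds
        "-vf", crop_filter(segment, crop_width, crop_height),
        output_path,
    ]


# Made-up segment values for illustration.
segment = {"speakers": [0], "start_time": 0.0, "end_time": 4.2, "x": 1313, "y": 0}
args = ffmpeg_args("/abs/path/to/video.mp4", "/abs/path/to/segment.mp4",
                   segment, crop_width=607, crop_height=1080)
print(" ".join(args))
```

The list is only built and printed here; it could be passed to subprocess.run to actually invoke ffmpeg, if ffmpeg is installed.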