00:00:00.71, the man first looks at the woman before turning his gaze toward the
camera. His right hand is already on her shoulder. He then slowly moves his hand slightly to the right and
gently clenches her shoulder at 00:00:06.40 , and at 00:00:06.44 , he blinks
his eyes. Both are now looking at the camera.
The lighting is natural daylight. The video starts with a close-up shot, and the camera slightly tilts
upward."
Comment: This annotation is clear and objective, includes detailed descriptions of characters and their
appearance, positioning and movements. Has precise timestamps for actions and lighting conditions and
camera movement are well described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/337407-clip_00000000_1.mp4
"The video takes place in a bright green background with even lighting, capturing a woman dancing
throughout. The focus is on a young woman with long, wavy, light brown hair. She wears an oversized white
long-sleeve shirt, round earrings and interacts with her hair while posing. At various points in the video, she
looks up and smiles.
At 00:00:00.06 , she raises her left hand and touches her hair with her fingers. Then, at
00:00:01.23 , she raises her right hand, touches her shoulder, moves her right hand through
her hair, and returns her left hand to her side at 00:00:02.12 . She begins jumping four
times, moving both hands up and down alternately and smiling. At 00:00:02.95 , she looks
directly at the camera and flips her hair upward with both hands. At 00:00:03.77 , she looks
up and stretches her arms above shoulder level and pucker whistling, lowering them below head level at
00:00:04.65 . Her movements gently reflect her shadow on the background.
The camera remains static with a medium shot angle throughout the video. The lighting conditions are
artificial indoor lighting."
Comment: clear and detailed action breakdown with accurate timestamps and no subjectivity. Well
described lighting condition and camera movements
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/254480-clip_00000000_0.mp4
The scene is set inside a building that appears to be a doctor's office. In the background are large windows
covered by translucent blinds, through which sunlight enters and illuminates the space. Beyond the
windows, there are faint silhouettes of skyscrapers that are blurred out. In the middle of the windows is a
pillar covered in red brick tiles. There are two female subjects present in the frame, both of whom are
wearing flu masks and dark blue sanitary gloves, and have slightly tanned skin tones with dark hair.
The first subject on the left has shoulder-length straight black hair and is wearing a brown button-down
shirt. She is clutching a silver tablet device with a black case and a dark blue piece of paper in her right
arm. Throughout the video, she talks to the subject on the right while looking at her and briefly looks away
from the second subject at 00:00:04.30 until 00:00:05.96 without changing
the direction of her head, referring to a white piece of paper by signaling at it with her hand.
At 00:00:02.18 to 00:00:03.02 , the subject on the left gently nods her head.
She also shrugs her left shoulder slightly while nodding her head at 00:00:04.30 . The
second subject is wearing a pastel pink button-down shirt with a white lab coat. She has curly hair styled in
a single ponytail and is holding up two pieces of paper in one hand, looking at their contents throughout
the video while the first subject talks. Both subjects have neutral facial expressions, which can be inferred
from their eyes and eyebrows, as their faces are covered with masks.
The camera is moving towards the left side of the frame at a fixed height, capturing the subjects above the
waist. The scene is lit by the natural sunlight coming through the windows.
Comments: well structured and detailed annotation with good timestamps and camera description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/365681-clip_00000000_0.mp4
The scene is set outdoors in a forest filled with tall green trees and a pale blue twilight sky. The subject is a
young white blonde woman with blue eyes who is wearing a white sleeveless dress. The scene begins with
the woman facing forward and walking ahead, with the camera capturing her from the back at an angle
biased toward her left side.
At 00:00:00.50 , the woman begins to turn back to look over her left shoulder. At
00:00:03.12 , she has turned her body almost fully around to look back and slightly above
her eye level, wearing an anxious expression that conveys fear. At 00:00:03.86 , she begins
to turn back, and by 00:00:06.12 , she has completed turning her head forward again, with
her hair flowing in the direction of her head movement and starts to pick up the pace. As she begins
walking faster, her loose hair flows and bounces, reflecting her hurried manner of walking.
The lighting is natural, and the camera follows the woman's movement, capturing her only from the waist
up. The background is blurred with a bokeh effect.
Comment: Strong visual detail, with clear timestamps and camera movement.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/131239-clip_00000000_0.mp4
The scene is set in the afternoon in a jungle, featuring a paved path slightly covered with rocks and
surrounded by tall trees and short grass, with sunlight filtering through the foliage. The video begins with
the subject out of frame. The subject is a thin white female with dark, short, and curly hair, running for
exercise while wearing black capri pants and a light maroon long-sleeve t-shirt, along with gray sneakers
and white socks.
At the 00:00:00.58 mark, the subject enters the frame from the left side, running in the
same direction as the camera. Upon entering the frame, the subject gradually overtakes the camera, and
the distance between her and the camera increases as the video progresses.
Comment: Clear and focused description of environment and movement, timestamps well used and
camera movement well described in relation with the subject
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/253997-clip_00000000_0.mp4
The scene begins with a female subject who is in her early twenties, is thin, and has light skin and dark hair
styled in a ponytail that reaches the length of her back. She is wearing beige trousers and a black and
white Aztec diamond-patterned overshirt over a white T-shirt, along with silver drop earrings. She stands
beside a river that flows parallel to the direction of the camera. On the left side of the frame is the subject,
who is standing atop small grass-covered rock formations and tree roots, with foliage of varying heights in
the background and a tree about 8 feet behind her. The right side of the frame is dominated by the flow of
the river, which is pale blue in color and has a rock protruding from the surface. In the distance, on the
ground of the right side of the frame next to the river, there is some foliage.
As the video progresses, the foliage sways slightly due to the wind, along with the overshirt of the female
subject. The subject displays an expression of calmness and pleasure in the clip, momentarily closing her
eyes as well. At the 00:00:02.39 mark, she drops her hand, which was originally over her
head, and looks towards the camera. At 00:00:05.86 , she raises her hand to her collarbone
while tilting her head and closing her eyes. Afterwards, she lowers her hand and wraps it around her waist.
During the length of the video, her right hand remains stationary and rests beside her leg. At
00:00:05.91 , as the camera pans to the right, a second rock protruding from within the
river comes into view.
The camera is positioned at a fixed distance from the subject but occasionally pans around and changes
slightly. The scene is illuminated primarily by warm sunlight, as well as by light reflected off the water.
Comment: Highly detailed and well structured annotation, with clear timestamped actions, and accurate
camera behavior well described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/199735-clip_00000000_0.mp4
The scene takes place indoors in an office, featuring a white man in his forties with gray hair and a short
beard. He is wearing a blue, long-sleeve button-down shirt and gray pants, and he is wearing glasses. He is
seated at his gray desk, operating a desktop computer with his left hand while taking notes with his right
hand. A metal table lamp, which is lit, is positioned to the right of the monitor. In the background, a glass
wall consists of white beams serving as panels, through which another office, mostly obscured and
illuminated with neon blue light, can be seen. A white blonde woman is seated at her desk, facing to the
right of the frame.
Between 00:00:00.00 and 00:00:01.42 , he looks back and forth between
his notebook and the monitor before focusing on the computer screen. Over his right shoulder, on the glass
wall in the background, is a sketch of the front of a car, while over his left shoulder is a sketch of the top
view of the car from an angle. At 00:00:07.10 , he stops taking notes and presses a single
key on the keyboard, looking at the camera with a neutral expression.
The majority of the lighting is dim and artificial, with a blue hue, likely projecting from ceiling lights.
Additionally, the scene is illuminated by a table lamp which is projecting warm light on to the desk of the
main subject. Further, there are neon blue light fixtures in the background that are adding illumination. The
camera zooms in slowly throughout the scene.
Comment : detailed and objective description with well used timestamps, great camera and lighting
condition description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/376414-clip_00000000_0.mp4
The scene is set at sunset in a forest with dense trees in the background, featuring a patch of trees on the
left side of the frame that has a few bright yellow leaves. Small clouds of smoke linger behind the subject,
who is a bald, slender, gray-haired monk in his 50s, wearing a deep red Buddhist robe. The monk is seated
still with his eyes closed, wearing a neutral expression while meditating.
He maintains this position for the duration of the clip, while the camera gradually moves closer to him as
the clip progresses.
The camera captures only the monk from the waist up and zooms in at an upward angle. The scene is
illuminated by the warm natural sunlight from sunset through the trees and leaves in the background.
Comment: Strong environmental detail and lighting description, camera movement clearly described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/374764-clip_00000000_0.mp4
The scene is set on a street surrounded by various buildings. The off-white building on the left side of the
frame features revivalist architecture and extends into the distance, while the one on the right has modern
architecture and is also off-white in color. The subject is Black, with curly hair and a high fade haircut. He is
wearing a thin grey turtleneck sweater and a black backpack with yellow and black steel zippers, along
with a leather strap chronograph wristwatch and a face mask. In the distance, there are modern
skyscrapers primarily made of glass.
The video begins with the subject looking down at the screen of his phone, which he is holding with both
hands while typing. He is standing next to the upper part of the stairway leading to the subway below.
There is a black sign overhead the staircase indicating the name of the station, i.e., 34 St-Penn Station, in
white and blue font. The middle rail of the stairway is slightly visible on the left side of the frame.
At the 00:00:03.05 mark, a man wearing a white and blue polo shirt and a white hat starts
to walk across the bridge, on the wall of which the aforementioned sign is attached. At the
00:00:05.97 mark, a steel grey hatchback drives across the road into the distance from the
right side of the frame.
The camera is handheld at a fixed position, capturing the subject from the waist up. In terms of lighting,
the environment is naturally sunlit with overcast clouds.
Comment: Great details for the character and well timestamped actions, camera and lighting condition
are described clearly.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/378498-clip_00000000_0.mp4
The scene begins with a young Black man standing and clutching thin window bars in a dark, poorly lit
room, looking through them at the outdoor environment that appears to contain trees in the distance
beyond a large empty courtyard. He is standing in the right half of the frame. The subject is wearing a
raglan half-sleeve T-shirt with a white torso and black sleeves, along with black true wireless earphones.
He has short, curly black hair and a low fade haircut, and a neutral expression on his face as he peers
through the window bars.
At the 00:00:02.03 mark, the camera moves forward enough to crop out the background
of the interior while getting closer to the subject, who is only visible from above the shoulders, along with
his forearms and hands.
The camera is positioned to the left side of the subject. The scene is primarily illuminated by sunlight and
the white tube light in the interior.
Comment : Well balanced annotation, objective and great character description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/275926-clip_00000000_1.mp4
The video shows a person on the left side of the screen, kneeling on a wooden plank surface as he plants
tubers into the dark soil. He is wearing blue jeans, a gray knitted sweater, and black gloves. Beside his knee
is a white container holding the tubers. The soil, positioned on the right side of the screen, has been dug
into a trench, forming a small mound along the edge. One tuber is already planted at the far end.
At 00:00:01.10 , the person places a tuber into the soil. Then he moves his hand into the
container, picks another one, and plants it at 00:00:04.87 . He repeats this process,
steadily working along the dugout space. The camera starts off still, then gradually tilts upward, capturing
the scene.
Comment : Great scene description with accurate layout and it is objective.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/291922-clip_00000000_1.mp4
The video is set indoors in a studio, with a plain black background that has a round light source attached in
the middle to cast a warm light, making the subject appear as a silhouette. From what can be inferred, the
subject is wearing waist-high flare pants and a tucked-in long-sleeve shirt. The subject has short hair, and
in this scene, they are performing hand combat moves using a small knife.
At 00:00:00.00 , the subject is facing their body towards the camera while turned to their
left, wielding the knife in their right hand. They raise the knife upwards with an arm movement while
throwing a punch with their other hand until the 00:00:00.26 mark. Afterwards, they take
a neutral stance and proceed to swing the blade over and then under their hand to their right side while
executing an additional move in that direction with the blade at eye level at 00:00:02.54 .
At 00:00:03.89 , the subject adopts another neutral stance, keeping the blade and their
free hand close to their body while looking to the right. By 00:00:05.09 , they proceed to
throw another combination involving a low blow and another strike at eye level while facing their right
side. They repeat the combination one more time, this time having their left arm resting against the side of
their waist by 00:00:06.85 . The video concludes with the subject initiating the
aforementioned combination again in the same direction, which they start for the last time at
00:00:07.50 , this time having their left arm raised to their head in a defensive position.
The camera remains static, with the subject placed in the middle of the frame the entire time. The scene is
lit using a single warm elliptical light source in the background.
Comment : Well detailed and structured annotation with good timestamped actions and an objective
tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/285583-clip_00000000_1.mp4
A woman in her twenties with a fair skin tone and golden brown hair tied behind her head swims toward
the foreground right of the frame wearing a black scuba diving suit and yellow gloves. The oxygen
cylinder behind her back has a pale yellow shade. She holds a small metallic silver rectangular object
between her left index finger and thumb. The flaps on her feet are also black. She is visible in the middle of
the screen at 00:00:00.00 in a horizontal swimming position with her body extending from
the background toward the foreground right and her face turned toward the camera. Water bubbles are
visible rising above her head from both sides of her face.
Another person wearing the same is visible in the background. They are also swimming toward the
camera.
The surrounding is filled with bluish water with a rocky bottom covering most of the screen in the
foreground in the bottom half and the middle ground in the top half. They gradually become blurred
toward the background. The rocks in the foreground are visibly covered with thin patches of green algae.
Small gray fishes swim around the woman and are visible near the top edge.
As the video starts, the woman and the person in the background continue to swim and two fishes from the
top swim and come near the foreground on the right side. They have small yellow tails, a white body with a
black strip on top. The woman then starts turning her head at 00:00:00.98 to look at the
fishes. Her eyes then follow the movement of a fish that starts swimming toward the bottom from the front
of the woman. She also drops the object from her hand at 00:00:01.25 which is revealed to
be attached to a black stick extending from her. She then extends her right hand to touch the fish
and moves her hand near the fish such that the fish touches the back of her hand at
00:00:02.91 . The woman's gaze follows her.
The camera pans a bit toward the left at 00:00:03.83 and then again starts panning to the
right while moving a bit upward. The woman stops swimming and stays at one place from
00:00:04.33 as she looks at the camera. She is very close to the camera and is visible
mostly on the left half of the screen at the end at 00:00:05.11 .
Comment: well detailed and accurate annotation with great use of timestamps on actions, clear camera
description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/208037-clip_00000000_0.mp4
The video shows an older woman in an outdoor setting surrounded by various flowers and green plants.
The flowers and vases, in shades of brown, black, and gray, are lined up on the right side of the screen
against a dark-framed wall. Some flowers are nestled among green plant stems, with white, pink, and a
hint of purple flowers visible. Behind the woman, there’s a brown brick wall and a gray door frame on the
left, leading to another area with billowing plants.
The woman has short white hair and is dressed in an off-white shirt with buttons down the front, paired
with blue jeans. She wears silver dangling earrings in her right ear. Initially bending over, she straightens
up, at pulls away from a black vase, and at 00:00:01.60 places her left hand on a small
brown vase. At 00:00:02.77 she touches a white flower with her left hand and another with
her right, smiling as she admires them. Finally, at 00:00:08.40 she stretches her right hand
to caress one of the green leaves.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/239838-clip_00000000_0.mp4
A woman walks on a flat sandy terrain outdoors. She wears a wide-brimmed hat, sunglasses, a face mask
pulled below her chin, a rust-colored jacket over a pink t-shirt, and a backpack which she is wearing on her
shoulders. She also has long dark hair.
The background features a desert-like surface with vehicle tire marks behind her. On the left side of the
frame, a parked silver SUV with a black roof cargo bag is visible. The sun is low on the horizon behind her,
creating natural lighting and casting shadows on the ground.
At 00:00:00.00 she starts walking forward while holding her backpack straps.
At 00:00:01.79 she turns her head slightly to the right while continuing to walk. She continues
to turn her head until she looks almost behind her, with the sun now reflecting on her face, at
00:00:07.33.
The camera remains static, capturing the subject in a wide shot for the duration of the video. The scene is
illuminated by the sunlight emitted from the setting sun.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/155361-clip_00000000_0.mp4
The scene is set inside a gymnasium with large window panels on the upper walls, allowing ample sunlight
to stream in, creating a cooler tone. Below the windows are two wall-mounted air conditioners positioned
far apart, and an LED scoreboard displays a score of 12:11. The floor of the gymnasium is painted blue,
with a green area in the middle. The two subjects are fencers in standard white fencing attire, engaged in a
duel, each having a cable attached to the back of their attire.
The video begins with the fencers standing face to face, legs wide apart, holding up their foils and looking
for an opportunity to strike. The fencer on the right, whose body is facing the camera, is advancing toward
the fencer on the left, who is facing away from the camera and gently stepping back. At the
00:00:01.67 mark, the fencer on the left lifts his foil, and at 00:00:02.14 ,
both fencers lower their foils to feint at each other. The fencer on the right has advanced further to the
left of the frame, while the fencer on the left continues to retreat. At 00:00:04.42 , the
fencer on the left raises his foil, while the fencer on the right lowers his once again. At
00:00:04.57 , the fencer on the right retreats with his lowered foil and strikes forward,
moving his body ahead and taking a wide step to make contact with his rival. The fencer on the left lowers
his foil in an attempt to parry the attack, but ultimately fails.
The camera dollies to the left of the frame at a constant speed, tracking the movement of the subjects.
The environment is naturally lit due to the sunlight coming through the windows.
Comment: Well detailed movement breakdown, timestamps well placed and camera clearly described.
Bad task examples:
1. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/177736-clip_00000000_0.mp4
"00:00:00:01 The scene shows a fair skinned woman with blonde long hair in a brightly lit
room looking down at her phone during the day.
She is wearing a black sleeveless dress and holding a phone with a brown pouch on her right hand.
00:00:00:04She is seen at the middle of the screen bending her head downward leaning a
bit more into the phone.
00:00:01:74She moves a little bit backing out of the phone
00:00:02:96She tilts her head to the left side of the screen and is seen smiling at the phone
00:00:05:92She tilts her head back to the right side of the screen and holds the phone with
both her hands.
Sunlight can be seen at the back of the scene."
Comment: The first timestamp is incorrect because it doesn't correspond to an actual action, it should
only be used to mark specific moments when something happens.
In addition, the caption misses several key visual details. It does not mention the blurry background or the
object visible on the wall. There's also no description of the camera work, such as the angle, movement, or
framing, which are important for understanding how the scene is visually presented.
Important visual elements are also omitted, including the light reflecting on her face, her eye color, and her
eyelashes. Furthermore, the caption fails to note the moment when she closes her mouth, which is a clear
and noticeable action that should have been included.
2. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/347440-clip_00000000_0.mp4
00:00:04.54 Close-up of a street-side coffee vendor, visible only from the midsection. He’s
wearing a blue shirt with black buttons. With his right hand, he pours freshly brewed coffee into a small
white wine cup wrapped in crystals that he's holding with his left hand. In the bottom right corner of the
frame, an ornate coffeepot with a crystal lid and a small ornate plate are visible.
Comment: The caption is too short and omits several important visual details. For instance,
there's no mention of the man's blue jeans, his dark blue, short-sleeve shirt with dotted patterns,
or his hairy hands—all of which are clearly visible and contribute to the scene’s realism.
Additionally, key background elements are missing: the blurry area to his left reveals a circular
floor design, and there's a stainless steel coffee stand that goes unmentioned. The small coffee
cup is described inadequately—it features a white plastic upper part, a shiny, crystalline lower
part, a scooping spoon, and a distinct gold band separating the two sections, but none of this is
captured in the caption.
To his right, there’s a truncated stainless steel bowl resting on a brown wooden stool, and in front
of it, a black, wrapped cloth becomes visible as the camera pans. These are also excluded from
the description.
The caption also lacks an action timestamp to indicate when the man is scooping coffee into the
cup, which is a key moment. Finally, it fails to mention the natural lighting conditions and a
daytime scene. The timestamp in the caption does not point to an action.
3. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/309939-clip_00000000_1.mp4
00:00:00.00 The right profile of a brown-skinned boy is seen on the left side of the frame,
being instructed on how to play the violin. He is wearing a short-sleeved navy blue collared shirt and is
visible only from his chest to his brows. In the slightly blurry background, a mid-sized plant sits in front of a
large window. Everything is painted white. Outside the window, green trees are visible.
00:00:01.55 A woman's hands appear in the bottom right corner of the frame, gently
assisting the boy's right hand as he uses the bow on the violin.
Comment: The first timestamp does not correspond to any specific action. Additionally, it is
inaccurate to describe the boy as brown-skinned; he is fair or light-skinned. It's also subjective to
assume the hands shown belong to a woman when there’s no clear indication of gender; it's more
appropriate to refer to them simply as fair-skinned hands.
Several important visual and contextual details are missing. There is no mention of the violin's
placement, the finger movements of the boy’s left hand, or his focused gaze on the instrument
while playing. The camera movement, angle, and framing are also left out, which are important
for understanding how the scene is presented.
The caption omits the actions of the other person’s hands as well: at first, both hands come into
view holding the bow; then, the right hand is removed, and the left hand remains on the bow,
assisting him in practicing until the video ends.
Additional descriptive elements are also missing, such as the boy’s hair length or color, the
natural lighting in the room through the window, and the color of the violin.
Use these task examples for reference on how you will execute the tasks. Follow the order; the first sentence should summarize the video, including setting and featuring character, then the main subject, subject's skin tone and ethnicity, full dressing code of the subjects, action done by the subject, background information, lighting, and camera positioning. I will upload a screenshot for you to start. The lighting and camera angle should be in one line each.
Kindly say if it's a man, a woman, or a child. Do not use "what appears to be" or "appears to be," or "suggesting" in your explanations, as this shows a lack of confidence.
Task Examples
Good task examples:
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/247758-clip_00000000_0.mp4
"The setting is outdoors, surrounded by lush greenery, including many plants and trees in the background.
The main focus is a woman and a man.
A woman with thick black hair is in the foreground, wearing a white top, a silver colored chain, and
a surgical mask. Two moles are visible on her face, one at the edge of her right eyebrow and another on
the right side of her nose near her right eye above the mask. Behind her, slightly to her left, a man with
blonde hair, black framed glasses, and a blue checkered shirt is also wearing a surgical mask. He stands
close behind her left side with his right hand resting on her right shoulder.
At 00:00:00.71 , the man first looks at the woman before turning his gaze toward the
camera. His right hand is already on her shoulder. He then slowly moves his hand slightly to the right and
gently clenches her shoulder at 00:00:06.40 , and at 00:00:06.44 , he blinks
his eyes. Both are now looking at the camera.
The lighting is natural daylight. The video starts with a close-up shot, and the camera slightly tilts
upward."
Comment: This annotation is clear and objective, includes detailed descriptions of characters and their
appearance, positioning and movements. Has precise timestamps for actions and lighting conditions and
camera movement are well described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/337407-clip_00000000_1.mp4
"The video takes place in a bright green background with even lighting, capturing a woman dancing
throughout. The focus is on a young woman with long, wavy, light brown hair. She wears an oversized white
long-sleeve shirt, round earrings and interacts with her hair while posing. At various points in the video, she
looks up and smiles.
At 00:00:00.06 , she raises her left hand and touches her hair with her fingers. Then, at
00:00:01.23 , she raises her right hand, touches her shoulder, moves her right hand through
her hair, and returns her left hand to her side at 00:00:02.12 . She begins jumping four
times, moving both hands up and down alternately and smiling. At 00:00:02.95 , she looks
directly at the camera and flips her hair upward with both hands. At 00:00:03.77 , she looks
up and stretches her arms above shoulder level and pucker whistling, lowering them below head level at
00:00:04.65 . Her movements gently reflect her shadow on the background.
The camera remains static with a medium shot angle throughout the video. The lighting conditions are
artificial indoor lighting."
Comment: clear and detailed action breakdown with accurate timestamps and no subjectivity. Well
described lighting condition and camera movements
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/254480-clip_00000000_0.mp4
The scene is set inside a building that appears to be a doctor's office. In the background are large windows
covered by translucent blinds, through which sunlight enters and illuminates the space. Beyond the
windows, there are faint silhouettes of skyscrapers that are blurred out. In the middle of the windows is a
pillar covered in red brick tiles. There are two female subjects present in the frame, both of whom are
wearing flu masks and dark blue sanitary gloves, and have slightly tanned skin tones with dark hair.
The first subject on the left has shoulder-length straight black hair and is wearing a brown button-down
shirt. She is clutching a silver tablet device with a black case and a dark blue piece of paper in her right
arm. Throughout the video, she talks to the subject on the right while looking at her and briefly looks away
from the second subject at 00:00:04.30 until 00:00:05.96 without changing
the direction of her head, referring to a white piece of paper by signaling at it with her hand.
From 00:00:02.18 to 00:00:03.02 , the subject on the left gently nods her head.
She also shrugs her left shoulder slightly while nodding her head at 00:00:04.30 . The
second subject is wearing a pastel pink button-down shirt with a white lab coat. She has curly hair styled in
a single ponytail and is holding up two pieces of paper in one hand, looking at their contents throughout
the video while the first subject talks. Both subjects have neutral facial expressions, which can be inferred
from their eyes and eyebrows, as their faces are covered with masks.
The camera is moving towards the left side of the frame at a fixed height, capturing the subjects above the
waist. The scene is lit by the natural sunlight coming through the windows.
Comment: well structured and detailed annotation with good timestamps and camera description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/365681-clip_00000000_0.mp4
The scene is set outdoors in a forest filled with tall green trees and a pale blue twilight sky. The subject is a
young white blonde woman with blue eyes who is wearing a white sleeveless dress. The scene begins with
the woman facing forward and walking ahead, with the camera capturing her from the back at an angle
biased toward her left side.
At 00:00:00.50 , the woman begins to turn back to look over her left shoulder. At
00:00:03.12 , she has turned her body almost fully around to look back and slightly above
her eye level, wearing an anxious expression that conveys fear. At 00:00:03.86 , she begins
to turn back, and by 00:00:06.12 , she has completed turning her head forward again, with
her hair flowing in the direction of her head movement, and she starts to pick up the pace. As she begins
walking faster, her loose hair flows and bounces, reflecting her hurried manner of walking.
The lighting is natural, and the camera follows the woman's movement, capturing her only from the waist
up. The background is blurred with a bokeh effect.
Comment: Strong visual detail, with clear timestamps and camera movement.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/131239-clip_00000000_0.mp4
The scene is set in the afternoon in a jungle, featuring a paved path slightly covered with rocks and
surrounded by tall trees and short grass, with sunlight filtering through the foliage. The video begins with
the subject out of frame. The subject is a thin white female with dark, short, and curly hair, running for
exercise while wearing black capri pants and a light maroon long-sleeve t-shirt, along with gray sneakers
and white socks.
At the 00:00:00.58 mark, the subject enters the frame from the left side, running in the
same direction as the camera. Upon entering the frame, the subject gradually overtakes the camera, and
the distance between her and the camera increases as the video progresses.
Comment: Clear and focused description of environment and movement, timestamps well used and
camera movement well described in relation to the subject
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/253997-clip_00000000_0.mp4
The scene begins with a female subject who is in her early twenties, is thin, and has light skin and dark hair
styled in a ponytail that reaches the length of her back. She is wearing beige trousers and a black and
white Aztec diamond-patterned overshirt over a white T-shirt, along with silver drop earrings. She stands
beside a river that flows parallel to the direction of the camera. On the left side of the frame is the subject,
who is standing atop small grass-covered rock formations and tree roots, with foliage of varying heights in
the background and a tree about 8 feet behind her. The right side of the frame is dominated by the flow of
the river, which is pale blue in color and has a rock protruding from the surface. In the distance, on the
ground of the right side of the frame next to the river, there is some foliage.
As the video progresses, the foliage sways slightly due to the wind, along with the overshirt of the female
subject. The subject displays an expression of calmness and pleasure in the clip, momentarily closing her
eyes as well. At the 00:00:02.39 mark, she drops her hand, which was originally over her
head, and looks towards the camera. At 00:00:05.86 , she raises her hand to her collarbone
while tilting her head and closing her eyes. Afterwards, she lowers her hand and wraps it around her waist.
During the length of the video, her right hand remains stationary and rests beside her leg. At
00:00:05.91 , as the camera pans to the right, a second rock protruding from within the
river comes into view.
The camera is positioned at a fixed distance from the subject but occasionally pans around and changes
slightly. The scene is illuminated primarily by warm sunlight, as well as by light reflected off the water.
Comment: Highly detailed and well structured annotation, with clear timestamped actions and accurately
described camera behavior.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/199735-clip_00000000_0.mp4
The scene takes place indoors in an office, featuring a white man in his forties with gray hair and a short
beard. He is wearing a blue, long-sleeve button-down shirt and gray pants, and he is wearing glasses. He is
seated at his gray desk, operating a desktop computer with his left hand while taking notes with his right
hand. A metal table lamp, which is lit, is positioned to the right of the monitor. In the background, a glass
wall consists of white beams serving as panels, through which another office, mostly obscured and
illuminated with neon blue light, can be seen. A white blonde woman is seated at her desk, facing to the
right of the frame.
Between 00:00:00.00 and 00:00:01.42 , he looks back and forth between
his notebook and the monitor before focusing on the computer screen. Over his right shoulder, on the glass
wall in the background, is a sketch of the front of a car, while over his left shoulder is a sketch of the top
view of the car from an angle. At 00:00:07.10 , he stops taking notes and presses a single
key on the keyboard, looking at the camera with a neutral expression.
The majority of the lighting is dim and artificial, with a blue hue, likely projecting from ceiling lights.
Additionally, the scene is illuminated by a table lamp which is projecting warm light on to the desk of the
main subject. Further, there are neon blue light fixtures in the background that are adding illumination. The
camera zooms in slowly throughout the scene.
Comment: detailed and objective description with well used timestamps, great camera and lighting
condition description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/376414-clip_00000000_0.mp4
The scene is set at sunset in a forest with dense trees in the background, featuring a patch of trees on the
left side of the frame that has a few bright yellow leaves. Small clouds of smoke linger behind the subject,
who is a bald, slender, gray-haired monk in his 50s, wearing a deep red Buddhist robe. The monk is seated
still with his eyes closed, wearing a neutral expression while meditating.
He maintains this position for the duration of the clip, while the camera gradually moves closer to him as
the clip progresses.
The camera captures only the monk from the waist up and zooms in at an upward angle. The scene is
illuminated by the warm natural sunlight from sunset through the trees and leaves in the background.
Comment: Strong environmental detail and lighting description, camera movement clearly described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/374764-clip_00000000_0.mp4
The scene is set on a street surrounded by various buildings. The off-white building on the left side of the
frame features revivalist architecture and extends into the distance, while the one on the right has modern
architecture and is also off-white in color. The subject is Black, with curly hair and a high fade haircut. He is
wearing a thin grey turtleneck sweater and a black backpack with yellow and black steel zippers, along
with a leather strap chronograph wristwatch and a face mask. In the distance, there are modern
skyscrapers primarily made of glass.
The video begins with the subject looking down at the screen of his phone, which he is holding with both
hands while typing. He is standing next to the upper part of the stairway leading to the subway below.
There is a black sign above the staircase indicating the name of the station, i.e., 34 St-Penn Station, in
white and blue font. The middle rail of the stairway is slightly visible on the left side of the frame.
At the 00:00:03.05 mark, a man wearing a white and blue polo shirt and a white hat starts
to walk across the bridge, on the wall of which the aforementioned sign is attached. At the
00:00:05.97 mark, a steel grey hatchback drives across the road into the distance from the
right side of the frame.
The camera is handheld at a fixed position, capturing the subject from the waist up. In terms of lighting,
the environment is naturally sunlit with overcast clouds.
Comment: Great details for the character and well timestamped actions, camera and lighting condition
are described clearly.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/378498-clip_00000000_0.mp4
The scene begins with a young Black man standing and clutching thin window bars in a dark, poorly lit
room, looking through them at the outdoor environment that appears to contain trees in the distance
beyond a large empty courtyard. He is standing in the right half of the frame. The subject is wearing a
raglan half-sleeve T-shirt with a white torso and black sleeves, along with black true wireless earphones.
He has short, curly black hair and a low fade haircut, and a neutral expression on his face as he peers
through the window bars.
At the 00:00:02.03 mark, the camera moves forward enough to crop out the background
of the interior while getting closer to the subject, who is only visible from above the shoulders, along with
his forearms and hands.
The camera is positioned to the left side of the subject. The scene is primarily illuminated by sunlight and
the white tube light in the interior.
Comment: Well balanced annotation, objective and great character description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/275926-clip_00000000_1.mp4
The video shows a person on the left side of the screen, kneeling on a wooden plank surface as he plants
tubers into the dark soil. He is wearing blue jeans, a gray knitted sweater, and black gloves. Beside his knee
is a white container holding the tubers. The soil, positioned on the right side of the screen, has been dug
into a trench, forming a small mound along the edge. One tuber is already planted at the far end.
At 00:00:01.10 , the person places a tuber into the soil. Then he moves his hand into the
container, picks another one, and plants it at 00:00:04.87 . He repeats this process,
steadily working along the dugout space. The camera starts off still, then gradually tilts upward, capturing
the scene.
Comment: Great scene description with accurate layout and it is objective.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/291922-clip_00000000_1.mp4
The video is set indoors in a studio, with a plain black background that has a round light source attached in
the middle to cast a warm light, making the subject appear as a silhouette. From what can be inferred, the
subject is wearing waist-high flare pants and a tucked-in long-sleeve shirt. The subject has short hair, and
in this scene, they are performing hand combat moves using a small knife.
At 00:00:00.00 , the subject is facing their body towards the camera while turned to their
left, wielding the knife in their right hand. They raise the knife upwards with an arm movement while
throwing a punch with their other hand until the 00:00:00.26 mark. Afterwards, they take
a neutral stance and proceed to swing the blade over and then under their hand to their right side while
executing an additional move in that direction with the blade at eye level at 00:00:02.54 .
At 00:00:03.89 , the subject adopts another neutral stance, keeping the blade and their
free hand close to their body while looking to the right. By 00:00:05.09 , they proceed to
throw another combination involving a low blow and another strike at eye level while facing their right
side. They repeat the combination one more time, this time having their left arm resting against the side of
their waist by 00:00:06.85 . The video concludes with the subject initiating the
aforementioned combination again in the same direction, which they start for the last time at
00:00:07.50 , this time having their left arm raised to their head in a defensive position.
The camera remains static, with the subject placed in the middle of the frame the entire time. The scene is
lit using a single warm elliptical light source in the background.
Comment: Well detailed and structured annotation with good timestamped actions and an objective
tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/285583-clip_00000000_1.mp4
A woman in her twenties with a fair skin tone and golden brown hair tied behind her head swims toward
the foreground right of the frame wearing a black scuba diving suit and yellow gloves. The oxygen
cylinder behind her back has a pale yellow shade. She holds a small metallic silver rectangular object
between her left index finger and thumb. The flaps on her feet are also black. She is visible in the middle of
the screen at 00:00:00.00 in a horizontal swimming position with her body extending from
the background toward the foreground right and her face turned toward the camera. Water bubbles are
visible rising above her head from both sides of her face.
Another person wearing the same gear is visible in the background. They are also swimming toward the
camera.
The surroundings are filled with bluish water, with a rocky bottom covering most of the screen in the
foreground in the bottom half and the middle ground in the top half. They gradually become blurred
toward the background. The rocks in the foreground are visibly covered with thin patches of green algae.
Small gray fishes swim around the woman and are visible near the top edge.
As the video starts, the woman and the person in the background continue to swim and two fishes from the
top swim and come near the foreground on the right side. They have small yellow tails, a white body with a
black strip on top. The woman then starts turning her head at 00:00:00.98 to look at the
fishes. Her eyes then follow the movement of a fish that starts swimming toward the bottom from the front
of the woman. She also drops the object from her hand at 00:00:01.25 which is revealed to
be attached to a black stick extending from her. She then extends her right hand to touch the fish
and moves her hand near the fish such that the fish touches the back of her hand at
00:00:02.91 . The woman's gaze follows her.
The camera pans a bit toward the left at 00:00:03.83 and then again starts panning to the
right while moving a bit upward. The woman stops swimming and stays at one place from
00:00:04.33 as she looks at the camera. She is very close to the camera and is visible
mostly on the left half of the screen at the end at 00:00:05.11 .
Comment: well detailed and accurate annotation with great use of timestamps on actions, clear camera
description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/208037-clip_00000000_0.mp4
The video shows an older woman in an outdoor setting surrounded by various flowers and green plants.
The flowers and vases, in shades of brown, black, and gray, are lined up on the right side of the screen
against a dark-framed wall. Some flowers are nestled among green plant stems, with white, pink, and a
hint of purple flowers visible. Behind the woman, there’s a brown brick wall and a gray door frame on the
left, leading to another area with billowing plants.
The woman has short white hair and is dressed in an off-white shirt with buttons down the front, paired
with blue jeans. She wears silver dangling earrings in her right ear. Initially bending over, she straightens
up, pulls away from a black vase, and at 00:00:01.60 places her left hand on a small
brown vase. At 00:00:02.77 she touches a white flower with her left hand and another with
her right, smiling as she admires them. Finally, at 00:00:08.40 she stretches her right hand
to caress one of the green leaves.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/239838-clip_00000000_0.mp4
A woman walks on a flat sandy terrain outdoors. She wears a wide-brimmed hat, sunglasses, a face mask
pulled below her chin, a rust-colored jacket over a pink t-shirt, and a backpack which she is wearing on her
shoulders. She also has long dark hair.
The background features a desert-like surface with vehicle tire marks behind her. On the left side of the
frame, a parked silver SUV with a black roof cargo bag is visible. The sun is low on the horizon behind her,
creating natural lighting and casting shadows on the ground.
At 00:00:00.00 she starts walking forward while holding her backpack straps.
At 00:00:01.79 she turns her head slightly to the right while continuing to walk. She continues
to turn her head until she looks almost behind her, with the sun now shining on her face, at
00:00:07.33 .
The camera remains static, capturing the subject in a wide shot for the duration of the video. The scene is
illuminated by the sunlight emitted from the setting sun.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/155361-clip_00000000_0.mp4
The scene is set inside a gymnasium with large window panels on the upper walls, allowing ample sunlight
to stream in, creating a cooler tone. Below the windows are two wall-mounted air conditioners positioned
far apart, and an LED scoreboard displays a score of 12:11. The floor of the gymnasium is painted blue,
with a green area in the middle. The two subjects are fencers in standard white fencing attire, engaged in a
duel, each having a cable attached to the back of their attire.
The video begins with the fencers standing face to face, legs wide apart, holding up their foils and looking
for an opportunity to strike. The fencer on the right, whose body is facing the camera, is advancing toward
the fencer on the left, who is facing away from the camera and gently stepping back. At the
00:00:01.67 mark, the fencer on the left lifts his foil, and at 00:00:02.14 ,
both fencers lower their foils to feint at each other. The fencer on the right has advanced further to the
left of the frame, while the fencer on the left continues to retreat. At 00:00:04.42 , the
fencer on the left raises his foil, while the fencer on the right lowers his once again. At
00:00:04.57 , the fencer on the right retreats with his lowered foil and strikes forward,
moving his body ahead and taking a wide step to make contact with his rival. The fencer on the left lowers
his foil in an attempt to parry the attack, but ultimately fails.
The camera dollies to the left of the frame at a constant speed, tracking the movement of the subjects.
The environment is naturally lit due to the sunlight coming through the windows.
Comment: Well detailed movement breakdown, timestamps well placed and camera clearly described.
Bad task examples:
1. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/177736-clip_00000000_0.mp4
"00:00:00:01 The scene shows a fair skinned woman with blonde long hair in a brightly lit
room looking down at her phone during the day.
She is wearing a black sleeveless dress and holding a phone with a brown pouch on her right hand.
00:00:00:04 She is seen at the middle of the screen bending her head downward leaning a
bit more into the phone.
00:00:01:74 She moves a little bit backing out of the phone
00:00:02:96 She tilts her head to the left side of the screen and is seen smiling at the phone
00:00:05:92 She tilts her head back to the right side of the screen and holds the phone with
both her hands.
Sunlight can be seen at the back of the scene."
Comment: The first timestamp is incorrect because it doesn't correspond to an actual action; timestamps should
only be used to mark specific moments when something happens.
In addition, the caption misses several key visual details. It does not mention the blurry background or the
object visible on the wall. There's also no description of the camera work, such as the angle, movement, or
framing, which are important for understanding how the scene is visually presented.
Important visual elements are also omitted, including the light reflecting on her face, her eye color, and her
eyelashes. Furthermore, the caption fails to note the moment when she closes her mouth, which is a clear
and noticeable action that should have been included.
2. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/347440-clip_00000000_0.mp4
00:00:04.54 Close-up of a street-side coffee vendor, visible only from the midsection. He’s
wearing a blue shirt with black buttons. With his right hand, he pours freshly brewed coffee into a small
white wine cup wrapped in crystals that he's holding with his left hand. In the bottom right corner of the
frame, an ornate coffeepot with a crystal lid and a small ornate plate are visible.
Comment: The caption is too short and omits several important visual details. For instance,
there's no mention of the man's blue jeans, his dark blue, short-sleeve shirt with dotted patterns,
or his hairy hands—all of which are clearly visible and contribute to the scene’s realism.
Additionally, key background elements are missing: the blurry area to his left reveals a circular
floor design, and there's a stainless steel coffee stand that goes unmentioned. The small coffee
cup is described inadequately—it features a white plastic upper part, a shiny, crystalline lower
part, a scooping spoon, and a distinct gold band separating the two sections, but none of this is
captured in the caption.
To his right, there’s a truncated stainless steel bowl resting on a brown wooden stool, and in front
of it, a black, wrapped cloth becomes visible as the camera pans. These are also excluded from
the description.
The caption also lacks an action timestamp to indicate when the man is scooping coffee into the
cup, which is a key moment. Finally, it fails to mention the natural lighting conditions and the
daytime setting. The timestamp in the caption does not point to an action.
3. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/309939-clip_00000000_1.mp4
00:00:00.00 The right profile of a brown-skinned boy is seen on the left side of the frame,
being instructed on how to play the violin. He is wearing a short-sleeved navy blue collared shirt and is
visible only from his chest to his brows. In the slightly blurry background, a mid-sized plant sits in front of a
large window. Everything is painted white. Outside the window, green trees are visible.
00:00:01.55 A woman's hands appear in the bottom right corner of the frame, gently
assisting the boy's right hand as he uses the bow on the violin.
Comment: The first timestamp does not correspond to any specific action. Additionally, it is
inaccurate to describe the boy as brown-skinned; he is fair or light-skinned. It's also subjective to
assume the hands shown belong to a woman when there’s no clear indication of gender; it's more
appropriate to refer to them simply as fair-skinned hands.
Several important visual and contextual details are missing. There is no mention of the violin's
placement, the finger movements of the boy’s left hand, or his focused gaze on the instrument
while playing. The camera movement, angle, and framing are also left out, which are important
for understanding how the scene is presented.
The caption omits the actions of the other person’s hands as well: at first, both hands come into
view holding the bow; then, the right hand is removed, and the left hand remains on the bow,
assisting him in practicing until the video ends.
Additional descriptive elements are also missing, such as the boy’s hair length or color, the
natural lighting in the room through the window, and the color of the violin.
Follow the order: the first sentence should summarize the video, including the setting and the featured character, followed by the main subject, the subject's skin tone and ethnicity, the subjects' full outfit, the actions performed by the subject, background information, lighting, and camera positioning. I will upload a screenshot for you to start. The lighting and the camera angle should each be described on one line.
Kindly say if it's a man, a woman, or a child. Do not use "what appears to be," "appears to be," or "suggesting" in your explanations, as this shows a lack of confidence.
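The wording rule above, and the timestamp usage discussed in the comments, could be checked automatically before a caption is submitted. The following is only a minimal sketch, not part of the guideline: the function names are hypothetical, and the timestamp pattern `HH:MM:SS.ff` is an assumption inferred from the good examples.

```python
import re

# Phrases the guideline says to avoid because they show a lack of confidence.
HEDGING_PHRASES = ("what appears to be", "appears to be", "suggesting")

# Assumed timestamp shape, inferred from the good examples: HH:MM:SS.ff
TIMESTAMP = re.compile(r"\b\d{2}:\d{2}:\d{2}\.\d{2}\b")


def find_hedging(caption: str) -> list[str]:
    """Return every banned phrase that occurs in the caption (case-insensitive)."""
    lowered = caption.lower()
    return [phrase for phrase in HEDGING_PHRASES if phrase in lowered]


def find_timestamps(caption: str) -> list[str]:
    """Return all timestamps in the caption that match the assumed format."""
    return TIMESTAMP.findall(caption)


if __name__ == "__main__":
    sample = "At 00:00:01.60 she places her hand on what appears to be a vase."
    print(find_hedging(sample))
    print(find_timestamps(sample))
```

A caption that returns an empty list from `find_hedging` passes the wording rule. An empty list from `find_timestamps` only means that no timestamp matches the assumed format; whether each timestamp marks a real action still has to be checked by a reviewer.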
00:00:00.71 , the man first looks at the woman before turning his gaze toward the
camera. His right hand is already on her shoulder. He then slowly moves his hand slightly to the right and
gently clenches her shoulder at 00:00:06.40 , and at 00:00:06.44 , he blinks
his eyes. Both are now looking at the camera.
The lighting is natural daylight. The video starts with a close-up shot, and the camera slightly tilts
upward."
Comment: This annotation is clear and objective, includes detailed descriptions of characters and their
appearance, positioning and movements. Has precise timestamps for actions and lighting conditions and
camera movement are well described.
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/337407-clip_000000
00_1.mp4
"The video takes place in a bright green background with even lighting, capturing a woman dancing
throughout. The focus is on a young woman with long, wavy, light brown hair. She wears an oversized white
long-sleeve shirt, round earrings and interacts with her hair while posing. At various points in the video, she
looks up and smiles.
At 00:00:00.06 , she raises her left hand and touches her hair with her fingers. Then, at
00:00:01.23 , she raises her right hand, touches her shoulder, moves her right hand through
her hair, and returns her left hand to her side at 00:00:02.12 . She begins jumping four
times, moving both hands up and down alternately and smiling. At 00:00:02.95 , she looks
directly at the camera and flips her hair upward with both hands. At 00:00:03.77 , she looks
up and stretches her arms above shoulder level and pucker whistling, lowering them below head level at
00:00:04.65 . Her movements gently reflect her shadow on the background.
The camera remains static with a medium shot angle throughout the video. The lighting conditions are
artificial indoor lighting."
Comment: clear and detailed action breakdown with accurate timestamps and no subjectivity. Well
described lighting condition and camera movements
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/254480-clip_000000
00_0.mp4
The scene is set inside a building that appears to be a doctor's oice. In the background are large windows
covered by translucent blinds, through which sunlight enters and illuminates the space. Beyond the
windows, there are faint silhouettes of skyscrapers that are blurred out. In the middle of the windows is a
pillar covered in red brick tiles. There are two female subjects present in the frame, both of whom are
wearing flu masks and dark blue sanitary gloves, and have slightly tanned skin tones with dark hair.
The first subject on the left has shoulder-length straight black hair and is wearing a brown buon-down
shirt. She is clutching a silver tablet device with a black case and a dark blue piece of paper in her right
arm. Throughout the video, she talks to the subject on the right while looking at her and briefly looks away
from the second subject at 00:00:04.30 until 00:00:05.96 without changing
the direction of her head, referring to a white piece of paper by signaling at it with her hand.
At 00:00:02.18 to 00:00:03.02 , the subject on the left gently nods her head.
She also shrugs her left shoulder slightly while nodding her head at 00:00:04.30 . The
second subject is wearing a pastel pink buon-down shirt with a white lab coat. She has curly hair styled in
a single ponytail and is holding up two pieces of paper in one hand, looking at their contents throughout
the video while the first subject talks. Both subjects have neutral facial expressions, which can be inferred
from their eyes and eyebrows, as their faces are covered with masks.
The camera is moving towards the left side of the frame at a fixed height, capturing the subjects above the
waist. The scene is lit by the natural sunlight coming through the windows.
Comments: well structured and detailed annotation with good timestamps and camera description.
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/365681-clip_000000
00_0.mp4
The scene is set outdoors in a forest filled with tall green trees and a pale blue twilight sky. The subject is a
young white blonde woman with blue eyes who is wearing a white sleeveless dress. The scene begins with
the woman facing forward and walking ahead, with the camera capturing her from the back at an angle
biased toward her left side.
At 00:00:00.50 , the woman begins to turn back to look over her left shoulder. At
00:00:03.12 , she has turned her body almost fully around to look back and slightly above
her eye level, wearing an anxious expression that conveys fear. At 00:00:03.86 , she begins
to turn back, and by 00:00:06.12 , she has completed turning her head forward again, with
her hair flowing in the direction of her head movement and starts to pick up the pace. As she begins
walking faster, her loose hair flows and bounces, reflecting her hurried manner of walking.
The lighting is natural, and the camera follows the woman's movement, capturing her only from the waist
up. The background is blurred with a bokeh effect.
Comment: Strong visual detail, with clear timestamped and camera movement.
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/131239-clip_0000000
0_0.mp4
The scene is set in the afternoon in a jungle, featuring a paved path slightly covered with rocks and
surrounded by tall trees and short grass, with sunlight filtering through the foliage. The video begins with
the subject out of frame. The subject is a thin white female with dark, short, and curly hair, running for
exercise while wearing black capri pants and a light maroon long-sleeve t-shirt, along with gray sneakers
and white socks.
At the 00:00:00.58 mark, the subject enters the frame from the left side, running in the
same direction as the camera. Upon entering the frame, the subject gradually overtakes the camera, and
the distance between her and the camera increases as the video progresses.
Comment: Clear and focused description of environment and movement, timestamps well used and
camera movement well described in relation with the subject
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/253997-clip_000000
00_0.mp4
The scene begins with a female subject who in her early twenties, is thin and has light skin and dark hair
styled in a ponytail that reaches the length of her back. She is wearing beige trousers and a black and
white Aztec diamond-patterned overshirt over a white T-shirt, along with silver drop earrings. She stands
beside a river that flows parallel to the direction of the camera. On the left side of the frame is the subject,
who is standing atop small grass-covered rock formations and tree roots, with foliage of varying heights in
the background and a tree about 8 feet behind her. The right side of the frame is dominated by the flow of
the river, which is pale blue in color and has a rock protruding from the surface. In the distance, on the
ground of the right side of the frame next to the river, there is some foliage.
As the video progresses, the foliage sways slightly due to the wind, along with the overshirt of the female
subject. The subject displays an expression of calmness and pleasure in the clip, momentarily closing her
eyes as well. At the 00:00:02.39 mark, she drops her hand, which was originally over her
head, and looks towards the camera. At 00:00:05.86 , she raises her hand to her collarbone
while tilting her head and closing her eyes. Afterwards, she lowers her hand and wraps it around her waist.
During the length of the video, her right hand remains stationary and rests beside her leg. At
00:00:05.91 , as the camera pans to the right, a second rock protruding from within the
river comes into view.
The camera is positioned at a fixed distance from the subject but occasionally pans around and changes
slightly. The scene is illuminated primarily by warm sunlight, as well as by light reflected off the water.
Comment: Highly detailed and well structured annotation, with clear timestamped actions, and accurate
camera behavior well described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/199735-clip_00000000_0.mp4
The scene takes place indoors in an office, featuring a white man in his forties with gray hair and a short
beard. He is wearing a blue, long-sleeve button-down shirt and gray pants, and he is wearing glasses. He is
seated at his gray desk, operating a desktop computer with his left hand while taking notes with his right
hand. A metal table lamp, which is lit, is positioned to the right of the monitor. In the background, a glass
wall consists of white beams serving as panels, through which another office, mostly obscured and
illuminated with neon blue light, can be seen. A white blonde woman is seated at her desk, facing to the
right of the frame.
Between 00:00:00.00 and 00:00:01.42 , he looks back and forth between
his notebook and the monitor before focusing on the computer screen. Over his right shoulder, on the glass
wall in the background, is a sketch of the front of a car, while over his left shoulder is a sketch of the top
view of the car from an angle. At 00:00:07.10 , he stops taking notes and presses a single
key on the keyboard, looking at the camera with a neutral expression.
The majority of the lighting is dim and artificial, with a blue hue, likely projecting from ceiling lights.
Additionally, the scene is illuminated by a table lamp which is projecting warm light on to the desk of the
main subject. Further, there are neon blue light fixtures in the background that are adding illumination. The
camera zooms in slowly throughout the scene.
Comment: detailed and objective description with well used timestamps, great camera and lighting
condition description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/376414-clip_00000000_0.mp4
The scene is set at sunset in a forest with dense trees in the background, featuring a patch of trees on the
left side of the frame that has a few bright yellow leaves. Small clouds of smoke linger behind the subject,
who is a bald, slender, gray-haired monk in his 50s, wearing a deep red Buddhist robe. The monk is seated
still with his eyes closed, wearing a neutral expression while meditating.
He maintains this position for the duration of the clip, while the camera gradually moves closer to him as
the clip progresses.
The camera captures only the monk from the waist up and zooms in at an upward angle. The scene is
illuminated by the warm natural sunlight from sunset through the trees and leaves in the background.
Comment: Strong environmental detail and lighting description, camera movement clearly described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/374764-clip_00000000_0.mp4
The scene is set on a street surrounded by various buildings. The off-white building on the left side of the
frame features revivalist architecture and extends into the distance, while the one on the right has modern
architecture and is also off-white in color. The subject is Black, with curly hair and a high fade haircut. He is
wearing a thin grey turtleneck sweater and a black backpack with yellow and black steel zippers, along
with a leather strap chronograph wristwatch and a face mask. In the distance, there are modern
skyscrapers primarily made of glass.
The video begins with the subject looking down at the screen of his phone, which he is holding with both
hands while typing. He is standing next to the upper part of the stairway leading to the subway below.
There is a black sign overhead the staircase indicating the name of the station, i.e., 34 St-Penn Station, in
white and blue font. The middle rail of the stairway is slightly visible on the left side of the frame.
At the 00:00:03.05 mark, a man wearing a white and blue polo shirt and a white hat starts
to walk across the bridge, on the wall of which the aforementioned sign is attached. At the
00:00:05.97 mark, a steel grey hatchback drives across the road into the distance from the
right side of the frame.
The camera is handheld at a fixed position, capturing the subject from the waist up. In terms of lighting,
the environment is naturally sunlit with overcast clouds.
Comment: Great details for the character and well timestamped actions, camera and lighting condition
are described clearly.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/378498-clip_00000000_0.mp4
The scene begins with a young Black man standing and clutching thin window bars in a dark, poorly lit
room, looking through them at the outdoor environment that appears to contain trees in the distance
beyond a large empty courtyard. He is standing in the right half of the frame. The subject is wearing a
raglan half-sleeve T-shirt with a white torso and black sleeves, along with black true wireless earphones.
He has short, curly black hair and a low fade haircut, and a neutral expression on his face as he peers
through the window bars.
At the 00:00:02.03 mark, the camera moves forward enough to crop out the background
of the interior while getting closer to the subject, who is only visible from above the shoulders, along with
his forearms and hands.
The camera is positioned to the left side of the subject. The scene is primarily illuminated by sunlight and
the white tube light in the interior.
Comment: Well balanced annotation, objective and great character description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/275926-clip_00000000_1.mp4
The video shows a person on the left side of the screen, kneeling on a wooden plank surface as he plants
tubers into the dark soil. He is wearing blue jeans, a gray knitted sweater, and black gloves. Beside his knee
is a white container holding the tubers. The soil, positioned on the right side of the screen, has been dug
into a trench, forming a small mound along the edge. One tuber is already planted at the far end.
At 00:00:01.10 , the person places a tuber into the soil. Then he moves his hand into the
container, picks another one, and plants it at 00:00:04.87 . He repeats this process,
steadily working along the dugout space. The camera starts off still, then gradually tilts upward, capturing
the scene.
Comment: Great scene description with accurate layout and it is objective.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/291922-clip_00000000_1.mp4
The video is set indoors in a studio, with a plain black background that has a round light source attached in
the middle to cast a warm light, making the subject appear as a silhouette. From what can be inferred, the
subject is wearing waist-high flare pants and a tucked-in long-sleeve shirt. The subject has short hair, and
in this scene, they are performing hand combat moves using a small knife.
At 00:00:00.00 , the subject is facing their body towards the camera while turned to their
left, wielding the knife in their right hand. They raise the knife upwards with an arm movement while
throwing a punch with their other hand until the 00:00:00.26 mark. Afterwards, they take
a neutral stance and proceed to swing the blade over and then under their hand to their right side while
executing an additional move in that direction with the blade at eye level at 00:00:02.54 .
At 00:00:03.89 , the subject adopts another neutral stance, keeping the blade and their
free hand close to their body while looking to the right. By 00:00:05.09 , they proceed to
throw another combination involving a low blow and another strike at eye level while facing their right
side. They repeat the combination one more time, this time having their left arm resting against the side of
their waist by 00:00:06.85 . The video concludes with the subject initiating the
aforementioned combination again in the same direction, which they start for the last time at
00:00:07.50 , this time having their left arm raised to their head in a defensive position.
The camera remains static, with the subject placed in the middle of the frame the entire time. The scene is
lit using a single warm elliptical light source in the background.
Comment: Well detailed and structured annotation with good timestamped actions and an objective
tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/285583-clip_00000000_1.mp4
A woman in her twenties with a fair skin tone and golden brown hair tied behind her head swims toward
the foreground right of the frame wearing a black scuba diving suit and yellow gloves. The oxygen
cylinder behind her back has a pale yellow shade. She holds a small metallic silver rectangular object
between her left index finger and thumb. The flaps on her feet are also black. She is visible in the middle of
the screen at 00:00:00.00 in a horizontal swimming position with her body extending from
the background toward the foreground right and her face turned toward the camera. Water bubbles are
visible rising above her head from both sides of her face.
Another person wearing the same is visible in the background. They are also swimming toward the
camera.
The surrounding is filled with bluish water with a rocky bottom covering most of the screen in the
foreground in the bottom half and the middle ground in the top half. They gradually become blurred
toward the background. The rocks in the foreground are visibly covered with thin patches of green algae.
Small gray fishes swim around the woman and are visible near the top edge.
As the video starts, the woman and the person in the background continue to swim and two fishes from the
top swim and come near the foreground on the right side. They have small yellow tails, a white body with a
black strip on top. The woman then starts turning her head at 00:00:00.98 to look at the
fishes. Her eyes then follow the movement of a fish that starts swimming toward the bottom from the front
of the woman. She also drops the object from her hand at 00:00:01.25 which is revealed to
be attached to a black stick extending from her. She then extends her right hand to touch the fish
and moves her hand near the fish such that the fish touches the back of her hand at
00:00:02.91 . The woman's gaze follows her.
The camera pans a bit toward the left at 00:00:03.83 and then again starts panning to the
right while moving a bit upward. The woman stops swimming and stays at one place from
00:00:04.33 as she looks at the camera. She is very close to the camera and is visible
mostly on the left half of the screen at the end at 00:00:05.11 .
Comment: well detailed and accurate annotation with great use of timestamps on actions, clear camera
description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/208037-clip_00000000_0.mp4
The video shows an older woman in an outdoor setting surrounded by various flowers and green plants.
The flowers and vases, in shades of brown, black, and gray, are lined up on the right side of the screen
against a dark-framed wall. Some flowers are nestled among green plant stems, with white, pink, and a
hint of purple flowers visible. Behind the woman, there’s a brown brick wall and a gray door frame on the
left, leading to another area with billowing plants.
The woman has short white hair and is dressed in an off-white shirt with buttons down the front, paired
with blue jeans. She wears silver dangling earrings in her right ear. Initially bending over, she straightens
up, at pulls away from a black vase, and at 00:00:01.60 places her left hand on a small
brown vase. At 00:00:02.77 she touches a white flower with her left hand and another with
her right, smiling as she admires them. Finally, at 00:00:08.40 she stretches her right hand
to caress one of the green leaves.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/239838-clip_00000000_0.mp4
A woman walks on a flat sandy terrain outdoors. She wears a wide-brimmed hat, sunglasses, a face mask
pulled below her chin, a rust-colored jacket over a pink t-shirt, and a backpack which she is wearing on her
shoulders. She also has long dark hair.
The background features a desert-like surface with vehicle tire marks behind her. On the left side of the
frame, a parked silver SUV with a black roof cargo bag is visible. The sun is low on the horizon behind her,
creating natural lighting and casting shadows on the ground.
At 00:00:00.00 she starts walking forward while holding her backpack straps.
At 00:00:01.79 she turns her head slightly to the right while continuing to walk. She continues
to turn her head until she looks almost behind her, with the sun now reflecting on her face, at
00:00:07.33.
The camera remains static, capturing the subject in a wide shot for the duration of the video. The scene is
illuminated by the sunlight emitted from the setting sun.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/155361-clip_00000000_0.mp4
The scene is set inside a gymnasium with large window panels on the upper walls, allowing ample sunlight
to stream in, creating a cooler tone. Below the windows are two wall-mounted air conditioners positioned
far apart, and an LED scoreboard displays a score of 12:11. The floor of the gymnasium is painted blue,
with a green area in the middle. The two subjects are fencers in standard white fencing attire, engaged in a
duel, each having a cable attached to the back of their attire.
The video begins with the fencers standing face to face, legs wide apart, holding up their foils and looking
for an opportunity to strike. The fencer on the right, whose body is facing the camera, is advancing toward
the fencer on the left, who is facing away from the camera and gently stepping back. At the
00:00:01.67 mark, the fencer on the left lifts his foil, and at 00:00:02.14 ,
both fencers lower their foils to feint at each other. The fencer on the right has advanced further to the
left of the frame, while the fencer on the left continues to retreat. At 00:00:04.42 , the
fencer on the left raises his foil, while the fencer on the right lowers his once again. At
00:00:04.57 , the fencer on the right retreats with his lowered foil and strikes forward,
moving his body ahead and taking a wide step to make contact with his rival. The fencer on the left lowers
his foil in an attempt to parry the attack, but ultimately fails.
The camera dollies to the left of the frame at a constant speed, tracking the movement of the subjects.
The environment is naturally lit due to the sunlight coming through the windows.
Comment: Well detailed movement breakdown, timestamps well placed and camera clearly described.
Bad task examples:
1. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/177736-clip_00000000_0.mp4
"00:00:00:01 The scene shows a fair skinned woman with blonde long hair in a brightly lit
room looking down at her phone during the day.
She is wearing a black sleeveless dress and holding a phone with a brown pouch on her right hand.
00:00:00:04She is seen at the middle of the screen bending her head downward leaning a
bit more into the phone.
00:00:01:74She moves a little bit backing out of the phone
00:00:02:96She tilts her head to the left side of the screen and is seen smiling at the phone
00:00:05:92She tilts her head back to the right side of the screen and holds the phone with
both her hands.
Sunlight can be seen at the back of the scene."
Comment: The first timestamp is incorrect because it doesn't correspond to an actual action, it should
only be used to mark specific moments when something happens.
In addition, the caption misses several key visual details. It does not mention the blurry background or the
object visible on the wall. There's also no description of the camera work, such as the angle, movement, or
framing, which are important for understanding how the scene is visually presented.
Important visual elements are also omitted, including the light reflecting on her face, her eye color, and her
eyelashes. Furthermore, the caption fails to note the moment when she closes her mouth, which is a clear
and noticeable action that should have been included.
2. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/347440-clip_00000000_0.mp4
00:00:04.54 Close-up of a street-side coffee vendor, visible only from the midsection. He’s
wearing a blue shirt with black buttons. With his right hand, he pours freshly brewed coffee into a small
white wine cup wrapped in crystals that he's holding with his left hand. In the bottom right corner of the
frame, an ornate coffeepot with a crystal lid and a small ornate plate are visible.
Comment: The caption is too short and omits several important visual details. For instance,
there's no mention of the man's blue jeans, his dark blue, short-sleeve shirt with dotted patterns,
or his hairy hands—all of which are clearly visible and contribute to the scene’s realism.
Additionally, key background elements are missing: the blurry area to his left reveals a circular
floor design, and there's a stainless steel coffee stand that goes unmentioned. The small coffee
cup is described inadequately—it features a white plastic upper part, a shiny, crystalline lower
part, a scooping spoon, and a distinct gold band separating the two sections, but none of this is
captured in the caption.
To his right, there’s a truncated stainless steel bowl resting on a brown wooden stool, and in front
of it, a black, wrapped cloth becomes visible as the camera pans. These are also excluded from
the description.
The caption also lacks an action timestamp to indicate when the man is scooping coffee into the
cup, which is a key moment. Finally, it fails to mention the natural lighting conditions and a
daytime scene. The timestamp in the caption does not point to an action.
3. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/309939-clip_00000000_1.mp4
00:00:00.00 The right profile of a brown-skinned boy is seen on the left side of the frame,
being instructed on how to play the violin. He is wearing a short-sleeved navy blue collared shirt and is
visible only from his chest to his brows. In the slightly blurry background, a mid-sized plant sits in front of a
large window. Everything is painted white. Outside the window, green trees are visible.
00:00:01.55 A woman's hands appear in the bottom right corner of the frame, gently
assisting the boy's right hand as he uses the bow on the violin.
Comment: The first timestamp does not correspond to any specific action. Additionally, it is
inaccurate to describe the boy as brown-skinned; he is fair or light-skinned. It's also subjective to
assume the hands shown belong to a woman when there’s no clear indication of gender; it is more
appropriate to refer to them simply as fair-skinned hands.
Several important visual and contextual details are missing. There is no mention of the violin's
placement, the finger movements of the boy’s left hand, or his focused gaze on the instrument
while playing. The camera movement, angle, and framing are also left out, which are important
for understanding how the scene is presented.
The caption omits the actions of the other person’s hands as well: at first, both hands come into
view holding the bow; then, the right hand is removed, and the left hand remains on the bow,
assisting him in practicing until the video ends.
Additional descriptive elements are also missing, such as the boy’s hair length or color, the
natural lighting in the room through the window, and the color of the violin.
Use these task examples for reference on how you will execute the tasks. Follow the order; the first sentence should summarize the video, including setting and featuring character, then the main subject, subject's skin tone and ethnicity, full dressing code of the subjects, action done by the subject, background information, lighting, and camera positioning. I will upload a screenshot for you to start. The lighting and camera angle should be in one line each.
Kindly say if it's a man, a woman, or a child. Do not use "what appears to be" or "appears to be," or "suggesting" in your explanations, as this shows a lack of confidence.
Task Examples
Good task examples:
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/247758-clip_00000000_0.mp4
"The setting is outdoors, surrounded by lush greenery, including many plants and trees in the background.
The main focus is a woman and a man.
A woman with thick black hair is in the foreground, wearing a white top, a silver colored chain, and
a surgical mask. Two moles are visible on her face, one at the edge of her right eyebrow and another on
the right side of her nose near her right eye above the mask. Behind her, slightly to her left, a man with
blonde hair, black framed glasses, and a blue checkered shirt is also wearing a surgical mask. He stands
close behind her left side with his right hand resting on her right shoulder.
At 00:00:00.71 , the man first looks at the woman before turning his gaze toward the
camera. His right hand is already on her shoulder. He then slowly moves his hand slightly to the right and
gently clenches her shoulder at 00:00:06.40 , and at 00:00:06.44 , he blinks
his eyes. Both are now looking at the camera.
The lighting is natural daylight. The video starts with a close-up shot, and the camera slightly tilts
upward."
Comment: This annotation is clear and objective, includes detailed descriptions of characters and their
appearance, positioning and movements. Has precise timestamps for actions and lighting conditions and
camera movement are well described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/337407-clip_00000000_1.mp4
"The video takes place in a bright green background with even lighting, capturing a woman dancing
throughout. The focus is on a young woman with long, wavy, light brown hair. She wears an oversized white
long-sleeve shirt, round earrings and interacts with her hair while posing. At various points in the video, she
looks up and smiles.
At 00:00:00.06 , she raises her left hand and touches her hair with her fingers. Then, at
00:00:01.23 , she raises her right hand, touches her shoulder, moves her right hand through
her hair, and returns her left hand to her side at 00:00:02.12 . She begins jumping four
times, moving both hands up and down alternately and smiling. At 00:00:02.95 , she looks
directly at the camera and flips her hair upward with both hands. At 00:00:03.77 , she looks
up and stretches her arms above shoulder level and pucker whistling, lowering them below head level at
00:00:04.65 . Her movements gently reflect her shadow on the background.
The camera remains static with a medium shot angle throughout the video. The lighting conditions are
artificial indoor lighting."
Comment: clear and detailed action breakdown with accurate timestamps and no subjectivity. Well
described lighting condition and camera movements.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/254480-clip_00000000_0.mp4
The scene is set inside a building that appears to be a doctor's office. In the background are large windows
covered by translucent blinds, through which sunlight enters and illuminates the space. Beyond the
windows, there are faint silhouettes of skyscrapers that are blurred out. In the middle of the windows is a
pillar covered in red brick tiles. There are two female subjects present in the frame, both of whom are
wearing flu masks and dark blue sanitary gloves, and have slightly tanned skin tones with dark hair.
The first subject on the left has shoulder-length straight black hair and is wearing a brown button-down
shirt. She is clutching a silver tablet device with a black case and a dark blue piece of paper in her right
arm. Throughout the video, she talks to the subject on the right while looking at her and briefly looks away
from the second subject at 00:00:04.30 until 00:00:05.96 without changing
the direction of her head, referring to a white piece of paper by signaling at it with her hand.
At 00:00:02.18 to 00:00:03.02 , the subject on the left gently nods her head.
She also shrugs her left shoulder slightly while nodding her head at 00:00:04.30 . The
second subject is wearing a pastel pink button-down shirt with a white lab coat. She has curly hair styled in
a single ponytail and is holding up two pieces of paper in one hand, looking at their contents throughout
the video while the first subject talks. Both subjects have neutral facial expressions, which can be inferred
from their eyes and eyebrows, as their faces are covered with masks.
The camera is moving towards the left side of the frame at a fixed height, capturing the subjects above the
waist. The scene is lit by the natural sunlight coming through the windows.
Comment: well structured and detailed annotation with good timestamps and camera description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/365681-clip_00000000_0.mp4
The scene is set outdoors in a forest filled with tall green trees and a pale blue twilight sky. The subject is a
young white blonde woman with blue eyes who is wearing a white sleeveless dress. The scene begins with
the woman facing forward and walking ahead, with the camera capturing her from the back at an angle
biased toward her left side.
At 00:00:00.50 , the woman begins to turn back to look over her left shoulder. At
00:00:03.12 , she has turned her body almost fully around to look back and slightly above
her eye level, wearing an anxious expression that conveys fear. At 00:00:03.86 , she begins
to turn back, and by 00:00:06.12 , she has completed turning her head forward again, with
her hair flowing in the direction of her head movement, and she starts to pick up the pace. As she begins
walking faster, her loose hair flows and bounces, reflecting her hurried manner of walking.
The lighting is natural, and the camera follows the woman's movement, capturing her only from the waist
up. The background is blurred with a bokeh effect.
Comment: Strong visual detail, with clear timestamps and camera movement.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/131239-clip_00000000_0.mp4
The scene is set in the afternoon in a jungle, featuring a paved path slightly covered with rocks and
surrounded by tall trees and short grass, with sunlight filtering through the foliage. The video begins with
the subject out of frame. The subject is a thin white female with dark, short, and curly hair, running for
exercise while wearing black capri pants and a light maroon long-sleeve t-shirt, along with gray sneakers
and white socks.
At the 00:00:00.58 mark, the subject enters the frame from the left side, running in the
same direction as the camera. Upon entering the frame, the subject gradually overtakes the camera, and
the distance between her and the camera increases as the video progresses.
Comment: Clear and focused description of environment and movement, timestamps well used and
camera movement well described in relation to the subject.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/253997-clip_00000000_0.mp4
The scene begins with a female subject who is in her early twenties, is thin, and has light skin and dark hair
styled in a ponytail that reaches the length of her back. She is wearing beige trousers and a black and
white Aztec diamond-patterned overshirt over a white T-shirt, along with silver drop earrings. She stands
beside a river that flows parallel to the direction of the camera. On the left side of the frame is the subject,
who is standing atop small grass-covered rock formations and tree roots, with foliage of varying heights in
the background and a tree about 8 feet behind her. The right side of the frame is dominated by the flow of
the river, which is pale blue in color and has a rock protruding from the surface. In the distance, on the
ground of the right side of the frame next to the river, there is some foliage.
As the video progresses, the foliage sways slightly due to the wind, along with the overshirt of the female
subject. The subject displays an expression of calmness and pleasure in the clip, momentarily closing her
eyes as well. At the 00:00:02.39 mark, she drops her hand, which was originally over her
head, and looks towards the camera. At 00:00:05.86 , she raises her hand to her collarbone
while tilting her head and closing her eyes. Afterwards, she lowers her hand and wraps it around her waist.
During the length of the video, her right hand remains stationary and rests beside her leg. At
00:00:05.91 , as the camera pans to the right, a second rock protruding from within the
river comes into view.
The camera is positioned at a fixed distance from the subject but occasionally pans around and changes
slightly. The scene is illuminated primarily by warm sunlight, as well as by light reflected off the water.
Comment: Highly detailed and well structured annotation, with clear timestamped actions, and accurate
camera behavior well described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/199735-clip_00000000_0.mp4
The scene takes place indoors in an office, featuring a white man in his forties with gray hair and a short
beard. He is wearing a blue, long-sleeve button-down shirt and gray pants, and he is wearing glasses. He is
seated at his gray desk, operating a desktop computer with his left hand while taking notes with his right
hand. A metal table lamp, which is lit, is positioned to the right of the monitor. In the background, a glass
wall consists of white beams serving as panels, through which another office, mostly obscured and
illuminated with neon blue light, can be seen. A white blonde woman is seated at her desk, facing to the
right of the frame.
Between 00:00:00.00 and 00:00:01.42 , he looks back and forth between
his notebook and the monitor before focusing on the computer screen. Over his right shoulder, on the glass
wall in the background, is a sketch of the front of a car, while over his left shoulder is a sketch of the top
view of the car from an angle. At 00:00:07.10 , he stops taking notes and presses a single
key on the keyboard, looking at the camera with a neutral expression.
The majority of the lighting is dim and artificial, with a blue hue, likely projecting from ceiling lights.
Additionally, the scene is illuminated by a table lamp which is projecting warm light on to the desk of the
main subject. Further, there are neon blue light fixtures in the background that are adding illumination. The
camera zooms in slowly throughout the scene.
Comment : detailed and objective description with well used timestamps, great camera and lighting
condition description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/376414-clip_00000000_0.mp4
The scene is set at sunset in a forest with dense trees in the background, featuring a patch of trees on the
left side of the frame that has a few bright yellow leaves. Small clouds of smoke linger behind the subject,
who is a bald, slender, gray-haired monk in his 50s, wearing a deep red Buddhist robe. The monk is seated
still with his eyes closed, wearing a neutral expression while meditating.
He maintains this position for the duration of the clip, while the camera gradually moves closer to him as
the clip progresses.
The camera captures only the monk from the waist up and zooms in at an upward angle. The scene is
illuminated by the warm natural sunlight from sunset through the trees and leaves in the background.
Comment: Strong environmental detail and lighting description, camera movement clearly described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/374764-clip_00000000_0.mp4
The scene is set on a street surrounded by various buildings. The off-white building on the left side of the
frame features revivalist architecture and extends into the distance, while the one on the right has modern
architecture and is also off-white in color. The subject is Black, with curly hair and a high fade haircut. He is
wearing a thin grey turtleneck sweater and a black backpack with yellow and black steel zippers, along
with a leather strap chronograph wristwatch and a face mask. In the distance, there are modern
skyscrapers primarily made of glass.
The video begins with the subject looking down at the screen of his phone, which he is holding with both
hands while typing. He is standing next to the upper part of the stairway leading to the subway below.
There is a black sign above the staircase indicating the name of the station, i.e., 34 St-Penn Station, in
white and blue font. The middle rail of the stairway is slightly visible on the left side of the frame.
At the 00:00:03.05 mark, a man wearing a white and blue polo shirt and a white hat starts
to walk across the bridge, on the wall of which the aforementioned sign is attached. At the
00:00:05.97 mark, a steel grey hatchback drives across the road into the distance from the
right side of the frame.
The camera is handheld at a fixed position, capturing the subject from the waist up. In terms of lighting,
the environment is naturally sunlit with overcast clouds.
Comment: Great details for the character and well timestamped actions, camera and lighting condition
are described clearly.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/378498-clip_00000000_0.mp4
The scene begins with a young Black man standing and clutching thin window bars in a dark, poorly lit
room, looking through them at the outdoor environment that appears to contain trees in the distance
beyond a large empty courtyard. He is standing in the right half of the frame. The subject is wearing a
raglan half-sleeve T-shirt with a white torso and black sleeves, along with black true wireless earphones.
He has short, curly black hair and a low fade haircut, and a neutral expression on his face as he peers
through the window bars.
At the 00:00:02.03 mark, the camera moves forward enough to crop out the background
of the interior while getting closer to the subject, who is only visible from above the shoulders, along with
his forearms and hands.
The camera is positioned to the left side of the subject. The scene is primarily illuminated by sunlight and
the white tube light in the interior.
Comment : Well balanced annotation, objective and great character description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/275926-clip_00000000_1.mp4
The video shows a person on the left side of the screen, kneeling on a wooden plank surface as he plants
tubers into the dark soil. He is wearing blue jeans, a gray knitted sweater, and black gloves. Beside his knee
is a white container holding the tubers. The soil, positioned on the right side of the screen, has been dug
into a trench, forming a small mound along the edge. One tuber is already planted at the far end.
At 00:00:01.10 , the person places a tuber into the soil. Then he moves his hand into the
container, picks another one, and plants it at 00:00:04.87 . He repeats this process,
steadily working along the dugout space. The camera starts off still, then gradually tilts upward, capturing
the scene.
Comment : Great scene description with accurate layout and it is objective.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/291922-clip_00000000_1.mp4
The video is set indoors in a studio, with a plain black background that has a round light source attached in
the middle to cast a warm light, making the subject appear as a silhouette. From what can be inferred, the
subject is wearing waist-high flare pants and a tucked-in long-sleeve shirt. The subject has short hair, and
in this scene, they are performing hand combat moves using a small knife.
At 00:00:00.00 , the subject is facing their body towards the camera while turned to their
left, wielding the knife in their right hand. They raise the knife upwards with an arm movement while
throwing a punch with their other hand until the 00:00:00.26 mark. Afterwards, they take
a neutral stance and proceed to swing the blade over and then under their hand to their right side while
executing an additional move in that direction with the blade at eye level at 00:00:02.54 .
At 00:00:03.89 , the subject adopts another neutral stance, keeping the blade and their
free hand close to their body while looking to the right. By 00:00:05.09 , they proceed to
throw another combination involving a low blow and another strike at eye level while facing their right
side. They repeat the combination one more time, this time having their left arm resting against the side of
their waist by 00:00:06.85 . The video concludes with the subject initiating the
aforementioned combination again in the same direction, which they start for the last time at
00:00:07.50 , this time having their left arm raised to their head in a defensive position.
The camera remains static, with the subject placed in the middle of the frame the entire time. The scene is
lit using a single warm elliptical light source in the background.
Comment : Well detailed and structured annotation with good timestamped actions and an objective
tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/285583-clip_00000000_1.mp4
A woman in her twenties with a fair skin tone and golden brown hair tied behind her head swims toward
the foreground right of the frame wearing a black scuba diving suit and yellow gloves. The oxygen
cylinder behind her back has a pale yellow shade. She holds a small metallic silver rectangular object
between her left index finger and thumb. The flaps on her feet are also black. She is visible in the middle of
the screen at 00:00:00.00 in a horizontal swimming position with her body extending from
the background toward the foreground right and her face turned toward the camera. Water bubbles are
visible rising above her head from both sides of her face.
Another person wearing the same is visible in the background. They are also swimming toward the
camera.
The surrounding is filled with bluish water with a rocky bottom covering most of the screen in the
foreground in the bottom half and the middle ground in the top half. They gradually become blurred
toward the background. The rocks in the foreground are visibly covered with thin patches of green algae.
Small gray fishes swim around the woman and are visible near the top edge.
As the video starts, the woman and the person in the background continue to swim and two fishes from the
top swim and come near the foreground on the right side. They have small yellow tails, a white body with a
black strip on top. The woman then starts turning her head at 00:00:00.98 to look at the
fishes. Her eyes then follow the movement of a fish that starts swimming toward the bottom from the front
of the woman. She also drops the object from her hand at 00:00:01.25 which is revealed to
be attached with a black stick extending from her. She then extends her right hand to touch the fish
and moves her hand near the fish such that the fish touches the back of her hand at
00:00:02.91 . The woman's gaze follows her.
The camera pans a bit toward the left at 00:00:03.83 and then again starts panning to the
right while moving a bit upward. The woman stops swimming and stays at one place from
00:00:04.33 as she looks at the camera. She is very close to the camera and is visible
mostly on the left half of the screen at the end at 00:00:05.11 .
Comment: well detailed and accurate annotation with great use of timestamps on actions, clear camera
description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/208037-clip_00000000_0.mp4
The video shows an older woman in an outdoor setting surrounded by various flowers and green plants.
The flowers and vases, in shades of brown, black, and gray, are lined up on the right side of the screen
against a dark-framed wall. Some flowers are nestled among green plant stems, with white, pink, and a
hint of purple flowers visible. Behind the woman, there’s a brown brick wall and a gray door frame on the
left, leading to another area with billowing plants.
The woman has short white hair and is dressed in an off-white shirt with buttons down the front, paired
with blue jeans. She wears silver dangling earrings in her right ear. Initially bending over, she straightens
up, at pulls away from a black vase, and at 00:00:01.60 places her left hand on a small
brown vase. At 00:00:02.77 she touches a white flower with her left hand and another with
her right, smiling as she admires them. Finally, at 00:00:08.40 she stretches her right hand
to caress one of the green leaves.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/239838-clip_00000000_0.mp4
A woman walks on a flat sandy terrain outdoors. She wears a wide-brimmed hat, sunglasses, a face mask
pulled below her chin, a rust-colored jacket over a pink t-shirt, and a backpack which she is wearing on her
shoulders. She also has long dark hair.
The background features a desert-like surface with vehicle tire marks behind her. On the left side of the
frame, a parked silver SUV with a black roof cargo bag is visible. The sun is low on the horizon behind her,
creating natural lighting and casting shadows on the ground.
At 00:00:00.00 she starts walking forward while holding her backpack straps.
At 00:00:01.79 she turns her head slightly to the right while continuing to walk. She continues
to turn her head until she looks almost behind her, with the sun now reflecting on her face, at
00:00:07.33 .
The camera remains static, capturing the subject in a wide shot for the duration of the video. The scene is
illuminated by the sunlight emitted from the setting sun.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/155361-clip_00000000_0.mp4
The scene is set inside a gymnasium with large window panels on the upper walls, allowing ample sunlight
to stream in, creating a cooler tone. Below the windows are two wall-mounted air conditioners positioned
far apart, and an LED scoreboard displays a score of 12:11. The floor of the gymnasium is painted blue,
with a green area in the middle. The two subjects are fencers in standard white fencing attire, engaged in a
duel, each having a cable attached to the back of their attire.
The video begins with the fencers standing face to face, legs wide apart, holding up their foils and looking
for an opportunity to strike. The fencer on the right, whose body is facing the camera, is advancing toward
the fencer on the left, who is facing away from the camera and gently stepping back. At the
00:00:01.67 mark, the fencer on the left lifts his foil, and at 00:00:02.14 ,
both fencers lower their foils to feint at each other. The fencer on the right has advanced further to the
left of the frame, while the fencer on the left continues to retreat. At 00:00:04.42 , the
fencer on the left raises his foil, while the fencer on the right lowers his once again. At
00:00:04.57 , the fencer on the right retreats with his lowered foil and strikes forward,
moving his body ahead and taking a wide step to make contact with his rival. The fencer on the left lowers
his foil in an attempt to parry the attack, but ultimately fails.
The camera dollies to the left of the frame at a constant speed, tracking the movement of the subjects.
The environment is naturally lit due to the sunlight coming through the windows.
Comment: Well detailed movement breakdown, timestamps well placed and camera clearly described.
Bad task examples:
1. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/177736-clip_00000000_0.mp4
"00:00:00:01 The scene shows a fair skinned woman with blonde long hair in a brightly lit
room looking down at her phone during the day.
She is wearing a black sleeveless dress and holding a phone with a brown pouch on her right hand.
00:00:00:04She is seen at the middle of the screen bending her head downward leaning a
bit more into the phone.
00:00:01:74She moves a little bit backing out of the phone
00:00:02:96She tilts her head to the left side of the screen and is seen smiling at the phone
00:00:05:92She tilts her head back to the right side of the screen and holds the phone with
both her hands.
Sunlight can be seen at the back of the scene."
Comment: The first timestamp is incorrect because it doesn't correspond to an actual action, it should
only be used to mark specific moments when something happens.
In addition, the caption misses several key visual details. It does not mention the blurry background or the
object visible on the wall. There's also no description of the camera work, such as the angle, movement, or
framing, which are important for understanding how the scene is visually presented.
Important visual elements are also omitted, including the light reflecting on her face, her eye color, and her
eyelashes. Furthermore, the caption fails to note the moment when she closes her mouth, which is a clear
and noticeable action that should have been included.
2. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/347440-clip_00000000_0.mp4
00:00:04.54 Close-up of a street-side coffee vendor, visible only from the midsection. He’s
wearing a blue shirt with black buttons. With his right hand, he pours freshly brewed coffee into a small
white wine cup wrapped in crystals that he's holding with his left hand. In the bottom right corner of the
frame, an ornate coffeepot with a crystal lid and a small ornate plate are visible.
Comment: The caption is too short and omits several important visual details. For instance,
there's no mention of the man's blue jeans, his dark blue, short-sleeve shirt with dotted patterns,
or his hairy hands—all of which are clearly visible and contribute to the scene’s realism.
Additionally, key background elements are missing: the blurry area to his left reveals a circular
floor design, and there's a stainless steel coffee stand that goes unmentioned. The small coffee
cup is described inadequately—it features a white plastic upper part, a shiny, crystalline lower
part, a scooping spoon, and a distinct gold band separating the two sections, but none of this is
captured in the caption.
To his right, there’s a truncated stainless steel bowl resting on a brown wooden stool, and in front
of it, a black, wrapped cloth becomes visible as the camera pans. These are also excluded from
the description.
The caption also lacks an action timestamp to indicate when the man is scooping coffee into the
cup, which is a key moment. Finally, it fails to mention the natural lighting conditions and a
daytime scene. The timestamp in the caption does not point to an action.
3. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/309939-clip_00000000_1.mp4
00:00:00.00 The right profile of a brown-skinned boy is seen on the left side of the frame,
being instructed on how to play the violin. He is wearing a short-sleeved navy blue collared shirt and is
visible only from his chest to his brows. In the slightly blurry background, a mid-sized plant sits in front of a
large window. Everything is painted white. Outside the window, green trees are visible.
00:00:01.55 A woman's hands appear in the bottom right corner of the frame, gently
assisting the boy's right hand as he uses the bow on the violin.
Comment: The first timestamp does not correspond to any specific action. Additionally, it is
inaccurate to describe the boy as brown-skinned; he is fair or light-skinned. It's also subjective to
assume the hands shown belong to a woman when there’s no clear indication of gender; it's more
appropriate to refer to them simply as fair-skinned hands.
Several important visual and contextual details are missing. There is no mention of the violin's
placement, the finger movements of the boy’s left hand, or his focused gaze on the instrument
while playing. The camera movement, angle, and framing are also left out, which are important
for understanding how the scene is presented.
The caption omits the actions of the other person’s hands as well: at first, both hands come into
view holding the bow; then, the right hand is removed, and the left hand remains on the bow,
assisting him in practicing until the video ends.
Additional descriptive elements are also missing, such as the boy’s hair length or color, the
natural lighting in the room through the window, and the color of the violin.
Follow the order; the first sentence should summarize the video, including setting and featuring character, then the main subject, subject's skin tone and ethnicity, full dressing code of the subjects, action done by the subject, background information, lighting, and camera positioning. I will upload a screenshot for you to start. The lighting and camera angle should be in one line each.
Kindly say if it's a man, a woman, or a child. Do not use "what appears to be" or "appears to be," or "suggesting" in your explanations, as this shows a lack of confidence.">
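Two of the rules above are mechanical enough to check automatically. The following is a minimal sketch, not part of the original guideline: it assumes that the `HH:MM:SS.ff` timestamp form used in the good examples is the expected one, and it takes the three discouraged phrases directly from the instruction above. The function name `lint_caption` and the message wording are hypothetical.

```python
import re

# Assumption: the expected timestamp shape is the one seen in the good
# examples, e.g. 00:00:05.86 (a dot before the final two digits).
EXPECTED_TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{2}")

# Any timestamp-like token, including the all-colon form (00:00:00:04)
# that appears in the first bad example.
TIMESTAMP_LIKE = re.compile(r"\d{2}:\d{2}:\d{2}[.:]\d{2}")

# Phrases the instruction above asks annotators to avoid. The longest
# alternative comes first so it is reported once, not twice.
HEDGING = re.compile(r"what appears to be|appears to be|suggesting", re.IGNORECASE)


def lint_caption(caption):
    """Return a list of human-readable problems found in a caption."""
    problems = []
    for match in HEDGING.finditer(caption):
        problems.append(f"hedging phrase: {match.group(0).lower()!r}")
    for match in TIMESTAMP_LIKE.finditer(caption):
        token = match.group(0)
        if not EXPECTED_TIMESTAMP.fullmatch(token):
            problems.append(f"unexpected timestamp format: {token}")
    return problems


if __name__ == "__main__":
    sample = "00:00:00:04She is holding what appears to be a phone."
    for problem in lint_caption(sample):
        print(problem)
```

A check like this only catches surface issues. Whether a timestamp actually marks an action, or whether visual details are missing, still needs a human reviewer.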
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/199735-clip_000000
00_0.mp4
The scene takes place indoors in an oice, featuring a white man in his forties with gray hair and a short
beard. He is wearing a blue, long-sleeve buon-down shirt and gray pants, and he is wearing glasses. He is
seated at his gray desk, operating a desktop computer with his left hand while taking notes with his right
hand. A metal table lamp, which is lit, is positioned to the right of the monitor. In the background, a glass
wall consists of white beams serving as panels, through which another oice, mostly obscured and
illuminated with neon blue light, can be seen. A white blonde woman is seated at her desk, facing to the
right of the frame.
Between 00:00:00.00 and 00:00:01.42 , he looks back and forth between
his notebook and the monitor before focusing on the computer screen. Over his right shoulder, on the glass
wall in the background, is a sketch of the front of a car, while over his left shoulder is a sketch of the top
view of the car from an angle. At 00:00:07.10 , he stops taking notes and presses a single
key on the keyboard, looking at the camera with a neutral expression.
The majority of the lighting is dim and artificial, with a blue hue, likely projecting from ceiling lights.
Additionally, the scene is illuminated by a table lamp which is projecting warm light on to the desk of the
main subject. Further, there are neon blue light fixtures in the background that are adding illumination. The
camera zooms in slowly throughout the scene.
Comment : detailed and objective description with well used timestamps, great camera and lighting
condition description.
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/376414-clip_000000
00_0.mp4
The scene is set at sunset in a forest with dense trees in the background, featuring a patch of trees on the
left side of the frame that has a few bright yellow leaves. Small clouds of smoke linger behind the subject,
who is a bald, slender, gray-haired monk in his 50s, wearing a deep red Buddhist robe. The monk is seated
still with his eyes closed, wearing a neutral expression while meditating.
He maintains this position for the duration of the clip, while the camera gradually moves closer to him as
the clip progresses.-
The camera captures only the monk from the waist up and zooms in at an upward angle. The scene is
illuminated by the warm natural sunlight from sunset through the trees and leaves in the background.
Comment: Strong environmental detail and lighting description, camera movement clearly described.
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/374764-clip_000000
00_0.mp4
The scene is set on a street surrounded by various buildings. The o-white building on the left side of the
frame features revivalist architecture and extends into the distance, while the one on the right has modern
architecture and is also o-white in color. The subject is Black, with curly hair and a high fade haircut. He is
wearing a thin grey turtleneck sweater and a black backpack with yellow and black steel zippers, along
with a leather strap chronograph wristwatch and a face mask. In the distance, there are modern
skyscrapers primarily made of glass.
The video begins with the subject looking down at the screen of his phone, which he is holding with both
hands while typing. He is standing next to the upper part of the stairway leading to the subway below.
There is a black sign overhead the staircase indicating the name of the station, i.e., 34 St-Penn Station, in
white and blue font. The middle rail of the stairway is slightly visible on the left side of the frame.
At the 00:00:03.05 mark, a man wearing a white and blue polo shirt and a white hat starts
to walk across the bridge, on the wall of which the aforementioned sign is aached. At the
00:00:05.97 mark, a steel grey hatchback drives across the road into the distance from the
right side of the frame.
The camera is handheld at a fixed position, capturing the subject from the waist up. In terms of lighting,
the environment is naturally sunlit with overcast clouds.
Comment: Great details for the character and well timestamped actions, camera and lighting condition
are described clearly.
hp://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/378498-clip_000000
00_0.mp4
The scene begins with a young Black man standing and clutching thin window bars in a dark, poorly lit
room, looking through them at the outdoor environment that appears to contain trees in the distance
beyond a large empty courtyard. He is standing in the right half of the frame. The subject is wearing a
raglan half-sleeve T-shirt with a white torso and black sleeves, along with black true wireless earphones.
He has short, curly black hair and a low fade haircut, and a neutral expression on his face as he peers
through the window bars.
At the 00:00:02.03 mark, the camera moves forward enough to crop out the background
of the interior while getting closer to the subject, who is only visible from above the shoulders, along with
his forearms and hands.
The camera is positioned to the left side of the subject. The scene is primarily illuminated by sunlight and
the white tube light in the interior.
Comment : Well balanced annotation, objective and great character description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/275926-clip_00000000_1.mp4
The video shows a person on the left side of the screen, kneeling on a wooden plank surface as he plants
tubers into the dark soil. He is wearing blue jeans, a gray knitted sweater, and black gloves. Beside his knee
is a white container holding the tubers. The soil, positioned on the right side of the screen, has been dug
into a trench, forming a small mound along the edge. One tuber is already planted at the far end.
At 00:00:01.10 , the person places a tuber into the soil. Then he moves his hand into the
container, picks another one, and plants it at 00:00:04.87 . He repeats this process,
steadily working along the dugout space. The camera starts off still, then gradually tilts upward, capturing
the scene.
Comment : Great scene description with accurate layout and it is objective.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/291922-clip_00000000_1.mp4
The video is set indoors in a studio, with a plain black background that has a round light source attached in
the middle to cast a warm light, making the subject appear as a silhouette. From what can be inferred, the
subject is wearing waist-high flare pants and a tucked-in long-sleeve shirt. The subject has short hair, and
in this scene, they are performing hand combat moves using a small knife.
At 00:00:00.00 , the subject is facing their body towards the camera while turned to their
left, wielding the knife in their right hand. They raise the knife upwards with an arm movement while
throwing a punch with their other hand until the 00:00:00.26 mark. Afterwards, they take
a neutral stance and proceed to swing the blade over and then under their hand to their right side while
executing an additional move in that direction with the blade at eye level at 00:00:02.54 .
At 00:00:03.89 , the subject adopts another neutral stance, keeping the blade and their
free hand close to their body while looking to the right. By 00:00:05.09 , they proceed to
throw another combination involving a low blow and another strike at eye level while facing their right
side. They repeat the combination one more time, this time having their left arm resting against the side of
their waist by 00:00:06.85 . The video concludes with the subject initiating the
aforementioned combination again in the same direction, which they start for the last time at
00:00:07.50 , this time having their left arm raised to their head in a defensive position.
The camera remains static, with the subject placed in the middle of the frame the entire time. The scene is
lit using a single warm elliptical light source in the background.
Comment : Well detailed and structured annotation with good timestamped actions and an objective
tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/285583-clip_00000000_1.mp4
A woman in her twenties with a fair skin tone and golden brown hair tied behind her head swims toward
the foreground right of the frame wearing a black scuba diving suit and yellow gloves. The oxygen
cylinder behind her back has a pale yellow shade. She holds a small metallic silver rectangular object
between her left index finger and thumb. The flaps on her feet are also black. She is visible in the middle of
the screen at 00:00:00.00 in a horizontal swimming position with her body extending from
the background toward the foreground right and her face turned toward the camera. Water bubbles are
visible rising above her head from both sides of her face.
Another person wearing the same is visible in the background. They are also swimming toward the
camera.
The surrounding is filled with bluish water with a rocky bottom covering most of the screen in the
foreground in the bottom half and the middle ground in the top half. They gradually become blurred
toward the background. The rocks in the foreground are visibly covered with thin patches of green algae.
Small gray fishes swim around the woman and are visible near the top edge.
As the video starts, the woman and the person in the background continue to swim and two fishes from the
top swim and come near the foreground on the right side. They have small yellow tails, a white body with a
black strip on top. The woman then starts turning her head at 00:00:00.98 to look at the
fishes. Her eyes then follow the movement of a fish that starts swimming toward the bottom from the front
of the woman. She also drops the object from her hand at 00:00:01.25 which is revealed to
be attached with a black stick extending from her. She then extends her right hand to touch the fish
and moves her hand near the fish such that the fish touches the back of her hand at
00:00:02.91 . The woman's gaze follows her.
The camera pans a bit toward the left at 00:00:03.83 and then again starts panning to the
right while moving a bit upward. The woman stops swimming and stays at one place from
00:00:04.33 as she looks at the camera. She is very close to the camera and is visible
mostly on the left half of the screen at the end at 00:00:05.11 .
Comment: well detailed and accurate annotation with great use of timestamps on actions, clear camera
description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/208037-clip_00000000_0.mp4
The video shows an older woman in an outdoor setting surrounded by various flowers and green plants.
The flowers and vases, in shades of brown, black, and gray, are lined up on the right side of the screen
against a dark-framed wall. Some flowers are nestled among green plant stems, with white, pink, and a
hint of purple flowers visible. Behind the woman, there’s a brown brick wall and a gray door frame on the
left, leading to another area with billowing plants.
The woman has short white hair and is dressed in an off-white shirt with buttons down the front, paired
with blue jeans. She wears silver dangling earrings in her right ear. Initially bending over, she straightens
up, at pulls away from a black vase, and at 00:00:01.60 places her left hand on a small
brown vase. At 00:00:02.77 she touches a white flower with her left hand and another with
her right, smiling as she admires them. Finally, at 00:00:08.40 she stretches her right hand
to caress one of the green leaves.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/239838-clip_00000000_0.mp4
A woman walks on a flat sandy terrain outdoors. She wears a wide-brimmed hat, sunglasses, a face mask
pulled below her chin, a rust-colored jacket over a pink t-shirt, and a backpack which she is wearing on her
shoulders. She also has long dark hair.
The background features a desert-like surface with vehicle tire marks behind her. On the left side of the
frame, a parked silver SUV with a black roof cargo bag is visible. The sun is low on the horizon behind her,
creating natural lighting and casting shadows on the ground.
At 00:00:00.00 she starts walking forward while holding her backpack straps.
At 00:00:01.79 she turns her head slightly to the right while continuing to walk. She continues
to turn her head until she looks almost behind her, with the sun now reflecting on her face, at
00:00:07.33.
The camera remains static, capturing the subject in a wide shot for the duration of the video. The scene is
illuminated by the sunlight emitted from the setting sun.
Comment: well detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/155361-clip_00000000_0.mp4
The scene is set inside a gymnasium with large window panels on the upper walls, allowing ample sunlight
to stream in, creating a cooler tone. Below the windows are two wall-mounted air conditioners positioned
far apart, and an LED scoreboard displays a score of 12:11. The floor of the gymnasium is painted blue,
with a green area in the middle. The two subjects are fencers in standard white fencing attire, engaged in a
duel, each having a cable attached to the back of their attire.
The video begins with the fencers standing face to face, legs wide apart, holding up their foils and looking
for an opportunity to strike. The fencer on the right, whose body is facing the camera, is advancing toward
the fencer on the left, who is facing away from the camera and gently stepping back. At the
00:00:01.67 mark, the fencer on the left lifts his foil, and at 00:00:02.14 ,
both fencers lower their foils to feint at each other. The fencer on the right has advanced further to the
left of the frame, while the fencer on the left continues to retreat. At 00:00:04.42 , the
fencer on the left raises his foil, while the fencer on the right lowers his once again. At
00:00:04.57 , the fencer on the right retreats with his lowered foil and strikes forward,
moving his body ahead and taking a wide step to make contact with his rival. The fencer on the left lowers
his foil in an attempt to parry the attack, but ultimately fails.
The camera dollies to the left of the frame at a constant speed, tracking the movement of the subjects.
The environment is naturally lit due to the sunlight coming through the windows.
Comment: Well detailed movement breakdown, timestamps well placed and camera clearly described.
Bad task examples:
1. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/177736-clip_00000000_0.mp4
"00:00:00:01 The scene shows a fair skinned woman with blonde long hair in a brightly lit
room looking down at her phone during the day.
She is wearing a black sleeveless dress and holding a phone with a brown pouch on her right hand.
00:00:00:04She is seen at the middle of the screen bending her head downward leaning a
bit more into the phone.
00:00:01:74She moves a little bit backing out of the phone
00:00:02:96She tilts her head to the left side of the screen and is seen smiling at the phone
00:00:05:92She tilts her head back to the right side of the screen and holds the phone with
both her hands.
Sunlight can be seen at the back of the scene."
Comment: The first timestamp is incorrect because it doesn't correspond to an actual action, it should
only be used to mark specific moments when something happens.
In addition, the caption misses several key visual details. It does not mention the blurry background or the
object visible on the wall. There's also no description of the camera work, such as the angle, movement, or
framing, which are important for understanding how the scene is visually presented.
Important visual elements are also omitted, including the light reflecting on her face, her eye color, and her
eyelashes. Furthermore, the caption fails to note the moment when she closes her mouth, which is a clear
and noticeable action that should have been included.
2. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/347440-clip_00000000_0.mp4
00:00:04.54 Close-up of a street-side coffee vendor, visible only from the midsection. He’s
wearing a blue shirt with black buttons. With his right hand, he pours freshly brewed coffee into a small
white wine cup wrapped in crystals that he's holding with his left hand. In the bottom right corner of the
frame, an ornate coffeepot with a crystal lid and a small ornate plate are visible.
Comment: The caption is too short and omits several important visual details. For instance,
there's no mention of the man's blue jeans, his dark blue, short-sleeve shirt with dotted patterns,
or his hairy hands—all of which are clearly visible and contribute to the scene’s realism.
Additionally, key background elements are missing: the blurry area to his left reveals a circular
floor design, and there's a stainless steel coffee stand that goes unmentioned. The small coffee
cup is described inadequately—it features a white plastic upper part, a shiny, crystalline lower
part, a scooping spoon, and a distinct gold band separating the two sections, but none of this is
captured in the caption.
To his right, there’s a truncated stainless steel bowl resting on a brown wooden stool, and in front
of it, a black, wrapped cloth becomes visible as the camera pans. These are also excluded from
the description.
The caption also lacks an action timestamp to indicate when the man is scooping coffee into the
cup, which is a key moment. Finally, it fails to mention the natural lighting conditions and a
daytime scene. The timestamp in the caption does not point to an action.
3. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/309939-clip_00000000_1.mp4
00:00:00.00 The right profile of a brown-skinned boy is seen on the left side of the frame,
being instructed on how to play the violin. He is wearing a short-sleeved navy blue collared shirt and is
visible only from his chest to his brows. In the slightly blurry background, a mid-sized plant sits in front of a
large window. Everything is painted white. Outside the window, green trees are visible.
00:00:01.55 A woman's hands appear in the bottom right corner of the frame, gently
assisting the boy's right hand as he uses the bow on the violin.
Comment: The first timestamp does not correspond to any specific action. Additionally, it is
inaccurate to describe the boy as brown-skinned; he is fair or light-skinned. It's also subjective to
assume the hands shown belong to a woman when there’s no clear indication of gender; it is more
appropriate to refer to them simply as fair-skinned hands.
Several important visual and contextual details are missing. There is no mention of the violin's
placement, the finger movements of the boy’s left hand, or his focused gaze on the instrument
while playing. The camera movement, angle, and framing are also left out, which are important
for understanding how the scene is presented.
The caption omits the actions of the other person’s hands as well: at first, both hands come into
view holding the bow; then, the right hand is removed, and the left hand remains on the bow,
assisting him in practicing until the video ends.
Additional descriptive elements are also missing, such as the boy’s hair length or color, the
natural lighting in the room through the window, and the color of the violin.
Use these task examples for reference on how you will execute the tasks. Follow the order; the first sentence should summarize the video, including setting and featuring character, then the main subject, subject's skin tone and ethnicity, full dressing code of the subjects, action done by the subject, background information, lighting, and camera positioning. I will upload a screenshot for you to start. The lighting and camera angle should be in one line each.
Kindly say if it's a man, a woman, or a child. Do not use "what appears to be" or "appears to be," or "suggesting" in your explanations, as this shows a lack of confidence.
Task Examples
Good task examples:
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/247758-clip_00000000_0.mp4
"The setting is outdoors, surrounded by lush greenery, including many plants and trees in the background.
The main focus is a woman and a man.
A woman with thick black hair is in the foreground, wearing a white top, a silver colored chain, and
a surgical mask. Two moles are visible on her face, one at the edge of her right eyebrow and another on
the right side of her nose near her right eye above the mask. Behind her, slightly to her left, a man with
blonde hair, black framed glasses, and a blue checkered shirt is also wearing a surgical mask. He stands
close behind her left side with his right hand resting on her right shoulder.
At 00:00:00.71 , the man first looks at the woman before turning his gaze toward the
camera. His right hand is already on her shoulder. He then slowly moves his hand slightly to the right and
gently clenches her shoulder at 00:00:06.40 , and at 00:00:06.44 , he blinks
his eyes. Both are now looking at the camera.
The lighting is natural daylight. The video starts with a close-up shot, and the camera slightly tilts
upward."
Comment: This annotation is clear and objective, includes detailed descriptions of characters and their
appearance, positioning and movements. Has precise timestamps for actions and lighting conditions and
camera movement are well described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/337407-clip_00000000_1.mp4
"The video takes place in a bright green background with even lighting, capturing a woman dancing
throughout. The focus is on a young woman with long, wavy, light brown hair. She wears an oversized white
long-sleeve shirt, round earrings and interacts with her hair while posing. At various points in the video, she
looks up and smiles.
At 00:00:00.06 , she raises her left hand and touches her hair with her fingers. Then, at
00:00:01.23 , she raises her right hand, touches her shoulder, moves her right hand through
her hair, and returns her left hand to her side at 00:00:02.12 . She begins jumping four
times, moving both hands up and down alternately and smiling. At 00:00:02.95 , she looks
directly at the camera and flips her hair upward with both hands. At 00:00:03.77 , she looks
up and stretches her arms above shoulder level and pucker whistling, lowering them below head level at
00:00:04.65 . Her movements gently reflect her shadow on the background.
The camera remains static with a medium shot angle throughout the video. The lighting conditions are
artificial indoor lighting."
Comment: clear and detailed action breakdown with accurate timestamps and no subjectivity. Well
described lighting condition and camera movements
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/254480-clip_00000000_0.mp4
The scene is set inside a building that appears to be a doctor's office. In the background are large windows
covered by translucent blinds, through which sunlight enters and illuminates the space. Beyond the
windows, there are faint silhouettes of skyscrapers that are blurred out. In the middle of the windows is a
pillar covered in red brick tiles. There are two female subjects present in the frame, both of whom are
wearing flu masks and dark blue sanitary gloves, and have slightly tanned skin tones with dark hair.
The first subject on the left has shoulder-length straight black hair and is wearing a brown button-down
shirt. She is clutching a silver tablet device with a black case and a dark blue piece of paper in her right
arm. Throughout the video, she talks to the subject on the right while looking at her and briefly looks away
from the second subject at 00:00:04.30 until 00:00:05.96 without changing
the direction of her head, referring to a white piece of paper by signaling at it with her hand.
At 00:00:02.18 to 00:00:03.02 , the subject on the left gently nods her head.
She also shrugs her left shoulder slightly while nodding her head at 00:00:04.30 . The
second subject is wearing a pastel pink button-down shirt with a white lab coat. She has curly hair styled in
a single ponytail and is holding up two pieces of paper in one hand, looking at their contents throughout
the video while the first subject talks. Both subjects have neutral facial expressions, which can be inferred
from their eyes and eyebrows, as their faces are covered with masks.
The camera is moving towards the left side of the frame at a fixed height, capturing the subjects above the
waist. The scene is lit by the natural sunlight coming through the windows.
Comments: well structured and detailed annotation with good timestamps and camera description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/365681-clip_00000000_0.mp4
The scene is set outdoors in a forest filled with tall green trees and a pale blue twilight sky. The subject is a
young white blonde woman with blue eyes who is wearing a white sleeveless dress. The scene begins with
the woman facing forward and walking ahead, with the camera capturing her from the back at an angle
biased toward her left side.
At 00:00:00.50 , the woman begins to turn back to look over her left shoulder. At
00:00:03.12 , she has turned her body almost fully around to look back and slightly above
her eye level, wearing an anxious expression that conveys fear. At 00:00:03.86 , she begins
to turn back, and by 00:00:06.12 , she has completed turning her head forward again, with
her hair flowing in the direction of her head movement and starts to pick up the pace. As she begins
walking faster, her loose hair flows and bounces, reflecting her hurried manner of walking.
The lighting is natural, and the camera follows the woman's movement, capturing her only from the waist
up. The background is blurred with a bokeh effect.
Comment: Strong visual detail, with clear timestamped and camera movement.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/131239-clip_00000000_0.mp4
The scene is set in the afternoon in a jungle, featuring a paved path slightly covered with rocks and
surrounded by tall trees and short grass, with sunlight filtering through the foliage. The video begins with
the subject out of frame. The subject is a thin white female with dark, short, and curly hair, running for
exercise while wearing black capri pants and a light maroon long-sleeve t-shirt, along with gray sneakers
and white socks.
At the 00:00:00.58 mark, the subject enters the frame from the left side, running in the
same direction as the camera. Upon entering the frame, the subject gradually overtakes the camera, and
the distance between her and the camera increases as the video progresses.
Comment: Clear and focused description of environment and movement, timestamps well used and
camera movement well described in relation with the subject
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/253997-clip_00000000_0.mp4
The scene begins with a female subject who is in her early twenties, is thin, and has light skin and dark hair
styled in a ponytail that reaches the length of her back. She is wearing beige trousers and a black and
white Aztec diamond-patterned overshirt over a white T-shirt, along with silver drop earrings. She stands
beside a river that flows parallel to the direction of the camera. On the left side of the frame is the subject,
who is standing atop small grass-covered rock formations and tree roots, with foliage of varying heights in
the background and a tree about 8 feet behind her. The right side of the frame is dominated by the flow of
the river, which is pale blue in color and has a rock protruding from the surface. In the distance, on the
ground of the right side of the frame next to the river, there is some foliage.
As the video progresses, the foliage sways slightly due to the wind, along with the overshirt of the female
subject. The subject displays an expression of calmness and pleasure in the clip, momentarily closing her
eyes as well. At the 00:00:02.39 mark, she drops her hand, which was originally over her
head, and looks towards the camera. At 00:00:05.86 , she raises her hand to her collarbone
while tilting her head and closing her eyes. Afterwards, she lowers her hand and wraps it around her waist.
During the length of the video, her right hand remains stationary and rests beside her leg. At
00:00:05.91 , as the camera pans to the right, a second rock protruding from within the
river comes into view.
The camera is positioned at a fixed distance from the subject but occasionally pans around and changes
slightly. The scene is illuminated primarily by warm sunlight, as well as by light reflected off the water.
Comment: Highly detailed and well structured annotation, with clear timestamped actions, and accurate
camera behavior well described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/199735-clip_00000000_0.mp4
The scene takes place indoors in an office, featuring a white man in his forties with gray hair and a short
beard. He is wearing a blue, long-sleeve button-down shirt and gray pants, and he is wearing glasses. He is
seated at his gray desk, operating a desktop computer with his left hand while taking notes with his right
hand. A metal table lamp, which is lit, is positioned to the right of the monitor. In the background, a glass
wall consists of white beams serving as panels, through which another office, mostly obscured and
illuminated with neon blue light, can be seen. A white blonde woman is seated at her desk, facing to the
right of the frame.
Between 00:00:00.00 and 00:00:01.42 , he looks back and forth between
his notebook and the monitor before focusing on the computer screen. Over his right shoulder, on the glass
wall in the background, is a sketch of the front of a car, while over his left shoulder is a sketch of the top
view of the car from an angle. At 00:00:07.10 , he stops taking notes and presses a single
key on the keyboard, looking at the camera with a neutral expression.
The majority of the lighting is dim and artificial, with a blue hue, likely projecting from ceiling lights.
Additionally, the scene is illuminated by a table lamp which is projecting warm light on to the desk of the
main subject. Further, there are neon blue light fixtures in the background that are adding illumination. The
camera zooms in slowly throughout the scene.
Comment : detailed and objective description with well used timestamps, great camera and lighting
condition description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/376414-clip_00000000_0.mp4
The scene is set at sunset in a forest with dense trees in the background, featuring a patch of trees on the
left side of the frame that has a few bright yellow leaves. Small clouds of smoke linger behind the subject,
who is a bald, slender, gray-haired monk in his 50s, wearing a deep red Buddhist robe. The monk is seated
still with his eyes closed, wearing a neutral expression while meditating.
He maintains this position for the duration of the clip, while the camera gradually moves closer to him as
the clip progresses.
The camera captures only the monk from the waist up and zooms in at an upward angle. The scene is
illuminated by the warm natural sunlight from sunset through the trees and leaves in the background.
Comment: Strong environmental detail and lighting description, camera movement clearly described.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/374764-clip_00000000_0.mp4
The scene is set on a street surrounded by various buildings. The off-white building on the left side of the
frame features revivalist architecture and extends into the distance, while the one on the right has modern
architecture and is also off-white in color. The subject is Black, with curly hair and a high fade haircut. He is
wearing a thin grey turtleneck sweater and a black backpack with yellow and black steel zippers, along
with a leather strap chronograph wristwatch and a face mask. In the distance, there are modern
skyscrapers primarily made of glass.
The video begins with the subject looking down at the screen of his phone, which he is holding with both
hands while typing. He is standing next to the upper part of the stairway leading to the subway below.
There is a black sign overhead the staircase indicating the name of the station, i.e., 34 St-Penn Station, in
white and blue font. The middle rail of the stairway is slightly visible on the left side of the frame.
At the 00:00:03.05 mark, a man wearing a white and blue polo shirt and a white hat starts
to walk across the bridge, on the wall of which the aforementioned sign is attached. At the
00:00:05.97 mark, a steel grey hatchback drives across the road into the distance from the
right side of the frame.
The camera is handheld at a fixed position, capturing the subject from the waist up. In terms of lighting,
the environment is naturally sunlit with overcast clouds.
Comment: Great details for the character and well timestamped actions, camera and lighting condition
are described clearly.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/378498-clip_00000000_0.mp4
The scene begins with a young Black man standing and clutching thin window bars in a dark, poorly lit
room, looking through them at the outdoor environment that appears to contain trees in the distance
beyond a large empty courtyard. He is standing in the right half of the frame. The subject is wearing a
raglan half-sleeve T-shirt with a white torso and black sleeves, along with black true wireless earphones.
He has short, curly black hair and a low fade haircut, and a neutral expression on his face as he peers
through the window bars.
At the 00:00:02.03 mark, the camera moves forward enough to crop out the background
of the interior while getting closer to the subject, who is only visible from above the shoulders, along with
his forearms and hands.
The camera is positioned to the left side of the subject. The scene is primarily illuminated by sunlight and
the white tube light in the interior.
Comment : Well balanced annotation, objective and great character description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/275926-clip_00000000_1.mp4
The video shows a person on the left side of the screen, kneeling on a wooden plank surface as he plants
tubers into the dark soil. He is wearing blue jeans, a gray knitted sweater, and black gloves. Beside his knee
is a white container holding the tubers. The soil, positioned on the right side of the screen, has been dug
into a trench, forming a small mound along the edge. One tuber is already planted at the far end.
At 00:00:01.10 , the person places a tuber into the soil. Then he moves his hand into the
container, picks another one, and plants it at 00:00:04.87 . He repeats this process,
steadily working along the dugout space. The camera starts off still, then gradually tilts upward, capturing
the scene.
Comment : Great scene description with accurate layout and it is objective.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/291922-clip_00000000_1.mp4
The video is set indoors in a studio, with a plain black background that has a round light source attached in
the middle to cast a warm light, making the subject appear as a silhouette. From what can be inferred, the
subject is wearing waist-high flare pants and a tucked-in long-sleeve shirt. The subject has short hair, and
in this scene, they are performing hand combat moves using a small knife.
At 00:00:00.00 , the subject is facing their body towards the camera while turned to their
left, wielding the knife in their right hand. They raise the knife upwards with an arm movement while
throwing a punch with their other hand until the 00:00:00.26 mark. Afterwards, they take
a neutral stance and proceed to swing the blade over and then under their hand to their right side while
executing an additional move in that direction with the blade at eye level at 00:00:02.54 .
At 00:00:03.89 , the subject adopts another neutral stance, keeping the blade and their
free hand close to their body while looking to the right. By 00:00:05.09 , they proceed to
throw another combination involving a low blow and another strike at eye level while facing their right
side. They repeat the combination one more time, this time having their left arm resting against the side of
their waist by 00:00:06.85 . The video concludes with the subject initiating the
aforementioned combination again in the same direction, which they start for the last time at
00:00:07.50 , this time having their left arm raised to their head in a defensive position.
The camera remains static, with the subject placed in the middle of the frame the entire time. The scene is
lit using a single warm elliptical light source in the background.
Comment: Well-detailed and structured annotation with good timestamped actions and an objective
tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/285583-clip_00000000_1.mp4
A woman in her twenties with a fair skin tone and golden brown hair tied behind her head swims toward
the foreground right of the frame wearing a black scuba diving suit and yellow gloves. The oxygen
cylinder behind her back has a pale yellow shade. She holds a small metallic silver rectangular object
between her left index finger and thumb. The flaps on her feet are also black. She is visible in the middle of
the screen at 00:00:00.00 in a horizontal swimming position with her body extending from
the background toward the foreground right and her face turned toward the camera. Water bubbles are
visible rising above her head from both sides of her face.
Another person wearing the same is visible in the background. They are also swimming toward the
camera.
The surrounding is filled with bluish water with a rocky bottom covering most of the screen in the
foreground in the bottom half and the middle ground in the top half. They gradually become blurred
toward the background. The rocks in the foreground are visibly covered with thin patches of green algae.
Small gray fishes swim around the woman and are visible near the top edge.
As the video starts, the woman and the person in the background continue to swim and two fishes from the
top swim and come near the foreground on the right side. They have small yellow tails, a white body with a
black strip on top. The woman then starts turning her head at 00:00:00.98 to look at the
fishes. Her eyes then follow the movement of a fish that starts swimming toward the bottom from the front
of the woman. She also drops the object from her hand at 00:00:01.25, which is revealed to
be attached with a black stick extending from her. She then extends her right hand to touch the fish
and moves her hand near the fish such that the fish touches the back of her hand at
00:00:02.91. The woman's gaze follows her.
The camera pans a bit toward the left at 00:00:03.83 and then again starts panning to the
right while moving a bit upward. The woman stops swimming and stays at one place from
00:00:04.33 as she looks at the camera. She is very close to the camera and is visible
mostly on the left half of the screen at the end at 00:00:05.11.
Comment: Well-detailed and accurate annotation with great use of timestamps on actions and a clear camera
description.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/208037-clip_00000000_0.mp4
The video shows an older woman in an outdoor setting surrounded by various flowers and green plants.
The flowers and vases, in shades of brown, black, and gray, are lined up on the right side of the screen
against a dark-framed wall. Some flowers are nestled among green plant stems, with white, pink, and a
hint of purple flowers visible. Behind the woman, there’s a brown brick wall and a gray door frame on the
left, leading to another area with billowing plants.
The woman has short white hair and is dressed in an off-white shirt with buttons down the front, paired
with blue jeans. She wears silver dangling earrings in her right ear. Initially bending over, she straightens
up, at pulls away from a black vase, and at 00:00:01.60 places her left hand on a small
brown vase. At 00:00:02.77 she touches a white flower with her left hand and another with
her right, smiling as she admires them. Finally, at 00:00:08.40 she stretches her right hand
to caress one of the green leaves.
Comment: Well-detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/239838-clip_00000000_0.mp4
A woman walks on a flat sandy terrain outdoors. She wears a wide-brimmed hat, sunglasses, a face mask
pulled below her chin, a rust-colored jacket over a pink t-shirt, and a backpack which she is wearing on her
shoulders. She also has long dark hair.
The background features a desert-like surface with vehicle tire marks behind her. On the left side of the
frame, a parked silver SUV with a black roof cargo bag is visible. The sun is low on the horizon behind her,
creating natural lighting and casting shadows on the ground.
At 00:00:00.00 she starts walking forward while holding her backpack straps.
At 00:00:01.79 she turns her head slightly to the right while continuing to walk. She continues
to turn her head until she looks almost behind her, with the sun now reflecting on her face, at
00:00:07.33.
The camera remains static, capturing the subject in a wide shot for the duration of the video. The scene is
illuminated by the sunlight emitted from the setting sun.
Comment: Well-detailed annotation with accurate timestamped actions and an objective tone.
http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/155361-clip_00000000_0.mp4
The scene is set inside a gymnasium with large window panels on the upper walls, allowing ample sunlight
to stream in, creating a cooler tone. Below the windows are two wall-mounted air conditioners positioned
far apart, and an LED scoreboard displays a score of 12:11. The floor of the gymnasium is painted blue,
with a green area in the middle. The two subjects are fencers in standard white fencing attire, engaged in a
duel, each having a cable attached to the back of their attire.
The video begins with the fencers standing face to face, legs wide apart, holding up their foils and looking
for an opportunity to strike. The fencer on the right, whose body is facing the camera, is advancing toward
the fencer on the left, who is facing away from the camera and gently stepping back. At the
00:00:01.67 mark, the fencer on the left lifts his foil, and at 00:00:02.14,
both fencers lower their foils to feint at each other. The fencer on the right has advanced further to the
left of the frame, while the fencer on the left continues to retreat. At 00:00:04.42, the
fencer on the left raises his foil, while the fencer on the right lowers his once again. At
00:00:04.57, the fencer on the right retreats with his lowered foil and strikes forward,
moving his body ahead and taking a wide step to make contact with his rival. The fencer on the left lowers
his foil in an attempt to parry the attack, but ultimately fails.
The camera dollies to the left of the frame at a constant speed, tracking the movement of the subjects.
The environment is naturally lit due to the sunlight coming through the windows.
Comment: Well detailed movement breakdown, timestamps well placed and camera clearly described.
Bad task examples:
1. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/177736-clip_00000000_0.mp4
"00:00:00:01 The scene shows a fair skinned woman with blonde long hair in a brightly lit
room looking down at her phone during the day.
She is wearing a black sleeveless dress and holding a phone with a brown pouch on her right hand.
00:00:00:04 She is seen at the middle of the screen bending her head downward leaning a
bit more into the phone.
00:00:01:74 She moves a little bit backing out of the phone
00:00:02:96 She tilts her head to the left side of the screen and is seen smiling at the phone
00:00:05:92 She tilts her head back to the right side of the screen and holds the phone with
both her hands.
Sunlight can be seen at the back of the scene."
Comment: The first timestamp is incorrect because it doesn't correspond to an actual action. Timestamps should
only be used to mark specific moments when something happens.
In addition, the caption misses several key visual details. It does not mention the blurry background or the
object visible on the wall. There's also no description of the camera work, such as the angle, movement, or
framing, which are important for understanding how the scene is visually presented.
Important visual elements are also omitted, including the light reflecting on her face, her eye color, and her
eyelashes. Furthermore, the caption fails to note the moment when she closes her mouth, which is a clear
and noticeable action that should have been included.
2. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/347440-clip_00000000_0.mp4
00:00:04.54 Close-up of a street-side coffee vendor, visible only from the midsection. He’s
wearing a blue shirt with black buttons. With his right hand, he pours freshly brewed coffee into a small
white wine cup wrapped in crystals that he's holding with his left hand. In the bottom right corner of the
frame, an ornate coffeepot with a crystal lid and a small ornate plate are visible.
Comment: The caption is too short and omits several important visual details. For instance,
there's no mention of the man's blue jeans, his dark blue, short-sleeve shirt with dotted patterns,
or his hairy hands—all of which are clearly visible and contribute to the scene’s realism.
Additionally, key background elements are missing: the blurry area to his left reveals a circular
floor design, and there's a stainless steel coffee stand that goes unmentioned. The small coffee
cup is described inadequately—it features a white plastic upper part, a shiny, crystalline lower
part, a scooping spoon, and a distinct gold band separating the two sections, but none of this is
captured in the caption.
To his right, there’s a truncated stainless steel bowl resting on a brown wooden stool, and in front
of it, a black, wrapped cloth becomes visible as the camera pans. These are also excluded from
the description.
The caption also lacks an action timestamp to indicate when the man is scooping coffee into the
cup, which is a key moment, and the one timestamp it does include does not point to an action.
Finally, it fails to mention the natural lighting conditions and the daytime setting.
3. http://ai-lumalabs-uber-labelling.s3-us-west-1.amazonaws.com/avlm_benchmark/309939-clip_00000000_1.mp4
00:00:00.00 The right profile of a brown-skinned boy is seen on the left side of the frame,
being instructed on how to play the violin. He is wearing a short-sleeved navy blue collared shirt and is
visible only from his chest to his brows. In the slightly blurry background, a mid-sized plant sits in front of a
large window. Everything is painted white. Outside the window, green trees are visible.
00:00:01.55 A woman's hands appear in the bottom right corner of the frame, gently
assisting the boy's right hand as he uses the bow on the violin.
Comment: The first timestamp does not correspond to any specific action. Additionally, it is
inaccurate to describe the boy as brown-skinned; he is fair or light-skinned. It's also subjective to
assume the hands shown belong to a woman when there’s no clear indication of gender; it's more
appropriate to refer to them simply as fair-skinned hands.
Several important visual and contextual details are missing. There is no mention of the violin's
placement, the finger movements of the boy’s left hand, or his focused gaze on the instrument
while playing. The camera movement, angle, and framing are also left out, which are important
for understanding how the scene is presented.
The caption omits the actions of the other person’s hands as well: at first, both hands come into
view holding the bow; then, the right hand is removed, and the left hand remains on the bow,
assisting him in practicing until the video ends.
Additional descriptive elements are also missing, such as the boy’s hair length or color, the
natural lighting in the room through the window, and the color of the violin.
Follow this order: the first sentence should summarize the video, including the setting and the featured character; then cover the main subject, the subject's skin tone and ethnicity, the subject's full outfit, the actions performed by the subject, background information, lighting, and camera positioning. I will upload a screenshot for you to start. The lighting and the camera angle should each be described in one line.
State whether the subject is a man, a woman, or a child. Do not use "what appears to be," "appears to be," or "suggesting" in your descriptions, as these show a lack of confidence.">
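The wording rule above lends itself to a quick automated check before a caption is submitted. The following is only a minimal sketch in Python: the three banned phrases are taken from the guideline, while the helper name `find_hedging_phrases` and the matching approach (case-insensitive, whole words) are assumptions made for illustration, not part of any existing review tool.

```python
import re
from typing import List

# Phrases the guideline asks annotators to avoid (taken from the text above).
BANNED_PHRASES = ("what appears to be", "appears to be", "suggesting")


def find_hedging_phrases(caption: str) -> List[str]:
    """Return every banned phrase that occurs in the caption.

    Hypothetical helper: matching is case-insensitive and on word
    boundaries, so "Suggesting" is caught but "suggestingly" is not.
    """
    found = []
    for phrase in BANNED_PHRASES:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, caption, flags=re.IGNORECASE):
            found.append(phrase)
    return found
```

Because "what appears to be" contains "appears to be", a caption using the longer phrase is reported under both entries. An empty list means the caption passes this one check; it says nothing about timestamps, completeness, or objectivity, which still need a manual review.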