
AI Diaries: Weekly Updates #2

Welcome back to our weekly blog series! 🎉 This week, we're continuing our journey through the exciting world of AI. We’ll explore fresh breakthroughs and significant developments that have captured our attention. 🌐 Plus, in this edition, we’ll also feature a special segment revisiting a pivotal moment in AI history, adding a touch of nostalgia to our futuristic exploration.


Stay tuned as we delve deeper!


Tiny but mighty: The Phi-3


TL;DR: Microsoft introduces phi-3-mini, a compact 3.8-billion-parameter language model that rivals much larger counterparts, posting impressive scores such as 69% on MMLU and 8.38 on MT-bench.



What's the Essence? The essence of phi-3-mini lies in its innovative approach to scalability and functionality within the constraints of mobile hardware. Traditionally, high-performance language models have required substantial computational resources, limiting their deployment to powerful servers and cloud platforms. Microsoft's phi-3-mini breaks this norm by compressing state-of-the-art AI capabilities into a model that can run locally on a smartphone.


This breakthrough is achieved through a focused optimization of the model's dataset and architecture. The phi-3-mini, with 3.8 billion parameters trained on 3.3 trillion tokens, uses a refined dataset of heavily filtered web data and synthetic data specifically created to boost performance at this smaller scale. This allows the model to achieve impressive benchmarks, rivaling larger models such as GPT-3.5 and Mixtral 8x7B, while retaining the advantage of local deployment without connectivity dependencies.


How Does It Tick? Phi-3-mini operates on a transformer decoder architecture, a prevalent structure in modern AI due to its effectiveness across language processing tasks. The model has been tailored for mobile devices by reducing its memory footprint: quantized to 4 bits, it occupies only about 1.8 GB. It supports a default context length of 4,000 tokens, with an extended version capable of handling up to 128,000 tokens for more complex tasks.
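
As a quick sanity check on that 1.8 GB figure, the arithmetic roughly works out from the published numbers alone. The snippet below is just back-of-the-envelope math in Python, not anything from the report's methodology, and it ignores activation memory, the KV cache, and runtime overhead.

params = 3.8e9            # phi-3-mini's parameter count
bits_per_weight = 4       # 4-bit quantization, as described in the report
weight_bytes = params * bits_per_weight / 8
print(f"Approximate weight memory: {weight_bytes / 1e9:.2f} GB")  # ~1.90 GB

The result, roughly 1.9 GB, lands close to the quoted ~1.8 GB; the exact figure depends on the quantization scheme and on which tensors are actually quantized.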


The training methodology employed for phi-3-mini is particularly notable. By leveraging a combination of high-quality, heavily filtered web sources and synthetic data, the model is trained in two sequential phases: the first focuses on general knowledge and language understanding, the second on logical reasoning and specialized skills. This approach deviates from traditional model training that prioritizes scale over data quality, allowing phi-3-mini to excel at reasoning tasks despite its smaller size.


Why Does It Matter? The significance of the phi-3-mini model extends beyond its technical achievements. By enabling powerful AI capabilities on mobile devices, Microsoft opens up a plethora of applications that were previously unfeasible due to hardware limitations. Users can now access sophisticated AI-driven features directly from their smartphones, enhancing both productivity and accessibility.


Moreover, the deployment of such technology has broader implications for data privacy and internet connectivity. With local processing, user data can remain on the device, reducing the risks associated with data transmission over the internet. Additionally, users in regions with poor internet access can still benefit from advanced AI features, democratizing technology access across different geographies.


Source: Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone


---


It's Finally Out!


TL;DR: Jesse Lyu, the CEO of Rabbit, introduced a new product called the Rabbit R1, powered by the company's Large Action Model (LAM). Distinct from traditional smartphones, the device interacts through natural language and is designed to simplify user interaction by performing tasks directly on the user's behalf, rather than through conventional apps.





What's the Essence? Rabbit R1 redefines user interaction with electronic devices by moving away from app-based interfaces to a natural language-driven experience. The foundation of this innovative approach is the Large Action Model (LAM), a new AI model that not only understands user commands but can execute actions across various applications seamlessly. Unlike current smart devices that operate within the constraints of their apps, R1 offers a direct, intuitive interaction method, reducing the time and steps required to perform tasks.


How Does It Tick? The core technology behind Rabbit R1 is the LAM, which bridges the gap between command recognition and action execution. The technology leverages neuro-symbolic programming to understand and perform user-directed actions on any software interface. Whether it's booking a ride, ordering food, or managing online transactions, LAM interprets the user's natural-language command and executes the necessary actions directly, without requiring the user to navigate through multiple apps or menus.
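
Rabbit has not published LAM's internals, so the snippet below is only a hypothetical Python illustration of the general pattern described above: a natural-language command is mapped to a structured action, which is then executed on the user's behalf. Every name in it (Action, parse_command, ACTION_HANDLERS) is made up for this sketch and has nothing to do with Rabbit's actual implementation.

# Hypothetical "command -> structured action -> execution" loop.
# This is NOT Rabbit's LAM; it only illustrates acting on a service
# directly instead of sending the user into an app.
from dataclasses import dataclass

@dataclass
class Action:
    name: str      # e.g. "book_ride", "order_food"
    params: dict   # slots extracted from the user's utterance

def parse_command(utterance: str) -> Action:
    """Toy intent parser; a real system would use a learned model."""
    if "ride" in utterance:
        return Action("book_ride", {"destination": utterance.split("to")[-1].strip()})
    if "order" in utterance:
        return Action("order_food", {"item": utterance.split("order")[-1].strip()})
    return Action("unknown", {})

ACTION_HANDLERS = {
    "book_ride": lambda p: f"Booking a ride to {p['destination']}...",
    "order_food": lambda p: f"Ordering {p['item']}...",
}

def execute(action: Action) -> str:
    handler = ACTION_HANDLERS.get(action.name)
    return handler(action.params) if handler else "Sorry, I can't do that yet."

print(execute(parse_command("get me a ride to the airport")))
print(execute(parse_command("order a margherita pizza")))

In the real product, the "handler" would be the LAM itself operating an actual application interface on the user's behalf, which is where the neuro-symbolic component comes in.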





Why Does It Matter? The introduction of Rabbit R1 and its underlying technology, LAM, marks a significant shift in how we interact with technology. It matters because it streamlines daily tasks, enhances productivity, and reduces the cognitive load of switching between different apps to complete actions. This technology could potentially redefine mobile computing, making technology more accessible and significantly more efficient at understanding and executing user intentions. Rabbit R1 is not just a new device but a new way of interacting with all our digital services, fundamentally changing our technological interactions towards more natural, intuitive experiences.


---


Mojo 24.2: Mojo Nightly, Enhanced Python Interop, OSS stdlib and more


TL;DR: Dive into the advancements in Mojo SDK 24.2 and see how its updated features, including improved Python interoperability, make for a more efficient coding experience.



What's the Essence? Mojo SDK 24.2 has been enhanced to offer significant improvements for Python developers. Key updates include the renaming of DynamicVector to List, making it more intuitive for Python users familiar with list operations. The new release is part of the MAX 24.2 rollout, and it includes several new features that facilitate easier and more efficient programming.


How Does It Tick? The Mojo SDK 24.2 introduces several key features aimed at simplifying the programming experience:

  1. List (formerly DynamicVector): This renaming aligns more closely with Python's list, making it intuitive for developers to perform list operations in Mojo.

from random import random_si64

def stack_sort(list: List[Int]) -> List[Int]:
    stack = List[Int]()
    input_list = List[Int](list)  # Create a copy of the input list
    sorted_list = List[Int]()  # This will hold the sorted elements

    # Keep the auxiliary stack ordered so its top is always its smallest element
    while len(input_list):
        temp = input_list.pop_back()
        while len(stack) and stack[-1] < temp:
            input_list.append(stack.pop_back())
        stack.append(temp)

    # Popping the stack now yields the elements in ascending order
    while len(stack):
        sorted_list.append(stack.pop_back())
    return sorted_list

def print_list(list: List[Int]):
    print('List', '[', sep=': ', end=' ')
    for i in range(len(list)):
        print(list[i], end=' ')
    print("]", end='\n')

my_list = List[Int]()
for i in range(5):
    my_list.append(int(random_si64(0, 100)))

print("Original")
print_list(my_list)
sorted_list = stack_sort(my_list)
print("Sorted")
print_list(sorted_list)
  2. Enhanced Python interop: Python modules such as numpy and matplotlib can be imported and called directly from Mojo, as the example below shows, making it easier to integrate Mojo with existing Python projects.

from python import Python
from math.polynomial import polynomial_evaluate

np = Python.import_module("numpy")
plt = Python.import_module("matplotlib.pyplot")

alias dtype = DType.float64
alias coeff = List[Scalar[dtype]](-6, 3, -2, 1)    # x^3 - 2x^2 + 3x - 6, coefficients in ascending order
alias deriv_coeff = List[Scalar[dtype]](3, -4, 3)  # its derivative: 3x^2 - 4x + 3

# Evaluate the polynomial and its derivative in Mojo, then plot with matplotlib
x_vals = np.linspace(-2, 3, 100)
y_vals = PythonObject([])
y_deriv_vals = PythonObject([])

for i in range(len(x_vals)):
    y_vals.append(polynomial_evaluate[dtype, 1, coeff](x_vals[i].to_float64()))
    y_deriv_vals.append(polynomial_evaluate[dtype, 1, deriv_coeff](x_vals[i].to_float64()))

plt.figure(figsize=(10, 6))
plt.plot(x_vals, y_vals, label="Polynomial: x^3 - 2x^2 + 3x - 6")
plt.plot(x_vals, y_deriv_vals, label="Derivative of the Polynomial")
plt.xlabel('x')
plt.ylabel('y')
plt.title("Plot showing root of x^3 - 2x^2 + 3x - 6")
plt.legend()
plt.grid(True)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0, color='black', linewidth=0.5)
plt.axvline(x=2, color='r', linestyle='--', linewidth=2)  # mark the root at x = 2

plt.plot(2, 0, marker='o', markersize=15, color='b')
plt.text(2.1, -2, 'Root', fontsize=12, color='b')

plt.show()
  3. Enhanced copy functionality: new JavaScript-based copy support lets users copy code snippets, like the ones above, directly from the browser, simplifying the workflow.


Why Does It Matter? The updates in Mojo SDK 24.2 are crucial for developers seeking to optimize their coding processes. The changes not only enhance the usability of the SDK but also promote better integration with Python, a widely used programming language. This makes Mojo a valuable tool for developers looking to leverage Python's capabilities within its robust framework.



---


EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions


TL;DR: Emote Portrait Alive utilizes innovative audio-to-video diffusion technology to transform audio cues into expressive, realistic video portraits without the need for complex modeling.



What's the Essence? EMO is fundamentally an audio-driven video generation tool that allows the creation of vivid, lifelike portrait videos from a single reference image and audio input, such as a voice recording. The core achievement of EMO is its ability to generate videos that not only speak or sing but do so with a range of facial expressions and head movements that genuinely match the nuances of the audio input.


In the initial stage, termed Frames Encoding, ReferenceNet extracts features from the reference image and the motion frames. In the subsequent Diffusion Process stage, a pretrained audio encoder produces the audio embedding, and the facial region mask is integrated with multi-frame noise to govern where facial imagery is generated. The Backbone Network then carries out the denoising operation. Within the Backbone Network, two forms of attention are applied: Reference-Attention, which preserves the character's identity, and Audio-Attention, which modulates the character's movements. Finally, Temporal Modules operate along the temporal dimension to adjust the velocity of motion.
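
To make that data flow easier to follow, here is a deliberately tiny, shape-only PyTorch sketch of a single denoising step. The module names mirror the description above, but every layer is a stand-in (plain linear and attention layers), so treat this as an illustration of how the pieces connect rather than the actual EMO architecture.

# Toy, shape-only sketch of one EMO-style denoising step (PyTorch).
# All layers are stand-ins; only the wiring mirrors the paper's description.
import torch
import torch.nn as nn

B, F, T, D = 1, 8, 77, 64    # batch, video frames, audio tokens, feature dim
HW = 16                      # flattened spatial positions per frame (toy)

reference_net = nn.Linear(D, D)          # extracts features from the reference image
audio_encoder = nn.Linear(D, D)          # pretrained audio encoder (stand-in)
reference_attn = nn.MultiheadAttention(D, 4, batch_first=True)   # preserves identity
audio_attn = nn.MultiheadAttention(D, 4, batch_first=True)       # drives motion from audio
temporal_module = nn.MultiheadAttention(D, 4, batch_first=True)  # smooths across frames
denoiser = nn.Linear(D, D)               # Backbone's noise-prediction head (stand-in)

# Inputs: reference image features, audio, multi-frame noise, face-region mask
ref_feats = reference_net(torch.randn(B, HW, D))
audio_emb = audio_encoder(torch.randn(B, T, D))
noisy_latents = torch.randn(B, F, HW, D)
face_mask = (torch.rand(B, F, HW, 1) > 0.5).float()

# Face-region mask gates where generation is most strongly controlled (toy integration)
x = noisy_latents * face_mask + noisy_latents * (1 - face_mask) * 0.5

# Per-frame: attend to reference features (identity), then to the audio embedding (motion)
x = x.reshape(B * F, HW, D)
x, _ = reference_attn(x, ref_feats.repeat(F, 1, 1), ref_feats.repeat(F, 1, 1))
x, _ = audio_attn(x, audio_emb.repeat(F, 1, 1), audio_emb.repeat(F, 1, 1))

# Temporal module: attend across frames at each spatial position
x = x.reshape(B, F, HW, D).permute(0, 2, 1, 3).reshape(B * HW, F, D)
x, _ = temporal_module(x, x, x)
x = x.reshape(B, HW, F, D).permute(0, 2, 1, 3)

predicted_noise = denoiser(x)            # what the diffusion step would subtract
print(predicted_noise.shape)             # torch.Size([1, 8, 16, 64])

In the full pipeline this step would be repeated across many diffusion timesteps; the sketch after the list below shows what that outer loop looks like.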

How Does It Tick? The mechanism behind EMO leverages a direct audio-to-video diffusion model, a sophisticated form of machine learning that processes audio signals to generate corresponding video frames. This process involves several innovative components:


  1. Diffusion Models: These are at the heart of EMO's capability, enabling the generation of detailed, high-resolution images and video. During training the model learns to remove noise that has been added to the data; at generation time it starts from noise and progressively refines it until the output closely aligns with the emotional and physical cues in the input audio.

  2. Stable Control Mechanisms: To keep the videos free of common issues such as jittering or facial distortions, EMO incorporates control mechanisms like speed controllers and face-region controllers. These provide subtle cues to the model at every step, helping maintain the stability and continuity of the video output (see the sketch after this list).

  3. Frame Encoding and Backbone Network: These components work together to maintain the identity of the character across the video while accommodating the dynamic expressions and movements dictated by the audio.
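
Continuing the same toy setup, here is roughly what the outer loop of item 1 looks like when the control signals from item 2 are fed in as extra conditioning at every step. This is a generic, heavily simplified reverse-diffusion loop in PyTorch, not EMO's actual sampler; the speed and face-mask tensors are stand-ins for the paper's speed and face-region controllers.

# Generic reverse-diffusion loop with extra control signals (illustrative only).
import torch
import torch.nn as nn

steps = 50                                        # number of denoising iterations
denoiser = nn.Linear(64 + 2, 64)                  # stand-in noise predictor

speed = torch.full((1, 16, 1), 0.8)               # toy head-motion speed control
face_mask = (torch.rand(1, 16, 1) > 0.5).float()  # toy face-region control

x = torch.randn(1, 16, 64)                        # start from pure noise
for _ in range(steps):
    cond = torch.cat([x, speed, face_mask], dim=-1)  # inject control signals each step
    predicted_noise = denoiser(cond)
    x = x - (1.0 / steps) * predicted_noise          # crude denoising update

print(x.shape)  # torch.Size([1, 16, 64])

The point the controllers add is that the same conditioning is applied consistently at every step, which is what keeps the generated frames stable rather than jittery.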


Why Does It Matter? The implications of EMO are profound in several domains:

  1. Entertainment and Media: For filmmakers and content creators, EMO offers a tool to create more engaging and emotionally resonant content without the need for extensive motion capture setups.

  2. Virtual Reality and Gaming: EMO can be used to generate realistic avatars that respond dynamically to user interactions, making virtual experiences more immersive.

  3. Telecommunications: In video conferencing, EMO could enable more expressive avatars or assist in bandwidth-limited situations by generating talking heads in real-time.


Moreover, the ability to produce such expressive videos with minimal input and setup could revolutionize how we interact with digital content, making it more accessible, customizable, and engaging.


