
AI Diaries: Weekly Updates #3

🎉 Welcome back to our beloved blog series! ✹ After a little hiatus, we're thrilled to reconnect with you and dive back into the dynamic realm of AI. đŸ€–đŸŒŸ This time around, we'll uncover the latest advancements and intriguing innovations that have recently emerged. 🚀 In this edition, we also have a treat in store with a retrospective glance at a groundbreaking milestone in AI, blending a bit of history with our forward-looking insights. 🔍


Join us as we embark on another exciting journey! 🚂



Nvidia Eclipses Microsoft as World's Most Valuable Company


TL;DR: Nvidia has surpassed Microsoft to become the world's most valuable company, driven by skyrocketing demand for its AI processors. The company's market capitalization reached over $3.3 trillion, reflecting its critical role in the AI revolution.



What's The Essence?: Nvidia, a leader in the semiconductor industry, has achieved a significant milestone by becoming the world's most valuable company. Its high-end processors, crucial for artificial intelligence (AI) technologies, have propelled it past tech giant Microsoft. This shift underscores Nvidia's pivotal position in the rapidly evolving AI landscape.


How Does It Tick?: Nvidia's ascent is fueled by the surging demand for AI chips, which are essential for training and deploying AI models. The company's stock has nearly tripled in 2024, highlighting investor confidence in its growth potential. Nvidia's advanced graphics processing units (GPUs) are indispensable in AI applications, making the company a central player in the tech industry's race to dominate AI.


Figure: Market capitalization chart (Source: LSEG; created by Thomson Reuters).

Why Does It Matter?: This development is significant because it marks a shift in the tech hierarchy, with Nvidia's market cap exceeding $3.3 trillion. It highlights the growing importance of AI in the global economy and Nvidia's role in shaping this future. As AI continues to integrate into various sectors, Nvidia's technological advancements and market leadership will likely influence the direction of innovation and economic growth in the years to come.


---


Meta's New AI Research Models: Accelerating Innovation at Scale


TL;DR: Meta’s Fundamental AI Research (FAIR) team is releasing new AI models to advance research and innovation. These cover multi-modal text-and-image processing, more efficient language prediction, music generation, AI speech detection, and improved diversity in text-to-image generation.


What's The Essence?: Meta's AI team, FAIR (Fundamental AI Research), has publicly released several advanced AI models designed to boost innovation and collaboration within the AI community. These models cover various aspects of AI, including text and image processing, language prediction, music generation, AI speech detection, and diversity in image generation.


How Does It Tick?:


  • Meta Chameleon: A multi-modal model that processes and generates both text and images simultaneously. It can create content by combining text and images, allowing for creative and diverse outputs.

  • Multi-Token Prediction: This approach improves the efficiency of large language models by predicting multiple upcoming words at once rather than one at a time, reducing the amount of training data required and accelerating the learning process (see the sketch after this list).

  • JASCO (Joint Audio and Symbolic Conditioning): A text-to-music model that accepts various inputs like chords or beats, offering greater control over the generated music. It enhances the creative possibilities in music generation.


Figure: The top panel shows the temporal blurring process (source separation, pooling, and broadcasting); the bottom panel gives a high-level overview of JASCO.


  • AudioSeal: An AI speech detection tool that uses localized detection for faster and more efficient identification of AI-generated speech segments, aiding in real-time applications (an illustrative sketch of this sample-wise detection idea follows this list).


Figure: The generator is made of an encoder and a decoder, both derived from EnCodec’s design, with optional message embeddings. The encoder includes convolutional blocks and an LSTM, while the decoder mirrors this structure with transposed convolutions. The detector is made of an encoder and a transposed convolution, followed by a linear layer that computes sample-wise logits; optionally, multiple linear layers can be used to calculate k-bit messages.


  • Diversity in Text-To-Image Generation: Meta has developed indicators to evaluate and improve geographical and cultural diversity in AI-generated images, ensuring better representation and inclusivity.
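Of the releases above, Multi-Token Prediction is the easiest to picture in code. The sketch below is a minimal, illustrative PyTorch version of the core idea only, not Meta's implementation: a shared trunk produces one hidden state per position, and several small heads each predict the token a different number of steps ahead, so every position contributes several learning signals at once. The class names, layer sizes, and loss wiring are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared trunk + one output head per future offset (illustrative only)."""
    def __init__(self, vocab_size=32000, d_model=512, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for a full transformer language-model trunk.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        # Head i predicts the token (i + 1) positions ahead of each input position.
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def forward(self, tokens):                                  # tokens: (batch, seq)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.trunk(self.embed(tokens), mask=causal)         # shared hidden states
        return [head(h) for head in self.heads]                 # n_future x (batch, seq, vocab)

def multi_token_loss(model, tokens):
    """Average of cross-entropies, one per future offset."""
    losses = []
    for i, logits in enumerate(model(tokens), start=1):
        pred = logits[:, :-i, :]                                # position t predicts token t + i
        target = tokens[:, i:]
        losses.append(F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1)))
    return sum(losses) / len(losses)

tokens = torch.randint(0, 32000, (2, 128))                      # toy batch of token ids
loss = multi_token_loss(MultiTokenPredictor(), tokens)
```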
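AudioSeal's localized detection can be sketched in a similarly hedged way. The stand-in below only mirrors the idea described in the figure caption above: a small convolutional encoder downsamples the waveform, an upsampling step returns to sample resolution, and a linear layer produces one logit per audio sample, so suspect segments can be localized rather than classifying whole clips. The layer sizes, class name, and threshold are invented for illustration and are not Meta's detector.

```python
import torch
import torch.nn as nn

class SampleWiseDetector(nn.Module):
    """Illustrative per-sample detector in the spirit of AudioSeal's localized detection."""
    def __init__(self, channels=32):
        super().__init__()
        # Two strided conv blocks downsample the waveform by 16x in total.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=7, stride=4, padding=3),
            nn.ELU(),
            nn.Conv1d(channels, channels, kernel_size=7, stride=4, padding=3),
            nn.ELU(),
        )
        # Transposed convolution brings features back to sample resolution.
        self.upsample = nn.ConvTranspose1d(channels, channels, kernel_size=16, stride=16)
        self.classifier = nn.Linear(channels, 1)   # one logit per audio sample

    def forward(self, wav):                                        # wav: (batch, 1, samples)
        feats = self.upsample(self.encoder(wav))                   # (batch, C, samples)
        logits = self.classifier(feats.transpose(1, 2)).squeeze(-1)
        return torch.sigmoid(logits)                               # per-sample probability

wav = torch.randn(1, 1, 16000)            # one second of 16 kHz audio
probs = SampleWiseDetector()(wav)         # per-sample detection scores
flagged = probs > 0.5                     # boolean mask localizing suspect segments
```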


Why Does It Matter?: Meta’s release of these AI models underlines their commitment to open research and collaboration in the AI community. By making these models publicly available, Meta aims to drive innovation, enhance AI capabilities, and promote responsible development. The advancements in multi-modal processing, efficient language prediction, and improved diversity in AI outputs will have significant implications for various industries, from creative fields to tech development, fostering a more inclusive and innovative future.



---


Shaping the Future: Biden’s Approach to AI Regulation


TL;DR: President Biden's Executive Order 14110 aims to ensure the safe, secure, and trustworthy development and use of AI in the United States. This comprehensive directive addresses safety, responsible competition, consumer protections, and equity, but faces scrutiny over its enforceability and potential regulatory challenges.


What's The Essence?: President Biden's Executive Order 14110, titled “Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence,” is a landmark initiative to regulate AI technologies across the United States. It outlines key areas including safety, security, responsible competition, supporting American workers, furthering equity, and protecting civil liberties. The order is a response to growing concerns about AI's potential risks and aims to create a balanced framework for AI governance.


How Does It Tick?: The Executive Order outlines eight key areas for AI regulation:


  1. Safety and Security: Ensuring AI technologies are developed and used in ways that protect national security and public safety.

  2. Responsible Competition: Promoting fair competition in the AI industry to foster innovation.

  3. Supporting American Workers: Creating guidelines to protect workers and ensure job opportunities are not undermined by AI advancements.

  4. Furthering Equity: Addressing potential biases in AI systems to promote fairness and equality.

  5. Strong Consumer Protections: Safeguarding consumer rights and preventing the misuse of AI.

  6. Protection of Civil Liberties: Ensuring AI use does not infringe on individuals' civil rights and freedoms.

  7. Data Collection and Reporting: Aligning AI use-case data collection with federal laws.

  8. Interagency Coordination: Encouraging collaboration among federal agencies to streamline AI governance.


Why Does It Matter?: The regulation of AI is crucial as these technologies become increasingly integrated into various sectors, from criminal justice to employment. The Executive Order represents a significant step towards establishing a comprehensive framework for AI governance. However, it also raises several concerns:


  • Enforceability: Critics argue that the order’s broad scope makes it challenging to enforce effectively, especially with a shortage of subject matter experts in the federal workforce.

  • Regulatory Crossfire: The potential for different agencies to claim authority over the same AI regulations could lead to conflicts and inefficiencies.

  • Barriers to Entry: Strict regulations might create hurdles for new companies and innovators, potentially stifling growth in the AI sector.


The Executive Order underscores the need for a balanced approach that fosters innovation while ensuring ethical standards and protections are in place. As AI technologies evolve, ongoing dialogue and adjustments to regulations will be essential to address new challenges and opportunities in this dynamic field.



---


Enhancing Mobile Vision Transformers with Separable Self-Attention


TL;DR: MobileViTv2 introduces a separable self-attention mechanism that reduces complexity and latency, achieving state-of-the-art performance in mobile vision tasks while running significantly faster than its predecessor, MobileViT.


Figure: Comparison of self-attention units. (a) shows standard multi-headed self-attention (MHA). (b) introduces token projection layers to MHA, reducing complexity but still using costly operations. (c) presents the proposed separable self-attention layer with linear complexity and faster element-wise operations.


What's The Essence?: MobileViTv2 builds on the foundation of MobileViT by replacing the multi-headed self-attention (MHA) mechanism with a separable self-attention method. This new approach reduces the computational complexity from quadratic to linear, making it more efficient for resource-constrained devices. With this enhancement, MobileViTv2 achieves better accuracy and speed across various mobile vision tasks.


Figure: Comparison of different attention units: Transformer and Linformer rely on resource-intensive batch-wise matrix multiplication, which slows down inference on resource-limited devices. The proposed separable self-attention avoids this, speeding up inference. The left panel shows the top-5 operations by CPU time in a single layer for k = 256 tokens; the top-right panel compares the complexity, and the bottom-right panel shows the latency as a function of the number of tokens k.

Figure: MobileViTv2 models outperform MobileViTv1 models in speed and accuracy across various tasks. This improvement is due to replacing multi-headed self-attention in MobileViTv1 with separable self-attention. Inference times were measured on an iPhone 12 with input resolutions of 256x256 for classification, 512x512 for segmentation, and 320x320 for detection.


How Does It Tick?: The separable self-attention mechanism in MobileViTv2 optimizes the self-attention process by using element-wise operations instead of batch-wise matrix multiplications. This change reduces the time complexity from O(k²) to O(k), where k is the number of tokens. The new method computes context scores with respect to a latent token, which are then used to re-weight the input tokens, encoding global information efficiently.


Figure: Example illustrating token interaction for global representation learning in attention layers. In (a), each query token computes dot-product similarities with all key tokens, which are normalized with softmax to form an attention matrix encoding contextual relationships. In (b), the inner product between the input tokens and a latent token L is computed and normalized with softmax to produce context scores; these scores weight the key tokens to generate a context vector encoding contextual information.
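To make this concrete, here is a minimal PyTorch sketch of a separable self-attention layer following the description above: a linear projection plays the role of the latent token and yields one context score per token, the score-weighted sum of key projections forms a single context vector, and that vector is broadcast back onto the value projections with element-wise operations. This is an illustrative re-implementation based on the paper's description, not the authors' code, and the layer names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    """Illustrative separable self-attention: linear in the number of tokens k."""
    def __init__(self, d_model=256):
        super().__init__()
        self.to_context_scores = nn.Linear(d_model, 1)   # latent-token branch: one scalar per token
        self.to_key = nn.Linear(d_model, d_model)
        self.to_value = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                         # x: (batch, k, d)
        # Context scores: inner product of each token with the learned latent token,
        # normalized over the k tokens with softmax.
        scores = F.softmax(self.to_context_scores(x), dim=1)      # (batch, k, 1)
        # Context vector: score-weighted sum of key projections -> one global summary.
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)   # (batch, 1, d)
        # Broadcast the global context back to every token with element-wise ops.
        out = F.relu(self.to_value(x)) * context                  # (batch, k, d)
        return self.out_proj(out)

x = torch.randn(2, 256, 256)              # batch of 2, k = 256 tokens, d = 256
y = SeparableSelfAttention(256)(x)        # same shape as the input
```

Because the only reduction over tokens is the softmax-weighted sum, the cost grows linearly with k instead of quadratically, which is what makes the layer attractive on resource-constrained mobile hardware.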


Why Does It Matter?: The introduction of separable self-attention addresses the key bottleneck in the original MobileViT, which struggled with high latency due to its complex MHA operations. By significantly reducing the computational load and improving inference speed, MobileViTv2 enables faster and more efficient processing on mobile devices without sacrificing accuracy. This advancement is crucial for deploying high-performance vision models in real-world applications where computational resources are limited.


Figure: Layer-wise visualization of context score maps c at various output strides.



---


If you've read this far, you're amazing! 🌟 Keep striving for knowledge and continue learning! 📚✹


