Welcome to our latest edition of AI Diaries: Weekly Updates! In this post, we dive into MIT's amber-like polymer for room-temperature DNA storage, Google's enhanced text-to-image model Imagen 3 and its SynthID watermarking toolkit, NVIDIA's NeMo Curator for LLM data curation, and the Polars 1.0 release. Stay tuned for in-depth insights and the latest trends shaping the future of AI and data science. Whether you're a tech enthusiast, a researcher, or a professional, this update is packed with valuable information to keep you at the forefront of innovation. Enjoy the read and don't forget to share your thoughts in the comments below!
Let's begin!
MIT's Revolutionary DNA Storage: A Game-Changer for Long-Term Data Preservation
TL;DR: MIT researchers have developed a new amber-like polymer that can store DNA at room temperature, protecting it from damage and enabling long-term data storage.
What's the Essence?: The innovative "T-REX" (Thermoset-REinforced Xeropreservation) method encapsulates DNA in a glassy, amber-like polymer that can hold human genomes or digital data such as photos and music without the need for freezing temperatures. The researchers successfully stored DNA encoding the Emancipation Proclamation, the MIT logo, and the theme music from "Jurassic Park."
How Does It Tick?: Using a thermoset polymer formed from styrene and a cross-linker, the researchers created a hydrophobic material that shields DNA from moisture and heat. DNA can be embedded in the polymer and later retrieved without damage by breaking the polymer down with cysteamine. The polymer protects DNA at temperatures up to 75 degrees Celsius (167 degrees Fahrenheit).
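How photos or music end up as DNA in the first place comes down to an encoding step before synthesis. The snippet below is a minimal, purely illustrative sketch (not the scheme used in the paper): it maps every two bits of a file to one of the four nucleotides and back again.

# Toy illustration: map binary data to a DNA sequence and back.
# Real DNA data-storage codecs add error correction and avoid long
# homopolymer runs; this only shows the basic 2-bits-per-base idea.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(data: bytes) -> str:
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def decode(sequence: str) -> bytes:
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

message = b"Jurassic Park"
dna = encode(message)          # a string of A/C/G/T, four bases per byte
assert decode(dna) == message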
Source: Elisabeth Prince, Ho Fung Cheng, James L. Banal, and Jeremiah A. Johnson, Journal of the American Chemical Society, 2024, 146 (25), 17066-17074. DOI: 10.1021/jacs.4c01925
Why Does It Matter?: This breakthrough offers a scalable, energy-efficient solution for DNA preservation, with potential applications in personalized medicine and digital data storage. It could revolutionize how we store genetic information and other critical data for future analysis.
---
Introducing Imagen 3: Greater Versatility, Higher Quality, and Better Text Rendering
Prompt: Photographic portrait of a real life dragon resting peacefully in a zoo, curled up next to its pet sheep. Cinematic movie still, high quality DSLR photo.
TL;DR: Imagen 3, Google's latest and most advanced text-to-image model, offers enhanced versatility, superior image quality, and improved text rendering. Available now for select creators on ImageFX, and coming soon to Vertex AI.
What's the Essence?: Imagen 3 sets a new standard in AI-driven image generation, boasting greater versatility, higher quality images, and better text rendering. It can produce a wide array of visual styles with improved detail and accuracy.
How Does It Tick?: Imagen 3 is designed to be Google's most capable text-to-image model yet: it understands natural language prompts better, generates higher-quality images, and renders text with greater precision. Here's how it excels:
Greater Versatility and Prompt Understanding: Imagen 3 excels at understanding natural language prompts, making it easier to generate desired outputs without complex prompt engineering. It supports diverse formats and styles, from photorealistic landscapes to whimsical claymation scenes.
Prompt: Claymation scene. A medium wide shot of an elderly woman. She is wearing flowing clothing. She is standing in a lush garden watering the plants with an orange watering can.
Higher Quality Images: This model generates visually rich, high-quality images with excellent lighting and composition. It can accurately render small details and complex textures, ensuring every image is crisp and detailed.
Prompt: A view of a person's hand as they hold a little clay figurine of a bird in their hand and sculpt it with a modeling tool in their other hand. You can see the sculptor's scarf. Their hands are covered in clay dust. a macro DSLR image highlighting the texture and craftsmanship.
Better Text Rendering: Imagen 3's improved text rendering capabilities open up new possibilities for creative uses, such as stylized birthday cards and professional presentations. Its ability to handle text within images has been significantly enhanced.
Prompt: A single comic book panel of a boy and his father on a grassy hill, staring at the sunset. A speech bubble points from the boy's mouth and says: The sun will rise again. Muted, late 1990s coloring style.
Why Does It Matter?: With Imagen 3, content creators, designers, and artists can explore new creative horizons, leveraging AI to produce stunning visual content effortlessly. Its integration into Google products like ImageFX and Vertex AI ensures broad accessibility and utility.
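Once Imagen 3 lands in Vertex AI, developer access will likely resemble the existing Imagen integration in the Vertex AI Python SDK. The sketch below is a rough, non-authoritative example based on that current SDK; the project ID and the Imagen 3 model identifier ("imagen-3.0-generate-001") are assumptions and may differ at launch.

import vertexai
from vertexai.preview.vision_models import ImageGenerationModel

# Sketch only: project ID and the Imagen 3 model name are assumptions.
vertexai.init(project="your-gcp-project", location="us-central1")

model = ImageGenerationModel.from_pretrained("imagen-3.0-generate-001")
images = model.generate_images(
    prompt=(
        "Photographic portrait of a real life dragon resting peacefully "
        "in a zoo, curled up next to its pet sheep. Cinematic movie still, "
        "high quality DSLR photo."
    ),
    number_of_images=1,
)
images[0].save("dragon.png")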
---
SynthID: Identifying AI-Generated Content
TL;DR: SynthID is a toolkit for watermarking and identifying AI-generated content, ensuring trust and transparency in digital media.
What's the Essence?: SynthID embeds digital watermarks into AI-generated images, audio, text, and video. These watermarks are imperceptible to humans but detectable for identification, helping to address AI safety issues.
How Does It Tick?: SynthID uses advanced deep learning models to embed imperceptible watermarks into AI-generated content, ensuring these watermarks do not compromise the quality of the original content. Here's how it works across different media types:
SynthID for AI-generated text: SynthID watermarks AI-generated text by adjusting the probability scores of tokens during generation. This technique embeds the watermark without compromising the quality, accuracy, or creativity of the content, and the watermark becomes more robust as the text length increases.
Video: SynthID adjusts the probability scores of tokens in generated text, creating a unique pattern (the "watermark") that becomes more robust and easier to detect as the text gets longer.
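SynthID's actual scheme is not public, but the core idea of nudging token probabilities with a keyed function can be illustrated with a toy sampler. The sketch below is purely conceptual and does not reflect Google's real implementation: it biases the logits of tokens selected by a secret-keyed hash, so watermarked text over-represents those tokens in a way a detector holding the same key can test for.

import hashlib
import numpy as np

SECRET_KEY = b"demo-key"  # assumption: any shared secret between embedder and detector

def greenlist(prev_token: int, vocab_size: int, fraction: float = 0.5) -> np.ndarray:
    # Derive a keyed pseudorandom subset of the vocabulary from the previous token.
    digest = hashlib.sha256(SECRET_KEY + prev_token.to_bytes(4, "big")).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "big"))
    return rng.random(vocab_size) < fraction

def watermarked_sample(logits: np.ndarray, prev_token: int, bias: float = 2.0) -> int:
    # Nudge the probability scores of "green" tokens before sampling the next token.
    boosted = logits + bias * greenlist(prev_token, logits.size)
    probs = np.exp(boosted - boosted.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(logits.size, p=probs))

def green_fraction(tokens: list[int], vocab_size: int) -> float:
    # A detector with the key checks whether green tokens are over-represented.
    hits = sum(greenlist(prev, vocab_size)[tok] for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

The longer the text, the more reliably a detector can distinguish this bias from chance, which mirrors the point above that the watermark strengthens as text length grows.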
SynthID for AI-generated music: SynthID watermarks AI-generated audio by converting audio waves into spectrograms, embedding the watermark, and then converting it back to the audio wave. This process ensures the watermark is inaudible to humans while remaining detectable, even after common modifications like noise addition or compression.
Video: SynthID adds an imperceptible digital watermark to AI-generated audio by embedding it into the spectrogram, ensuring robustness against common modifications and enabling detection to verify if parts were generated by Lyria.
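Again as a purely conceptual sketch (not Lyria's actual watermark), the spectrogram round-trip can be mimicked with a short-time Fourier transform: convert the waveform to a spectrogram, add a very low-amplitude keyed pattern, and convert back.

import numpy as np
from scipy.signal import stft, istft

def embed_toy_watermark(audio: np.ndarray, sample_rate: int, key: int = 1234,
                        strength: float = 1e-3) -> np.ndarray:
    # Convert the waveform to a spectrogram.
    _, _, spec = stft(audio, fs=sample_rate)
    # Add a tiny keyed pseudorandom pattern to the magnitudes (inaudible at this strength).
    rng = np.random.default_rng(key)
    pattern = rng.standard_normal(spec.shape)
    watermarked = (np.abs(spec) + strength * pattern) * np.exp(1j * np.angle(spec))
    # Convert the spectrogram back to a waveform.
    _, audio_out = istft(watermarked, fs=sample_rate)
    return audio_out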
SynthID for AI-generated images and video: SynthID embeds watermarks directly into the pixels of AI-generated images or each frame of a video. These watermarks are imperceptible to the human eye and remain detectable even after modifications such as cropping, adding filters, or compression.
Figure: The watermark remains detectable even after applying modifications such as adding filters, changing colors, and adjusting brightness.
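For images, the same principle plays out in the pixel domain. The toy sketch below spreads a faint keyed pattern across all pixels, and a detector correlating against the same pattern can still find it after mild edits; this is only an illustration of the idea, not SynthID's model-based watermark.

import numpy as np

def embed_pixel_watermark(image: np.ndarray, key: int = 42, strength: float = 1.5) -> np.ndarray:
    # Add a faint keyed pseudorandom pattern to every pixel (imperceptible at low strength).
    rng = np.random.default_rng(key)
    pattern = rng.standard_normal(image.shape)
    return np.clip(image.astype(np.float64) + strength * pattern, 0, 255).astype(np.uint8)

def detect_pixel_watermark(image: np.ndarray, key: int = 42) -> float:
    # Correlate the image against the keyed pattern; watermarked images score noticeably higher.
    rng = np.random.default_rng(key)
    pattern = rng.standard_normal(image.shape)
    centered = image.astype(np.float64) - image.mean()
    return float((centered * pattern).mean())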
Why Does It Matter?: Identifying AI-generated content is crucial for preventing misinformation and ensuring responsible use of AI. SynthID enhances transparency and trust, empowering users and organizations to work safely with AI-generated content.
---
Transform Your LLM Data Curation with NVIDIA’s NeMo Curator
TL;DR: NVIDIA introduces NeMo Curator, an open-source, GPU-accelerated framework designed to streamline and accelerate the data curation process for large language models (LLMs). Leverage its powerful modules to enhance dataset quality and scalability.
Figure: The diagram outlines a GPU-accelerated pipeline for data download, text extraction, language detection, document-level deduplication, quality filtering, content filtering, and dataset construction from various data sources.
What's the Essence?: NeMo Curator is a Python library aimed at efficient and scalable dataset preparation for LLMs. It utilizes GPUs with Dask and RAPIDS for speed and flexibility, offering customizable modules for tasks like text extraction, language identification, quality filtering, and PII redaction.
How Does It Tick?: At its core, NeMo Curator uses `DocumentDataset`, a Dask `DataFrame` wrapper, to manage datasets. Its key features include multilingual text processing, document deduplication, data classification, and more. Installation is simple via PyPI, source, or the NeMo Framework container.
Loading the dataset using the document builders
from nemo_curator.datasets import DocumentDataset
# define `files` to be a list of all the JSONL files to load
dataset = DocumentDataset.read_json(files, add_filename=True)
Text cleaning and unification
from nemo_curator import Sequential
from nemo_curator.modifiers import DocumentModifier
from nemo_curator.modifiers.unicode_reformatter import UnicodeReformatter
from nemo_curator.modules.modify import Modify


class QuotationUnifier(DocumentModifier):
    """Replaces curly quotation marks with straight ASCII quotes."""

    def modify_document(self, text: str) -> str:
        text = text.replace("‘", "'").replace("’", "'")
        text = text.replace("“", '"').replace("”", '"')
        return text


def clean_and_unify(dataset: DocumentDataset) -> DocumentDataset:
    cleaners = Sequential(
        [
            # Unify all the quotation marks
            Modify(QuotationUnifier()),
            # Unify all unicode
            Modify(UnicodeReformatter()),
        ]
    )
    return cleaners(dataset)
Data filtering
from nemo_curator import ScoreFilter, Sequential
from nemo_curator.filters import (
    DocumentFilter,
    RepeatingTopNGramsFilter,
    WordCountFilter,
)


class IncompleteStoryFilter(DocumentFilter):
    """Keeps only documents that end with a sentence-terminating character."""

    def __init__(self):
        super().__init__()
        self._story_terminators = {".", "!", "?", '"', "”"}

    def score_document(self, text: str) -> bool:
        # True if the document ends with a terminator, i.e. it looks complete.
        return text.strip()[-1] in self._story_terminators

    def keep_document(self, score) -> bool:
        return score


def filter_dataset(dataset: DocumentDataset) -> DocumentDataset:
    filters = Sequential(
        [
            ScoreFilter(
                WordCountFilter(min_words=80),
                text_field="text",
                score_field="word_count",
            ),
            ScoreFilter(IncompleteStoryFilter(), text_field="text"),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2),
                text_field="text",
            ),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
                text_field="text",
            ),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16),
                text_field="text",
            ),
        ]
    )
    return filters(dataset)
Deduplication
from nemo_curator.modules import ExactDuplicates


def dedupe(dataset: DocumentDataset) -> DocumentDataset:
    deduplicator = ExactDuplicates(id_field="id", text_field="text", hash_method="md5")
    # Find the duplicates
    duplicates = deduplicator(dataset)
    docs_to_remove = duplicates.df.map_partitions(
        lambda x: x[x._hashes.duplicated(keep="first")]
    )
    # Remove the duplicates using their IDs.
    duplicate_ids = list(docs_to_remove.compute().id)
    dataset_df = dataset.df
    deduped = dataset_df[~dataset_df.id.isin(duplicate_ids)]
    return DocumentDataset(deduped)
PII redaction
from nemo_curator.modifiers.pii_modifier import PiiModifier
from nemo_curator.modules.modify import Modify


def redact_pii(dataset: DocumentDataset) -> DocumentDataset:
    redactor = Modify(
        PiiModifier(
            supported_entities=["PERSON"],
            anonymize_action="replace",
            device="cpu",
        ),
    )
    return redactor(dataset)
Putting the curation pipeline together
curation_steps = Sequential(
    [
        clean_and_unify,
        filter_dataset,
        dedupe,
        redact_pii,
    ]
)

dataset = curation_steps(dataset)
print("Executing the pipeline...")
dataset = dataset.persist()
dataset.to_json("/output/path", write_to_filename=True)
Why Does It Matter?: Efficient data curation is crucial for training high-performance LLMs. NeMo Curator's GPU acceleration and modular approach significantly reduce time and resource costs, enabling faster model convergence and better performance. This makes it an invaluable tool for AI researchers and developers working with large datasets.
For a full tutorial, see: https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/distributed_data_classification/distributed_data_classification.ipynb
---
Polars Version 1: Major Updates and Improvements
TL;DR: Polars has released Version 1 with significant updates including stricter Series constructors, improved DataFrame orientation logic, consistent time zone handling, and more.
What's the Essence?: Polars Version 1 introduces stricter parameters for Series constructors, changes in DataFrame orientation inference, consistent time zone conversions, and better error handling. These updates enhance performance, efficiency, and accuracy.
How Does It Tick?: Polars Version 1 brings several key technical improvements that streamline data handling and boost performance. Here are some of the major changes:
Stricter Series Constructors: The `strict` parameter is now properly applied, ensuring data type consistency.
# Before
>>> s = pl.Series([1, 2, 3.5], strict=False, dtype=pl.Int8)
>>> s
shape: (3,)
Series: '' [i8]
[
1
2
null
]
# After
>>> s = pl.Series([1, 2, 3.5], strict=False, dtype=pl.Int8)
>>> s
shape: (3,)
Series: '' [i8]
[
1
2
3
]
DataFrame Orientation Logic: Orientation is now inferred from the dimensions of the data and the schema rather than from the data types, and a warning is emitted when row orientation is inferred.
# Before
>>> data = [[1, "a"], [2, "b"]]
>>> pl.DataFrame(data)
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════════╪══════════╡
│ 1 ┆ a │
│ 2 ┆ b │
└──────────┴──────────┘
# After
>>> pl.DataFrame(data, orient="row")
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ --- ┆ --- │
│ i64 ┆ str │
╞══════════╪══════════╡
│ 1 ┆ a │
│ 2 ┆ b │
└──────────┴──────────┘
Update reshape to return Array types instead of List types: The reshape function now returns an Array type, which is more efficient and maintains data integrity. This also shows up when constructing a Series from a multi-dimensional NumPy array, as in the example below.
# Before
>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [list[i64]]
[
[1, 2]
[3, 4]
]
# After
>>> import numpy as np
>>> arr = np.array([[1, 2], [3, 4]])
>>> pl.Series(arr)
shape: (2,)
Series: '' [array[i64, 2]]
[
[1, 2]
[3, 4]
]
Split replace functionality into two separate methods: The `replace` function is now split into `replace` and `replace_strict`, providing more precise control over data replacement operations.
# Before
>>> s = pl.Series([1, 2, 3])
>>> s.replace(1, "a")
shape: (3,)
Series: '' [str]
[
"a"
"2"
"3"
]
# After
>>> s.replace(1, "a")
Traceback (most recent call last):
...
polars.exceptions.InvalidOperationError: conversion from `str` to `i64` failed in column 'literal' for 1 out of 1 values: ["a"]
>>> s.replace_strict(1, "a", default=s)
shape: (3,)
Series: '' [str]
[
"a"
"2"
"3"
]
Why Does It Matter?: The updates in Polars Version 1 are significant for several reasons:
Improved Performance and Efficiency: By making the Series constructor stricter, Polars ensures that data type consistency is maintained, leading to more efficient data processing and reduced computational overhead. This means faster data operations and less memory usage, which is crucial for handling large datasets.
Enhanced Data Integrity: The changes in DataFrame orientation logic ensure that data is inferred correctly based on its structure, reducing errors and inconsistencies. This update simplifies data manipulation and provides clearer guidance through warnings when row orientation is inferred, ensuring that users are aware of potential edge cases.
Optimized Data Reshaping: The update to reshape functionality, returning Array types instead of List types, optimizes data storage and access. Arrays are more efficient and performant, which is beneficial for numerical and scientific computations where large multi-dimensional datasets are common.
More Versatile Data Replacement: By splitting the replace functionality into replace and replace_strict, Polars offers users more control and flexibility in handling data replacements. The replace method now maintains the existing data type, preventing unexpected type changes, while replace_strict allows for precise mapping with optional default values. This change addresses user confusion and enhances the robustness of data transformation processes.
Consistent Error Handling: The refinement of error types to more appropriate variants like InvalidOperationError and SchemaError provides clearer, more informative error messages. This improvement aids in debugging and ensures that developers can quickly identify and resolve issues in their data pipelines.
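As a small illustration of the new error surface (reusing the Series from the replace example above), an invalid replacement now raises `InvalidOperationError`, which can be caught explicitly:

import polars as pl

s = pl.Series([1, 2, 3])
try:
    # In Polars 1.x this raises instead of silently casting the whole Series to strings.
    s.replace(1, "a")
except pl.exceptions.InvalidOperationError as exc:
    print(f"replace failed: {exc}")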
Broader Implications: These enhancements collectively push Polars forward as a leading DataFrame library, emphasizing performance, accuracy, and user experience. For businesses and data professionals, these updates translate to more reliable data analytics, quicker insights, and the ability to handle complex data operations with ease.
By addressing these critical areas, Polars Version 1 sets a new standard for data processing libraries, making it an essential tool for modern data workflows.
For more: https://docs.pola.rs/releases/upgrade/1/
---
If you've read this far, you're amazing! 🌟 Keep striving for knowledge and continue learning! 📚✨