A new language model tool for web scraping and conversion

Plus, a plan to combat AI sexual abuse imagery

Published: Sep 16, 2024

Reading time: 3 min read

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

Hugging Face open-sources an LLM evaluation suite
Adobe announces its Firefly Video model
Meta researchers blend image diffusion with text transformers
NotebookLM can now generate synthetic podcasts

But first:

Using language models rather than rules-based tools to clean HTML

Jina AI released two small language models, reader-lm-0.5b and reader-lm-1.5b, that convert raw HTML into clean Markdown for web content extraction. Both models support a context length of 256,000 tokens and, according to Jina AI, outperform much larger language models at this task. Jina AI trained them on a combination of real-world data from the Jina Reader API and synthetic data generated by GPT-4, using contrastive search to curb degenerate, repetitive output and chunk-wise model forwarding to reduce memory use on long inputs. The company plans to make both models available on Azure Marketplace and AWS SageMaker; other uses fall under a non-commercial license. (Jina AI)
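For contrast with the model-based approach, here is a minimal sketch of the kind of rules-based converter the heading refers to. It is not Jina AI's code: the class name, the helper html_to_markdown, and the choice of tags to skip are all assumptions made for illustration, and it uses only the Python standard library.

```python
import re
from html.parser import HTMLParser


class RuleBasedMarkdown(HTMLParser):
    """Toy rules-based HTML-to-Markdown converter (illustrative, not Jina AI's code)."""

    SKIP = {"script", "style", "nav", "footer"}       # contents dropped entirely
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCK_END = {"p", "h1", "h2", "h3", "ul", "ol"}   # closing tag ends a block

    def __init__(self):
        super().__init__()
        self.out = []          # pieces of Markdown collected so far
        self.skip_depth = 0    # greater than 0 while inside a skipped tag
        self.href = None       # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in self.HEADINGS:
            self.out.append(self.HEADINGS[tag])
        elif tag == "li":
            self.out.append("- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "a":
            self.out.append(f"]({self.href or ''})")
            self.href = None
        elif tag == "li":
            self.out.append("\n")
        elif tag in self.BLOCK_END:
            self.out.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace, as a browser would when rendering.
            self.out.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Apply the hand-written rules above and tidy up blank blocks."""
    parser = RuleBasedMarkdown()
    parser.feed(html)
    parser.close()
    blocks = (b.strip() for b in "".join(parser.out).split("\n\n"))
    return "\n\n".join(b for b in blocks if b)


if __name__ == "__main__":
    page = (
        '<nav><a href="/">Home</a></nav><h1>Title</h1>'
        '<p>Read <a href="https://example.com">this</a> now.</p>'
    )
    print(html_to_markdown(page))
```

Each page layout that departs from these rules (tables, nested lists, inline scripts, unusual boilerplate) needs another hand-written case, which is the maintenance burden a learned converter is meant to avoid.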

AI companies and Big Tech move to block sexual abuse imagery

Major U.S. tech companies including Adobe, Anthropic, Cohere, Common Crawl, Microsoft, OpenAI, Cash App, Square, Google, GitHub, Meta, and Snap Inc. pledged to take action against AI-generated sexual abuse imagery. Commitments vary by company and include responsibly sourcing datasets, implementing safeguards, and improving reporting processes. The pledges follow a White House call to action and build on previous voluntary agreements to reduce risks from AI tools and to address the surge in non-consensual intimate images and child sexual abuse material. (The White House)

LightEval evaluation suite released under an MIT license

Hugging Face released LightEval, an open source evaluation suite that allows companies and researchers to assess large language models according to their specific needs. The tool integrates with Hugging Face’s existing libraries and supports evaluation across multiple devices, offering flexibility for various hardware environments. LightEval addresses the growing demand for more transparent and adaptable AI evaluation methods as models become more complex and more central to the work of users and developers. (GitHub)

Adobe will add video generation by the end of the year

Adobe introduced its new Firefly Video Model, which will power AI-driven features in video editing tools like Premiere Pro. Editors can use it to generate B-roll footage, extend existing video clips, create animations, and produce atmospheric elements from text prompts or reference images. The model supports text-to-video, image-to-video, and (in limited contexts) video-to-video generation. Adobe designed the model to be commercially safe, training it only on licensed content to protect creators’ rights and ensure it can be used in commercial contexts. The company announced that Firefly Video will be available in beta later this year, with a waitlist for users interested in early access. (Adobe)

Meta experiments with joint text and image “transfusion” models

Transfusion combines language modeling and diffusion techniques to train a single transformer on both text and image data. Researchers at Meta pretrained models with up to 7 billion parameters and found that Transfusion scales better than the more traditional approach of quantizing images into discrete tokens for a language model. This joint approach allows for efficient processing of mixed-modality data and produces competitive results in both text and image generation tasks. (Meta)
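As a rough sketch of what training on both objectives means, the combined loss can be written as a weighted sum of the two. The notation below follows the general form given in the Transfusion paper, but the reading of each symbol here is a simplification: L_LM stands for the next-token prediction loss on text tokens, L_DDPM for the diffusion (noise-prediction) loss on image patches, and λ for a coefficient that balances the two.

```latex
\mathcal{L}_{\text{Transfusion}} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{DDPM}}
```

Because both terms are computed from the same transformer, a single set of weights receives gradients from text and image data alike.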

NotebookLM can generate synthetic podcasts from your notes

Google introduced Audio Overview, a feature in NotebookLM that generates AI-hosted discussions based on uploaded documents. The tool creates a conversation between two AI hosts who summarize and discuss the content, offering users an audio alternative to reading. While promising, the podcast feature has limitations such as English-only output, potential inaccuracies, and longer generation times for large notebooks. (Google)

Still want to know more about what matters in AI right now?

Read last week’s issue of The Batch for in-depth analysis of news and research.

Last week, Andrew Ng discussed why science-fiction scenarios of AI’s emergent behavior are likely to remain fictional:

“While analogies between human and machine learning can be misleading, I think that just as a person’s ability to do math, to reason — or to deceive — grows gradually, so will AI’s. This means the capabilities of AI technology will grow gradually (although I wish we could achieve AGI overnight!), and the ability of AI to be used in harmful applications, too, will grow gradually.”

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: Waymo highlighted its safety record, arguing that its autonomous vehicles are safer than human drivers on the same roads; 2D-to-3D mesh generation is becoming widely accessible for industries like gaming and animation; Western powers signed a legally binding AI treaty to regulate the technology’s impact on democracy and human rights; and a new automated method was developed to balance unbalanced datasets scraped from the web.
