🚨 New Article: Empowering Public Organizations: Preparing Your Data for the AI Era, with Yacine Jernite

🏛 Public organizations are authoritative sources for critical information: monitoring environmental conditions, tracking educational outcomes, documenting workforce trends, preserving cultural heritage, and managing public infrastructure. However, much of this data exists in formats that AI systems can't easily use: stored in PDFs, scattered across Excel files with inconsistent structures, and often organized in specialized formats designed for human consumption rather than machine learning.

📊 This means that the public data commons that powers models, especially models made by small developers who cannot afford millions of dollars in licensing fees, would benefit greatly if this data were made available in AI-consumable formats. Data quality is incredibly important for model performance and efficiency (we have written about this, too!), and this data is already public and free!

🤝 There are clear benefits to doing this: orgs get to enable technology that better serves communities, amplify the value of public data through collaboration, and maintain principled control over data use. Orgs like NASA - National Aeronautics and Space Administration, the National Library of Norway, the French Ministère de la Culture, The National Archives of Finland / Kansallisarkisto, and other public organizations are already on Hugging Face, releasing rich datasets and models that add to the public commons and enrich us all.

📝 So, we wrote a guide to help all public organizations do the same! We use the Massachusetts Data Hub as a case study for this article and convert four datasets!
We look at:
- MCAS education data 📚 (Excel files with different formats)
- Labor market reports 💼 (PowerPoint presentations)
- Occupational safety stats 🦺 (PDF reports)
- 2023 aerial imagery 🛰️ (JP2 image files)

For each of these datasets, we show why it was not AI-ready, walk through the steps we took to clean, standardize, and convert it into a Hugging Face dataset, and release both the code and the dataset on the Hub!

This was a really fun exercise, and we have some important takeaways for other public organizations that are looking to release data on Hugging Face to add to public knowledge and power better AI:

1️⃣ Identify Your Most Valuable Datasets: Which datasets is your organization the most authoritative source for, and which of them would bring the most value to your mission if released in AI-ready formats?

2️⃣ Determine Format Needs: Consider who will be using this data and for what purpose, and release the dataset in the form that works best for that downstream use.

3️⃣ Document Clearly: Rich documentation is important for downstream users and actually drives adoption! A study has shown that almost 90% of dataset downloads on the Hub have come from fully documented datasets.

We can't wait to see your datasets on 🤗!

[Article linked in comments]
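To give a feel for the "clean and standardize" step, here is a minimal sketch, not the code from the article. It assumes two spreadsheets that name the same fields differently and maps them onto one shared schema, written out as JSON Lines. The column aliases, the `normalize_row` and `to_jsonl` helpers, and the sample rows are all hypothetical and made up for illustration.

```python
import json

# Hypothetical aliases: each source file labels the same field differently.
COLUMN_ALIASES = {
    "district": "district",
    "district name": "district",
    "dist.": "district",
    "year": "year",
    "school year": "year",
    "avg score": "avg_score",
    "average scaled score": "avg_score",
}


def normalize_row(row):
    """Map one raw row (a dict) onto the shared schema, dropping unknown columns."""
    clean = {}
    for header, value in row.items():
        key = COLUMN_ALIASES.get(header.strip().lower())
        if key is None:
            continue  # column is not part of the shared schema
        clean[key] = value.strip() if isinstance(value, str) else value
    # Cast to consistent types so every record looks the same downstream.
    clean["year"] = int(clean["year"])
    clean["avg_score"] = float(clean["avg_score"])
    return clean


def to_jsonl(rows):
    """Serialize normalized rows as JSON Lines, one record per line."""
    return "\n".join(json.dumps(normalize_row(r), sort_keys=True) for r in rows)


# Made-up rows standing in for two differently formatted spreadsheets.
sheet_a = [{"District Name": " Boston ", "School Year": "2023", "Avg Score": "498.5"}]
sheet_b = [{"Dist.": "Worcester", "Year": 2022, "Average Scaled Score": 495, "Notes": "n/a"}]

jsonl = to_jsonl(sheet_a + sheet_b)
print(jsonl)
```

JSON Lines is used here only because it is a plain, line-oriented format that is easy to inspect and that common dataset tooling can read; CSV or Parquet would be equally reasonable choices depending on who will use the data.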