How I Built a Media Cleanup Tool That Scans Products, Collections, Blog Posts, and Theme Settings

The Technical Challenge of Shopify Media Management
Building a media cleanup tool for Shopify sounds simple on paper: find which files aren’t used, delete them, done. But the reality is far more complex. Media files can be referenced in products, collections, blog posts, pages, theme settings, shop branding, and even in your store’s JSON templates.
Missing even one reference point means you might flag an active file as “unused,” leading to broken images on your store. Here’s how I built a comprehensive solution that checks everywhere.
Understanding Shopify’s Media Structure
Shopify’s media system is surprisingly complex. Files aren’t just “in use” or “not in use.” A single image might be:
the main image for a product, an alternate image in a product gallery, the featured image for a collection, referenced in a blog post body, used in a page’s content, embedded in theme settings like your logo or announcement bar, set as your shop’s branding (logo, square logo, or cover image), or referenced in theme JSON files with relative paths.
To build an accurate scanner, you need to check all these locations. Miss one, and you risk deleting files that are actually in use.
The Scanning Architecture
I built the scanner using a multi-phase approach. Each phase queries a different part of the Shopify ecosystem.
Phase One: Fetch All Media Files
First, we query Shopify’s GraphQL Admin API to retrieve every file in the store. This includes images, videos, documents, and generic files. For each file, we store its ID, filename, URL, size, MIME type, and alt text.
The challenge here is pagination. Large stores might have thousands of files, so we need to handle GraphQL cursor-based pagination efficiently.
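A minimal sketch of that pagination loop might look like the following. The names `collectAllFiles` and `fetchPage` are hypothetical, the query selects only a couple of fields for brevity, and the page size of 250 is an assumption about the API's per-request maximum. The HTTP client is injected so the loop itself stays easy to test.

```typescript
// Hypothetical query text; the selected fields are a minimal assumption.
const FILES_QUERY = `
  query Files($first: Int!, $after: String) {
    files(first: $first, after: $after) {
      edges { node { id alt } }
      pageInfo { hasNextPage endCursor }
    }
  }
`;

interface FileNode {
  id: string;
  alt: string | null;
}

// One page of results, already unwrapped from the GraphQL response.
interface FilesPage {
  nodes: FileNode[];
  hasNextPage: boolean;
  endCursor: string | null;
}

// Sends a single page request. Injected so the loop does not depend on a client.
type FetchPage = (
  query: string,
  variables: { first: number; after: string | null }
) => Promise<FilesPage>;

// Follows endCursor until the API reports there are no more pages.
async function collectAllFiles(fetchPage: FetchPage, pageSize = 250): Promise<FileNode[]> {
  const all: FileNode[] = [];
  let cursor: string | null = null;
  while (true) {
    const page: FilesPage = await fetchPage(FILES_QUERY, { first: pageSize, after: cursor });
    all.push(...page.nodes);
    if (!page.hasNextPage || page.endCursor === null) break;
    cursor = page.endCursor;
  }
  return all;
}
```

Passing the cursor back unchanged is the whole trick: the loop never computes offsets, it only forwards what the previous page returned.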
Phase Two: Product Media References
Products are the most obvious place to check. We query all products and their media attachments, storing which files are used as product images. But it’s not just about the product media field — we also check product descriptions for embedded images.
Some merchants embed images directly in product descriptions using HTML tags, so we parse the description HTML looking for image URLs that match our file list.
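As a rough illustration of the description parsing, the sketch below pulls `src` values out of `<img>` tags and compares their filenames against a set of known files. All function names here are hypothetical, and the regular expression is a simplification: a real HTML parser would be more robust against unusual markup.

```typescript
// Collects the src attribute of every <img> tag in an HTML fragment.
function extractImageUrls(html: string): string[] {
  const urls: string[] = [];
  const imgSrc = /<img\b[^>]*?\ssrc\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = imgSrc.exec(html)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Returns the last path segment of a URL, without query string or fragment.
function fileNameFromUrl(url: string): string {
  const path = url.split(/[?#]/)[0];
  return path.substring(path.lastIndexOf("/") + 1);
}

// Returns the known filenames that appear as <img> sources in the HTML.
function findReferencedFiles(html: string, knownFileNames: Set<string>): string[] {
  const found: string[] = [];
  extractImageUrls(html).forEach((url) => {
    const name = fileNameFromUrl(url);
    if (knownFileNames.has(name) && found.indexOf(name) === -1) {
      found.push(name);
    }
  });
  return found;
}
```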
Phase Three: Collection References
Collections can have featured images. We query all collections and check their image fields, mapping which files are used as collection headers or thumbnails.
Phase Four: Blog Post and Page References
Blog posts and pages can embed images in their content. We fetch all blog articles and pages, then parse their body HTML to find image references.
This is trickier than it sounds because Shopify uses CDN URLs that might look different from the original file URLs. We have to match both the file ID and the filename to catch all references.
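One way to reduce the URL mismatch is to normalize filenames before comparing them. The sketch below is an assumption about which decorations show up in CDN URLs (a `?v=` query string, size suffixes such as `_300x300` or `_large`, crop suffixes, and `@2x` markers); the suffix list and the function name are illustrative, not an exhaustive or official list.

```typescript
// Named size suffixes assumed to appear in older image URLs (illustrative list).
const NAMED_SIZES = ["pico", "icon", "thumb", "small", "compact", "medium", "large", "grande"];

// Reduces a CDN URL or filename to a lowercase base filename for comparison.
function normalizeCdnFileName(url: string): string {
  const path = url.split(/[?#]/)[0]; // drop ?v=... and fragments
  const name = path.substring(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  if (dot <= 0) return name.toLowerCase(); // no extension to preserve
  let stem = name.substring(0, dot);
  const ext = name.substring(dot);
  stem = stem.replace(/@[23]x$/i, ""); // pixel-density marker
  stem = stem.replace(/_crop_(top|center|bottom|left|right)$/i, ""); // crop suffix
  stem = stem.replace(/_(\d+x\d*|x\d+)$/i, ""); // explicit dimensions
  stem = stem.replace(new RegExp("_(" + NAMED_SIZES.join("|") + ")$", "i"), ""); // named size
  return (stem + ext).toLowerCase();
}
```

The normalization has to be applied to both the stored filename and the referenced URL. A file that is genuinely called `banner_large.jpg` would then collide with `banner.jpg`, which errs on the side of treating a file as used rather than deleting it.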
Phase Five: Theme Settings
This is where it gets really interesting. Shopify themes store settings in JSON files that can reference media. Common examples include logos, favicons, announcement bar images, and section-specific images.
We use the Asset API to fetch theme settings JSON files and parse them for media references. We look for both Shopify file IDs and filename patterns.
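To make the JSON parsing concrete, here is a small sketch that walks an already-parsed settings object and collects image references. It assumes image settings are stored as strings of the form `shopify://shop_images/<filename>`; the function name and the sample structure are hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Assumed prefix for image picker values in theme settings.
const SHOP_IMAGE_PREFIX = "shopify://shop_images/";

// Recursively visits every value and records filenames behind the prefix.
function collectThemeImageNames(value: Json, found: Set<string> = new Set<string>()): Set<string> {
  if (typeof value === "string") {
    if (value.indexOf(SHOP_IMAGE_PREFIX) === 0) {
      found.add(value.substring(SHOP_IMAGE_PREFIX.length));
    }
  } else if (Array.isArray(value)) {
    value.forEach((item) => collectThemeImageNames(item, found));
  } else if (value !== null && typeof value === "object") {
    const obj = value;
    Object.keys(obj).forEach((key) => collectThemeImageNames(obj[key], found));
  }
  return found;
}
```

Walking every value, rather than looking up known setting names, means the scan does not depend on how a particular theme names its sections and blocks. Some theme JSON files may begin with a comment header, which would need to be stripped before calling `JSON.parse`.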
Phase Six: Shop Branding
Your shop’s branding — the logo that appears on mobile, the cover image for social sharing — is stored separately from theme settings. We query the shop object specifically for these branding assets.
Phase Seven: Cross-Reference Everything
Now comes the hard part: determining if a file is actually used. We built a usage checker that takes a file ID and checks it against all the reference maps we created:
Is it in the product media map? Is its filename referenced in any product descriptions? Is it a collection image? Is it embedded in any blog posts or pages? Is it referenced in theme settings? Is it part of the shop’s branding?
Only if the answer to all these questions is “no” do we mark the file as unused.
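The cross-reference step can be sketched as a lookup across one set per location. The type and function names below are hypothetical, and each set is assumed to hold whichever keys that phase produced (file IDs, filenames, or both), so the check tries both.

```typescript
type UsageLocation =
  | "productMedia"
  | "productDescriptions"
  | "collections"
  | "blogPostsAndPages"
  | "themeSettings"
  | "branding";

// One set of reference keys per scanned location.
type ReferenceMaps = Record<UsageLocation, Set<string>>;

interface FileRef {
  id: string;
  fileName: string;
}

const USAGE_LOCATIONS: UsageLocation[] = [
  "productMedia",
  "productDescriptions",
  "collections",
  "blogPostsAndPages",
  "themeSettings",
  "branding",
];

// Lists every location that references the file by ID or by filename.
function findUsage(file: FileRef, refs: ReferenceMaps): UsageLocation[] {
  return USAGE_LOCATIONS.filter(
    (location) => refs[location].has(file.id) || refs[location].has(file.fileName)
  );
}

// A file is unused only when no location references it.
function isUnused(file: FileRef, refs: ReferenceMaps): boolean {
  return findUsage(file, refs).length === 0;
}
```

Returning the list of locations, not just a boolean, also gives the UI what it needs to show where a file is used.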
The Database Design
We use Prisma with SQLite for local development and can easily switch to PostgreSQL for production. The schema includes several key models:
ScanResult: Stores aggregate statistics for each shop — total files, unused files, wasted storage, missing alt text count, oversized files, and recently added files.
MediaFile: Stores individual file details with JSON arrays tracking exactly where each file is used (which products, collections, blog posts, pages, theme settings, and branding).
FileTypeStat: Breaks down file usage by type (images, videos, documents) with size and percentage calculations.
DeletedFile: When a file is deleted, we record its metadata here for thirty days, while the original file itself is backed up to Cloudflare R2 for safe restoration.
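To give a feel for how these models relate, here is a sketch of deriving FileTypeStat rows from MediaFile rows. The field names (`fileType`, `sizeBytes`, `percentOfStorage`) are assumptions for illustration and not the actual schema.

```typescript
type FileType = "image" | "video" | "document";

// Hypothetical subset of a MediaFile row.
interface MediaFileRow {
  id: string;
  fileType: FileType;
  sizeBytes: number;
}

// Hypothetical shape of a FileTypeStat row.
interface FileTypeStat {
  fileType: FileType;
  count: number;
  totalBytes: number;
  percentOfStorage: number; // rounded to one decimal place
}

// Groups files by type and computes each type's share of total storage.
function buildFileTypeStats(files: MediaFileRow[]): FileTypeStat[] {
  const totals = new Map<FileType, { count: number; totalBytes: number }>();
  let grandTotal = 0;
  files.forEach((file) => {
    const entry = totals.get(file.fileType) || { count: 0, totalBytes: 0 };
    entry.count += 1;
    entry.totalBytes += file.sizeBytes;
    totals.set(file.fileType, entry);
    grandTotal += file.sizeBytes;
  });
  const stats: FileTypeStat[] = [];
  totals.forEach((entry, fileType) => {
    stats.push({
      fileType,
      count: entry.count,
      totalBytes: entry.totalBytes,
      percentOfStorage:
        grandTotal === 0 ? 0 : Math.round((entry.totalBytes / grandTotal) * 1000) / 10,
    });
  });
  return stats;
}
```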
The Safe Delete Flow
The most critical feature is safe deletion. We never permanently delete files immediately. Here’s the flow:
When a user clicks delete, we first show a modal with detailed usage information. If the file is used in products or collections, the modal displays a warning with the specific product names and admin links. Next, we download the file from Shopify and upload it to Cloudflare R2 as a backup, and only then delete the file from Shopify. Finally, we store its metadata in the DeletedFile table with a thirty-day expiration, so the file can be restored at any time within that window.
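The ordering is the important part of this flow, and it can be sketched with the external services injected as plain functions. Everything named here (`safeDelete`, `SafeDeleteDeps`, the record fields, the backup key format) is hypothetical; the point is only that the Shopify delete never runs unless the backup upload has succeeded.

```typescript
interface DeletableFile {
  id: string;
  fileName: string;
  url: string;
  alt: string | null;
}

interface DeletedFileRecord {
  fileId: string;
  fileName: string;
  alt: string | null;
  backupKey: string;
  deletedAt: Date;
  expiresAt: Date;
}

// External operations, injected so the flow can be exercised without network access.
interface SafeDeleteDeps {
  download(url: string): Promise<Uint8Array>;
  uploadBackup(key: string, data: Uint8Array): Promise<void>;
  deleteFromShopify(fileId: string): Promise<void>;
  saveRecord(record: DeletedFileRecord): Promise<void>;
}

const RETENTION_DAYS = 30;
const DAY_MS = 24 * 60 * 60 * 1000;

async function safeDelete(
  file: DeletableFile,
  shop: string,
  deps: SafeDeleteDeps,
  now: Date = new Date()
): Promise<DeletedFileRecord> {
  const backupKey = shop + "/" + file.id + "/" + file.fileName;
  const data = await deps.download(file.url);
  // If the backup upload throws, we stop here and nothing is deleted.
  await deps.uploadBackup(backupKey, data);
  await deps.deleteFromShopify(file.id);
  const record: DeletedFileRecord = {
    fileId: file.id,
    fileName: file.fileName,
    alt: file.alt,
    backupKey,
    deletedAt: now,
    expiresAt: new Date(now.getTime() + RETENTION_DAYS * DAY_MS),
  };
  await deps.saveRecord(record);
  return record;
}
```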
The Restoration Challenge
Restoring a deleted file is technically complex. You can’t just “undelete” in Shopify — you have to re-upload it as a new file. Our restoration process downloads the file from Cloudflare R2, uploads it back to Shopify using staged uploads, creates a new file record with the original alt text, updates all statistics, and shows the user which products and collections they need to manually re-attach it to.
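A compact sketch of the restore path, again with hypothetical names, might look like this. The `uploadToShopify` dependency stands in for the staged upload plus file creation steps and is assumed to return the ID of the newly created file, which will differ from the original ID.

```typescript
interface BackupRecord {
  fileName: string;
  alt: string | null;
  backupKey: string;
  expiresAt: Date;
  usedIn: string[]; // names of products and collections that referenced the file
}

interface RestoreDeps {
  downloadBackup(key: string): Promise<Uint8Array>;
  // Assumed to wrap the staged upload and file creation; resolves to the new file ID.
  uploadToShopify(fileName: string, data: Uint8Array, alt: string | null): Promise<string>;
}

interface RestoreResult {
  newFileId: string;
  reattach: string[]; // places the merchant has to re-attach manually
}

async function restoreFile(
  record: BackupRecord,
  deps: RestoreDeps,
  now: Date = new Date()
): Promise<RestoreResult> {
  if (now.getTime() > record.expiresAt.getTime()) {
    throw new Error("Backup expired for " + record.fileName);
  }
  const data = await deps.downloadBackup(record.backupKey);
  const newFileId = await deps.uploadToShopify(record.fileName, data, record.alt);
  return { newFileId, reattach: record.usedIn };
}
```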
Performance Considerations
Scanning large stores is expensive in terms of API calls. Shopify has rate limits, so we implemented exponential backoff and parallel request batching where possible. A typical scan of a store with five hundred products and one thousand media files takes about two to three minutes.
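For illustration, exponential backoff can be sketched as a delay schedule plus a retry wrapper. The function names, the 500 ms base, the 8 second cap, and the five-attempt limit are all assumptions, and the sketch leaves out jitter. The `sleep` function is injected so the wrapper does not depend on a particular runtime's timer API.

```typescript
// Doubles the delay on each attempt, capped at maxMs. Attempt numbers start at 0.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

type Sleep = (ms: number) => Promise<void>;

// Runs the task, retrying with growing delays while shouldRetry approves the error.
async function withRetry<T>(
  task: () => Promise<T>,
  shouldRetry: (error: unknown) => boolean,
  sleep: Sleep,
  maxAttempts = 5
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (error) {
      // Give up once attempts are exhausted or the error is not retryable.
      if (attempt + 1 >= maxAttempts || !shouldRetry(error)) throw error;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

In a Node or browser environment, `sleep` could be as simple as `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`, and `shouldRetry` would typically return true only for throttling responses.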
We also cache scan results in the database so the dashboard loads instantly on subsequent visits.
Statistics and Insights
Beyond just finding unused files, we calculate several valuable metrics:
files without alt text (which hurts SEO), files over 1 MB (which slow down page loads), files added in the last thirty days (which you might want to keep), and a file type breakdown with storage consumption per category.
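These counts are cheap to compute once the file list is in memory. The sketch below uses hypothetical field names and treats 1 MB as 1024 × 1024 bytes, which is an assumption about how the threshold is defined.

```typescript
interface ScannedFile {
  id: string;
  alt: string | null;
  sizeBytes: number;
  createdAt: Date;
}

interface Insights {
  missingAltText: number;
  oversized: number;
  recentlyAdded: number;
}

const OVERSIZED_BYTES = 1024 * 1024; // assumes a binary megabyte
const RECENT_DAYS = 30;

// Counts files with empty alt text, files above the size limit, and recent uploads.
function computeInsights(files: ScannedFile[], now: Date = new Date()): Insights {
  const recentCutoff = now.getTime() - RECENT_DAYS * 24 * 60 * 60 * 1000;
  const insights: Insights = { missingAltText: 0, oversized: 0, recentlyAdded: 0 };
  files.forEach((file) => {
    if (file.alt === null || file.alt.trim() === "") insights.missingAltText += 1;
    if (file.sizeBytes > OVERSIZED_BYTES) insights.oversized += 1;
    if (file.createdAt.getTime() >= recentCutoff) insights.recentlyAdded += 1;
  });
  return insights;
}
```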
These insights help merchants prioritize what to clean up first.
The User Interface
We built the frontend with Remix and Shopify Polaris components for a native Shopify Admin feel. The dashboard shows colorful gradient cards with key metrics, a file type breakdown with progress bars, and a preview of recent files with usage indicators.
The files page includes advanced filtering by file type, usage status, alt text presence, and size; search by filename; sortable columns; and pagination for large libraries.
Lessons Learned
Building this tool taught me several valuable lessons:
Never assume you’ve found all file references: theme customizations can hide images in unexpected places. Always provide an undo mechanism, because merchants will accidentally delete important files. Performance matters, since slow scans frustrate users. Clear usage indicators prevent mistakes, so show exactly where each file is used before deletion. Back up everything: trust is earned by never losing customer data.
The Results
Merchants using the tool typically discover that thirty to forty percent of their media files are unused, recover fifty to five hundred megabytes of storage, identify hundreds of images missing alt text, and find dozens of oversized files that need optimization.
Open Questions
There are still challenges to solve. How do we detect files referenced in third-party apps? How do we handle files used in draft products? Should we auto-delete files after a certain period of non-use? These are questions I’m still working through.
Final Thoughts
Building a media cleanup tool for Shopify is more complex than it appears. The devil is in the details — specifically, in finding every possible location where a file might be referenced. But when done right, it provides immense value to merchants struggling with media bloat.
If you’re building Shopify apps, remember: your users trust you with their store data. Design for safety first, then optimize for speed and convenience.



