A Tour of Specialized AI Tools: Music, Video, Images, and More

By Alex Merced | June 1, 2026 | 20 min read | ai, artificial intelligence, machine learning, llm, productivity

The first three parts of this series covered general purpose AI assistants: the chatbots and writing tools that handle text based tasks. But AI in 2026 extends far beyond chat windows. A whole ecosystem of specialized tools creates original music, generates cinematic video, produces professional images, and designs presentations.

This is Part 4 of "Catching Up with Using AI for All Levels." If you are new here, start with Part 1 for the fundamentals and Part 2 for the free tools, then Part 3 for ChatGPT and Claude. This post covers the creative side: what the tools are, what they cost, how good the output actually is, and when they make sense for daily productivity rather than just artistic projects.

Return to Part 1: What AI Is and Isn't

Return to Part 2: Getting Started for Free

Return to Part 3: ChatGPT and Claude Deep Dive

Skip to Part 5: Going Advanced: Open Source, Local Models, and Agent Tools

Image Generation: The Mature Category

Image generation is the most mature of the creative AI categories. It started with DALL-E 2 in 2022 and has grown into a competitive market with multiple strong options at various price points. The quality has improved to the point where AI-generated images are used professionally in marketing, publishing, and product design.

Midjourney

Midjourney remains the gold standard for artistic quality. It operates through Discord, which is both its strength and its biggest friction point. You join the Midjourney Discord server, type your prompt in a channel, and the bot generates images in the thread.

The Discord interface has improved over time. The bot now supports private messaging, so you do not need to share your generations with strangers. The web gallery is well organized. But the overall experience still feels like a workaround for the lack of a native application.

Midjourney's output quality is excellent. The model understands composition, lighting, color theory, and artistic style better than any competitor. Its strength is producing images that look like professional photography or illustration work. If you need a photorealistic product shot in a specific lighting setup, Midjourney delivers. Its weakness is that it struggles with precise text rendering and specific brand requirements. Do not ask Midjourney to generate an image with a specific word or logo displayed clearly. It will get close but not exact.

Pricing starts at $10 per month for Basic (3 hours of GPU time, roughly 200 images) and goes up to $60 per month for Mega (60 hours of GPU time). The Standard plan at $30 per month is the sweet spot for regular users.
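To compare plans on a per-image basis, a quick back-of-the-envelope calculation helps. The sketch below uses only the Basic plan figures quoted above (3 hours of GPU time, roughly 200 images, $10 per month). The function name is illustrative, and real usage will vary with image settings and how many variations you request.

```python
def per_image_stats(monthly_price_usd, gpu_hours, approx_images):
    """Return (cost per image in USD, GPU minutes per image)."""
    cost_per_image = monthly_price_usd / approx_images
    gpu_minutes_per_image = gpu_hours * 60 / approx_images
    return cost_per_image, gpu_minutes_per_image

# Basic plan figures as quoted in this article; treat them as rough estimates.
cost, minutes = per_image_stats(monthly_price_usd=10, gpu_hours=3, approx_images=200)
print(f"~${cost:.2f} per image, ~{minutes:.1f} GPU minutes per image")
```

At these figures, each image costs about five cents and just under a minute of GPU time, which is a useful baseline when deciding whether a higher tier is worth it.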

DALL-E 4 (via ChatGPT)

DALL-E 4 is OpenAI's latest image generation model, available through ChatGPT Plus ($20/month). It has improved significantly over DALL-E 3, with better prompt adherence, more consistent anatomy, and improved text rendering. DALL-E 4 can render short words and phrases legibly in images, a longstanding weakness of earlier AI image generators.

DALL-E 4's main advantage is integration. Because it lives inside ChatGPT, you can iterate naturally. Generate an image, discuss it with the model, request changes, and generate the next version, all in one conversation. This tight feedback loop makes it the most efficient image generator for most workflows, even if Midjourney produces better standalone results.

For productivity, the ChatGPT integration is the killer feature. You can be drafting a presentation slide in ChatGPT, generate an accompanying image in the same conversation, and get suggestions for how to arrange both on the slide. No switching between tools. No copying prompts between applications.

Flux and Stability AI

Flux, created by Black Forest Labs (a team of former Stability AI researchers), has emerged as a strong open weights competitor. Flux Pro is available through Fireworks AI and other providers. It competes with Midjourney on quality while being accessible through developer friendly APIs. Flux is available in several variants: Flux Pro for highest quality, Flux Dev for faster generation, and Flux Schnell for rapid prototyping.

The open weights nature of Flux means you can run it on your own hardware or through any provider that hosts it. This flexibility makes it popular with developers who want to integrate image generation into their own applications without per image API fees from a single vendor.

Stability AI continues to develop Stable Diffusion, the open source image generation model. Stable Diffusion 4 is available in 2026 with strong quality and the advantage of running locally on consumer GPUs. For users who want privacy, offline access, and no subscription fees, Stable Diffusion remains the best option. The tradeoff is that running it locally requires a capable GPU and some technical setup.

Nano Banana

Nano Banana has gained attention in 2026 as a new contender. It produces high-quality images with a simple interface and competitive pricing. The Pro version includes upscaling, inpainting, and style transfer. It is worth trying alongside Midjourney and DALL-E to see which style fits your needs. Nano Banana's strength is its ease of use for non-technical users who want good results without learning complex prompt engineering.

Who Should Use Image Generation

The most practical productivity use is creating visuals for presentations, social media, and internal documents. Instead of spending 30 minutes searching stock photo sites for the right image, you generate exactly what you need in 30 seconds.

Business use cases include: product mockups for proposals, custom illustrations for blog posts, branded social media graphics, concept visualizations for client presentations, and placeholder images for website designs.

The quality is good enough for professional use in most contexts, but you should still use real photography for anything that represents an actual product, person, or location. AI generated images have subtle tells that trained eyes notice.

Video Generation: The Fastest Evolving Category

Video generation has progressed faster than any other AI category in the last 18 months. We went from short, glitchy clips to multi minute videos with consistent characters, coherent motion, and usable quality.

Veo 3.1 (Google)

Veo 3.1 is widely considered the best overall AI video generator in 2026. It produces high resolution videos with strong prompt adherence, consistent character appearance across frames, and minimal artifacts. The improvement from Veo 2 to Veo 3.1 was dramatic, with much better motion coherence and fewer of the warping artifacts that plagued earlier video models.

Veo is available through Google AI Studio with a free tier that lets you generate short clips for testing. Full access requires a Google AI Premium ($20/month) or AI Ultra ($100/month) subscription. The free tier is generous enough for experimentation and light use, making Veo the most accessible high quality video generator.

Veo excels at text to video: describe a scene and get a video clip. It also supports image to video: upload an image and animate it. The image to video feature is particularly useful for creating short animations from still photographs or illustrations. You can take a product photo and generate a slow orbit around it, turning a static image into a dynamic product showcase.

Veo's main limitation is creative control. You describe what you want and accept what you get. There is no way to fine tune the motion, adjust camera angles, or edit specific frames. For one shot generation where speed matters more than precision, Veo is the best choice.

For productivity, Veo is useful for creating short explainer videos, social media content, and presentation clips. A 15 second product demo video that used to take a day to produce can now be generated in minutes. The quality is good enough for social media and internal use, though not yet at broadcast quality.

Runway Gen 4

Runway is the veteran of AI video generation, having launched Gen 1 in early 2023. Gen 4, released in late 2025, offers the most creative control of any video generator. It includes features like Motion Brush (paint movement onto specific areas of an image), Act One (transfer facial expressions from a reference video), and inpainting (edit specific areas of a generated video).

The Motion Brush is Runway's standout feature. You upload an image, paint a brush stroke across the area you want to animate, and define the direction and speed of movement. Want smoke rising from a chimney in a still photo? Paint the chimney area and set the motion upward. Want water flowing in a river? Paint the river surface and set the flow direction. This level of granular control is unique to Runway.

Runway's pricing starts at $15 per month for the Standard plan with limited credits (enough for roughly 50 short generations). The Pro plan at $35 per month gives more credits and higher resolution output. For heavy use, the Unlimited plan at $95 per month removes credit limits. The Pro and Unlimited plans cost more than Veo's bundled subscription, but the additional control justifies the premium for professional use.

Runway is the best choice when you need precise control over the output. If you need a specific camera movement, a particular character action, or an edit to an existing generation, Runway gives you the tools to iterate. Veo is better for one shot generation where you describe what you want and accept the result.

Kling AI

Kling, developed by Kuaishou (the company behind a major Chinese video platform), has emerged as a strong competitor. It offers high quality video generation at competitive prices, with particularly good results for character animation and cinematic shots.

Kling uses a credit system with free trial credits and paid packs starting around $10. The quality is comparable to Veo and Runway for many use cases, though it lags slightly on text rendering and complex scene composition.

Who Should Use Video Generation

Video generation is still more of a content creation tool than a daily productivity tool for most people. The practical use cases are concentrated in marketing, content creation, education, and internal communication.

A non-obvious productivity use: creating quick tutorial videos for your team. Instead of writing a 3-page document explaining how to use a new process, generate a 60-second video walkthrough. A short video is easier to consume, and your team is more likely to watch it than to read a document.

The current limitations are real. Videos longer than 30 seconds still struggle with consistency. Characters in the first frame may change appearance by the tenth frame. Complex action sequences produce artifacts. Text rendering in video is unreliable. You should budget time for multiple attempts and manual editing to get a usable result.

Music Generation: From Novelty to Useful

AI music generation has come into its own in 2026. The tools produce genuinely listenable songs with vocals, multiple instruments, and coherent structure.

Suno

Suno is the leading AI music generator. It generates complete songs with lyrics, vocals, and instrumentation from a text prompt. You describe the genre, mood, and subject, and Suno produces a full track with verses, choruses, and a bridge.

Suno's free tier gives you a limited number of generations per day, enough for experimentation. The Pro plan at $10 per month gives 500 generations and commercial usage rights. The Premier plan at $30 per month gives 2,000 generations and priority processing.
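One way to compare the two paid plans is cost per generation. The sketch below uses only the prices and generation counts quoted above. The dictionary layout and function name are illustrative, not anything Suno provides.

```python
# Plan figures as quoted in this article (USD per month, generations included).
plans = {
    "Pro": {"price": 10, "generations": 500},
    "Premier": {"price": 30, "generations": 2000},
}

def cost_per_generation(plan):
    """Price divided by included generations, assuming the full quota is used."""
    return plan["price"] / plan["generations"]

per_gen = {name: cost_per_generation(plan) for name, plan in plans.items()}
for name, cost in per_gen.items():
    print(f"{name}: ${cost:.3f} per generation")
```

Premier works out cheaper per generation, but only if you actually use most of the quota. For occasional use, the lower monthly price of Pro is likely the better deal.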

The output quality varies by genre. Pop, rock, electronic, and hip hop work well. Classical and jazz are less convincing. The vocals sound synthetic on close listening but pass for casual listening in the background. The instrumental quality is generally strong.

The most practical productivity use for Suno is creating custom background music for videos, presentations, and internal content. Instead of searching royalty free music libraries for the right track, you generate a track that matches your specific needs: "Upbeat electronic background music, 120 BPM, no vocals, suitable for a tech product demo."

Udio

Udio is Suno's primary competitor. It produces comparable quality with a slightly different emphasis. Udio excels at creative exploration: you can remix existing songs, extend specific sections, and edit the structure more granularly than Suno allows.

Udio's pricing is similar to Suno's, with a free tier and paid plans starting at $10 per month. The choice between Suno and Udio comes down to personal preference for the output style and the editing workflow you prefer.

AIVA

AIVA specializes in orchestral and cinematic music. If you need a string quartet arrangement, a film score style piece, or ambient orchestral background music, AIVA produces the most convincing results in this niche.

AIVA has a free tier for limited generations. Paid plans start at $15 per month for higher quality output and commercial rights. It is less versatile than Suno or Udio but better within its niche.

Who Should Use Music Generation

Music generation is the most situational of the creative AI tools. If you create any kind of video content, presentations, podcasts, or social media posts, generating custom background music saves time and avoids copyright issues.

The hidden productivity use is inspiration and mood setting. Generate a few short musical pieces for a creative project and use them as background while you work. The music sets a tone that helps you get into the right mental state for the task.

Presentation and Design Tools

Beyond the obvious image, video, and music generators, a category of AI tools focuses specifically on business productivity tasks.

Gamma

Gamma creates presentations, documents, and web pages from a single prompt. You describe what you need, and Gamma generates a complete deck with text, images, and layout. The output is good enough for internal presentations and early stage client work.

Gamma's free tier allows a limited number of generations. The Pro plan at $16 per month removes most limits and adds higher resolution exports.

The productivity gain is significant for anyone who regularly creates presentations. A deck that takes two hours to build manually takes five minutes with Gamma. The tradeoff is that the output looks like an AI generated deck: competent but generic. Gamma is best for first drafts that you then customize with your own branding and specific content.

Beautiful.ai

Beautiful.ai predates the current generative AI wave. It uses AI for layout and design recommendations rather than content generation. You add text and images manually, and the AI arranges them into professional-looking slides.

Beautiful.ai complements Gamma well. Use Gamma to generate the first draft, then import it into Beautiful.ai for layout refinement. The combination covers both content generation and visual polish.

Canva AI

Canva has integrated AI features across its entire platform. Magic Design generates complete designs from text prompts. Magic Eraser removes unwanted objects from images. Magic Expand extends image boundaries. Magic Write generates and edits text.

Canva's AI features are available on the free tier with usage limits. The Pro plan at $13 per month removes most limits and adds brand kits, background removal, and premium templates.

Canva AI is the most practical choice for non designers who need to create visual content regularly. The learning curve is minimal, the output quality is good, and the integrations with social media platforms streamline publishing.

Audio: Transcription, Voice, and Sound

Descript

Descript is an audio and video editor with AI transcription at its core. You upload an audio or video file, Descript transcribes it, and you edit the media by editing the text. Delete a sentence from the transcript and the corresponding audio is removed. Change a word and Descript regenerates the audio in the original speaker's voice.

The workflow is simple. Import a recording of a meeting, podcast, or voiceover. Descript transcribes everything automatically, usually within a few minutes for a one hour recording. You see the full transcript with speaker labels and timestamps. Edit the transcript as you would edit a document: delete filler words, reorder sections, fix mispronounced terms. The audio and video update automatically to match your text edits.

Descript also includes AI voice generation, noise reduction (Studio Sound), and filler word removal. The Studio Sound feature analyzes your recording and removes background noise, echo, and room tone. It is good enough to make a recording from a noisy coffee shop sound like it was recorded in a treated studio.

The free tier covers basic transcription with limited exports. The Pro plan at $24 per month adds screen recording, unlimited transcription, and Studio Sound. The Business plan at $40 per month adds team features and brand voices.

For productivity, Descript is invaluable for anyone who creates audio or video content. Editing a spoken recording by editing text is dramatically faster than working with audio waveforms. The filler word removal alone saves 20 minutes per hour of recording. For meeting recordings, Descript generates searchable transcripts that let you find any topic discussed in a one hour meeting within seconds.

ElevenLabs

ElevenLabs is the leading AI voice generation platform. It produces the most natural sounding synthetic voices available, with accurate emotion, pacing, and emphasis. The voice cloning feature lets you create a digital copy of your own voice from a short recording, as little as 30 seconds of audio.

The quality has improved to the point where short AI generated voice clips are difficult to distinguish from human recordings. Longer passages still have subtle tells: slightly unnatural pacing, odd emphasis on certain words, and a lack of breath sounds at natural intervals. But for most practical purposes, the quality is sufficient.

ElevenLabs pricing starts at $5 per month for the Starter plan with limited characters (roughly 30 minutes of generated speech). The Creator plan at $22 per month is suitable for regular use with longer character limits. The Pro plan at $99 per month is for high volume commercial use.

Productivity use cases include: generating voiceovers for videos and presentations, creating audio versions of written content (your blog posts, newsletters, internal memos), adding narration to tutorials and training materials, and producing multilingual versions of existing audio content. ElevenLabs supports 29 languages with good quality across most of them.

A Note on Ethics

Voice cloning raises obvious ethical concerns. You should only clone a voice with the person's explicit consent. Using ElevenLabs to impersonate someone without permission is not just unethical; it could also be illegal in some jurisdictions. The platform has safety measures in place, including voice authentication and content moderation, but the responsibility ultimately rests with the user.

Adobe Firefly and the Creative Suite Integration

Adobe has integrated AI into its Creative Cloud suite through Firefly, its generative AI engine. Photoshop includes Generative Fill and Generative Expand, which let you add or remove elements from an image with text prompts. Illustrator has Generative Recolor and text to vector graphics. Premiere Pro includes text based editing similar to Descript.

Firefly is notable because it is trained on Adobe Stock images and openly licensed content, which means the output is cleared for commercial use. If you work in marketing, publishing, or any context where copyright ownership matters, Firefly's training data provenance gives it an advantage over models trained on scraped internet data.

Firefly is included in existing Creative Cloud subscriptions. Photoshop users with a subscription get a certain number of generative credits per month. Additional credits are available for purchase.

Putting It All Together: A Creative AI Workflow

Here is how these tools work together for a real project.

You need to create a product launch video for a new software feature. Start with Suno to generate background music: "Upbeat electronic track, 90 BPM, no vocals, 60 seconds with a clear crescendo at the 45 second mark." Download the track.

Use Midjourney to generate key visual frames: "A person using a laptop with a glowing screen, clean modern office, cinematic lighting, photorealistic." Select the best images.

Upload the images to Veo or Runway. Generate short animated clips from each image: "Camera slowly zooming in on the screen." Combine the clips.

Use ElevenLabs to generate a voiceover from your script. Import the voiceover, music, and video clips into Descript. Edit by editing the transcript. Fine tune the timing. Export the final video.

The entire workflow takes two to three hours for a 60 second launch video. The same project with traditional tools would take a full day or more, depending on your skill level with each medium.

Internal Training Videos

Your company needs a short training video explaining a new expense reporting process. Start with Gamma to generate a presentation deck with the key steps. Use ElevenLabs to generate a voiceover from the deck text. Use Suno to generate background music. Use Runway's image to video feature to animate any static diagrams. Combine everything in Descript for final editing. Total time: two hours for a five minute training video that would have taken a day and a half with traditional tools.

Social Media Content Calendar

You manage social media for a small business. Each week you need 5 images, 5 captions, and maybe one short video. Use Midjourney or DALL-E to generate consistent branded images. Set a style reference in your prompts to keep visual consistency across posts. Use ChatGPT to draft captions in your brand voice. Use Veo to generate one 10-second video clip per week showcasing a product or service. Use Canva AI to arrange everything into the correct dimensions for each platform. The weekly content that used to take 4 hours now takes 45 minutes.
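To see what that weekly saving adds up to, here is a small sketch that annualizes it. The before and after times come from the example above. The helper name and the assumption that you publish all 52 weeks of the year are illustrative.

```python
def hours_saved(before_minutes, after_minutes, repetitions=1):
    """Hours saved when a task drops from before_minutes to after_minutes."""
    return (before_minutes - after_minutes) * repetitions / 60

# Figures from the example above: 4 hours of weekly work reduced to 45 minutes.
weekly = hours_saved(before_minutes=4 * 60, after_minutes=45)
# Assumption: content is produced in all 52 weeks of the year.
yearly = hours_saved(before_minutes=4 * 60, after_minutes=45, repetitions=52)
print(f"Saved per week: {weekly:.2f} h, per year: {yearly:.0f} h")
```

Swap in your own before and after times to check whether a workflow change is worth the effort.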

Client Proposal with Visuals

You are preparing a client proposal. Write the content in ChatGPT or Claude. Generate relevant diagrams and concept images in DALL-E or Midjourney. If the proposal involves a physical product, generate a short Veo animation showing the product from multiple angles. Combine everything in Canva or Gamma for the final presentation. The result is a professional, visually rich proposal that looks like it took days to produce, completed in a few hours.

Personal Photo Projects

For personal use, the free tiers of these tools cover most needs. Edit family photos with Photoshop's Generative Fill to remove photobombers or improve composition. Use Google Photos AI to organize and search your library. Use Suno's free tier to generate a custom song for a friend's birthday. Use Canva AI to design invitations, cards, and social media posts for personal events.

The Real Productivity Question

The question to ask about any specialized AI tool is not "Can it generate what I need?" It is "Does it save me more time than it costs?"

The cost is not just the subscription price. It is the time spent learning the tool, the time spent iterating on prompts to get the output you want, and the time spent fixing problems that the AI introduced.
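That tradeoff can be turned into a rough calculation. The sketch below is one possible way to frame it, not a definitive model. The Gamma price ($16 per month) and the "two hours down to five minutes" figure come from earlier in this post. The hourly rate, number of decks per month, learning time, and per-deck cleanup time are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class ToolCost:
    subscription_usd: float   # monthly subscription price
    learning_hours: float     # time spent learning the tool this month (assumption)
    cleanup_minutes: float    # prompt iteration and fixing per task (assumption)

def net_monthly_value(tool, minutes_saved_per_task, tasks_per_month, hourly_rate_usd):
    """Value of time saved minus the time and money the tool costs, in USD."""
    gross_hours = minutes_saved_per_task * tasks_per_month / 60
    overhead_hours = tool.learning_hours + tool.cleanup_minutes * tasks_per_month / 60
    return (gross_hours - overhead_hours) * hourly_rate_usd - tool.subscription_usd

# Gamma figures from this article: $16/month, a 120-minute deck done in 5 minutes.
# Learning time, cleanup time, deck count, and hourly rate are hypothetical.
gamma = ToolCost(subscription_usd=16, learning_hours=2, cleanup_minutes=30)
value = net_monthly_value(gamma, minutes_saved_per_task=120 - 5,
                          tasks_per_month=4, hourly_rate_usd=40)
one_deck = net_monthly_value(gamma, minutes_saved_per_task=120 - 5,
                             tasks_per_month=1, hourly_rate_usd=40)
print(f"Four decks a month: {value:+.2f} USD, one deck a month: {one_deck:+.2f} USD")
```

Under these assumptions the tool pays off at four decks a month but not at one, which is the point of the question: the answer depends on how often you do the task, not only on how impressive the output is.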

For a business user who creates presentations and social media graphics regularly, Canva AI and Gamma are clear wins. The time saved per task is dramatic, and the learning curve is shallow.

For a content creator who publishes weekly videos, Descript and Runway are worth the investment. The production speed increase pays for the subscriptions many times over.

For someone who generates music or images recreationally, the free tiers are generous enough that you can explore without commitment. Pay only when you hit the free limits and find yourself wishing for more.

Part 5 of this series covers the most advanced territory: open source models that run on your own computer, agent frameworks like Hermes Agent, and coding tools like OpenCode. These tools require more setup but offer privacy, offline access, and capabilities that cloud services cannot match.