Introduction

As an experienced software engineer with a background in various programming languages but limited exposure to Python, I embarked on a journey into the world of generative AI about a year ago. With the buzz around technologies like Dall-E 2 and Stable Diffusion, my curiosity was piqued, leading me to get into the field with a toy project and a desire to learn.

This post reflects on my year-long experience, the lessons learned, the challenges faced, and the progress made. It’s a journey of continuous learning, and I’m here to share insights I wish I had a year ago, hoping they can guide others into this growing technology.

Generative AI, particularly Large Language Models (LLMs) and image generation models, has seen incredible advancements. LLMs like GPT-4 and Llama 2, and image generators like Stable Diffusion and Midjourney, have revolutionized how we interact with AI, pushing the boundaries of creativity and innovation.

1. The Beginnings

My journey began with the fast.ai courses, which laid the foundation for understanding the principles behind machine learning and AI. You can skip this step if your main interest is generative AI from a practical point of view. My first piece of advice would be to learn Python: most AI development is written in Python, and the most widely used libraries, such as Hugging Face’s transformers and diffusers, are written in Python too. These libraries are to generative AI what ImageMagick is to image processing: indispensable.
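To give a feel for how little code these libraries need, here is a minimal text-to-image sketch using diffusers. The model ID, prompt, settings, and output file name are placeholders chosen for illustration, and the function assumes torch and diffusers are installed and an NVIDIA GPU is available. The first run downloads several GB of model weights.

```python
from dataclasses import dataclass


@dataclass
class GenerationSettings:
    """Illustrative defaults; every value here is an assumption."""
    model_id: str = "CompVis/stable-diffusion-v1-4"  # assumed example checkpoint
    prompt: str = "a watercolor painting of a lighthouse at sunset"
    steps: int = 30
    guidance_scale: float = 7.5
    output_path: str = "lighthouse.png"


def generate_image(settings: GenerationSettings) -> str:
    """Generate one image and return the path it was saved to."""
    # Heavy imports live inside the function so this file can be imported
    # on a machine that does not have torch or diffusers installed.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        settings.model_id, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")  # assumes an NVIDIA GPU is available

    result = pipe(
        settings.prompt,
        num_inference_steps=settings.steps,
        guidance_scale=settings.guidance_scale,
    )
    image = result.images[0]  # a PIL image
    image.save(settings.output_path)
    return settings.output_path


# Usage (needs a GPU and downloads the model on first run):
# generate_image(GenerationSettings(prompt="an astronaut riding a horse"))
```

Keeping the settings in one small object makes it easy to try a different checkpoint or prompt without touching the generation code.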

It’s important to note that PyTorch is also a foundational building block in the AI ecosystem. Researchers use it to implement algorithms, and Hugging Face’s libraries are built on top of it. However, as an engineer getting into generative AI, you may not need to use PyTorch directly unless you wish to dive deeper. This lets engineers focus on application-level development without getting into the nitty-gritty of the underlying algorithms.

2. The Cloud and Beyond: Leveraging Resources

Not having a powerful GPU initially seemed like a barrier. However, platforms like Google Colab and Runpod give you access to virtual machines with GPUs. This accessibility allowed me to experiment with image generation models without investing in hardware. Running examples from Hugging Face in Google Colab or Runpod provided hands-on experience that demonstrated the potential of AI models to generate content on demand, without relying on external APIs.
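When you land on a fresh Colab or Runpod machine, a quick first step is to check that a GPU is actually visible to PyTorch. The helper below is a small sketch of that check; the function name is made up for this post, and it falls back to "cpu" when PyTorch is not installed at all.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what this machine offers."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so nothing here can use a GPU

    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU, the usual case on cloud GPU machines

    # Apple Silicon backend; older PyTorch builds may not have this attribute.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


device = pick_device()
print(f"Running on: {device}")
```

If this prints "cpu" on a machine that should have a GPU, the runtime type or the installed PyTorch build is usually the first thing to check.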

3. Deep Dive into Python and Open Source Models

At this point it was clear that learning Python was a necessity; its extensive ecosystem of libraries makes it an ideal choice for AI development. I focused mainly on using the Hugging Face libraries, but I also started reading the fundamental papers on Stable Diffusion and ControlNet, along with some of the newer developments. At the very least, this gave me enough understanding of the building blocks of each model, which is normally a composition of multiple models. For instance, a “simple” text-to-image model needs to understand the text, so there’s a text encoder model. Also, the UNet, which is the image generation block in Stable Diffusion, doesn’t work in pixel space but in latent space, so a VAE encoder/decoder is used to convert an image from pixels to latents and vice versa. It’s a very interesting topic to explore but, as I said, the Hugging Face libraries can be used in much the same way as the third-party APIs.
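To make the pixel space versus latent space point concrete, here is a small arithmetic sketch. It assumes the setup used by the original Stable Diffusion models, where the VAE shrinks each side of the image by a factor of 8 and produces 4 latent channels; other models may use different numbers, and the function names are made up for this example.

```python
def latent_shape(height: int, width: int,
                 downscale: int = 8, latent_channels: int = 4) -> tuple:
    """Shape (channels, height, width) of the latent the UNet works on."""
    if height % downscale or width % downscale:
        raise ValueError("image sides must be multiples of the downscale factor")
    return (latent_channels, height // downscale, width // downscale)


def compression_ratio(height: int, width: int,
                      downscale: int = 8, latent_channels: int = 4) -> float:
    """How many times fewer values the latent holds than the RGB image."""
    pixel_values = 3 * height * width  # three color channels
    c, h, w = latent_shape(height, width, downscale, latent_channels)
    return pixel_values / (c * h * w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

So for a 512x512 image the UNet handles 48 times fewer values than it would in pixel space, which is a big part of why these models fit on consumer GPUs. In diffusers, a loaded Stable Diffusion pipeline exposes these building blocks as attributes (text_encoder, unet, vae) if you want to inspect them.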

4. User Interfaces for Interactivity

Being able to use AI models from code, mostly through the Hugging Face libraries, is great. But to experiment with generating images and play around with different models and LoRAs, it’s easier to use a UI. The ones I have used the most are Automatic1111 and ComfyUI. ComfyUI, in particular, stood out with its node-based programming approach, allowing for the creation of complex workflows that can be automated and shared. There are other options too, such as Fooocus and Invoke.ai.
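As an illustration of the automation side, ComfyUI runs a local web server, and a workflow exported in its API format can be queued over HTTP. The sketch below assumes the server is on its default local address and that the workflow has a text-encoding node with id "6"; the address, node id, and the tiny workflow fragment are assumptions for the example, so adapt them to your own exported workflow.

```python
import json
from urllib import request

COMFYUI_URL = "http://127.0.0.1:8188/prompt"  # assumed default local address


def build_payload(workflow: dict, node_id: str, prompt_text: str) -> bytes:
    """Copy the workflow, overwrite one node's text input, encode as JSON."""
    patched = json.loads(json.dumps(workflow))  # simple deep copy
    patched[node_id]["inputs"]["text"] = prompt_text
    return json.dumps({"prompt": patched}).encode("utf-8")


def queue_workflow(payload: bytes, url: str = COMFYUI_URL) -> dict:
    """Send the workflow to a running ComfyUI server and return its JSON reply."""
    req = request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


# A made-up fragment of an exported workflow, just to show the structure.
example_workflow = {
    "6": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "placeholder", "clip": ["4", 1]}},
}
payload = build_payload(example_workflow, "6", "a cozy cabin in the snow")

# Usage (needs ComfyUI running locally and a complete workflow):
# queue_workflow(payload)
```

Building the payload separately from sending it makes it easy to loop over many prompts, or to inspect exactly what will be sent before queuing anything.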

5. Investing in GPUs

I was already hooked on playing around with the different models and trying to build different workflows, and getting annoyed with deformed hands, extra legs, and other shortcomings still present in AI image generation. But I wanted to be able to run everything locally, with dependencies ready and models downloaded, so I wouldn’t lose time setting things up each time I wanted to do some AI work. I got a 3090 Ti, as it seemed quite good value for money, and it’s enough for image generation, training LoRAs, or even DreamBooth. I think GB VRAM GPU is still good enough for image generation. LLMs are another story: even 24GB of VRAM is quite short, and the usual recommendation is to have two cards, or one of the 48GB or even 80GB cards. I have neither the budget nor the justification to have one of those locally. For a server running models commercially it’d be a different thing, but in that case I’d probably opt to rent them.
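A rough back-of-the-envelope calculation shows why 24GB runs out quickly with LLMs. The sketch below only counts the memory needed to hold the weights (parameters times bytes per parameter) and ignores activations, the KV cache, and framework overhead, so real usage is higher. The model sizes are generic examples, not specific recommendations.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return float(params_billion) * BYTES_PER_PARAM[precision]


VRAM_GB = 24  # a 3090 Ti

for size in (7, 13, 70):
    for precision in ("fp16", "int4"):
        needed = weights_gb(size, precision)
        verdict = "fits" if needed < VRAM_GB else "does not fit"
        print(f"{size}B @ {precision}: ~{needed:.1f} GB, {verdict} in {VRAM_GB} GB")
```

Under these assumptions a 7B model in fp16 needs about 14 GB for its weights, a 13B model already needs about 26 GB, and a 70B model needs about 35 GB even when quantized to 4 bits, which is why multiple cards or 48GB and 80GB cards come up so often.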

6. Using Docker for Deployment

After I was able to run models locally for text-to-image, image-to-image, inpainting, and so on, I wanted to deploy them and use them from an API or a background worker in the cloud. Docker simplifies the deployment of models by containerizing them, making it easier to manage dependencies and ensure consistency across environments. Whether it’s deploying a model as an API or setting up a background worker in the cloud, Docker encapsulates the complexity, allowing you to focus on coding rather than infrastructure. The transition from a local experimental setup to a scalable, cloud-based deployment underscores the practicality Docker offers to developers. You can use one of the available base images with CUDA and PyTorch installed, or one of the images provided by Hugging Face. The resulting images are quite large, especially if you include the AI model in them, but you can host them on Docker Hub.
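To show what "deploying a model as an API" can look like inside such a container, here is a minimal sketch of an HTTP service using only the Python standard library. The /generate endpoint, the port, and the run_model placeholder are all hypothetical; in a real image, run_model would call the actual pipeline, and you would likely pick a proper web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def run_model(prompt: str) -> dict:
    """Placeholder for the real model call (hypothetical)."""
    return {"prompt": prompt, "status": "queued"}


class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/generate":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length) or b"{}")
            prompt = body["prompt"]
        except (json.JSONDecodeError, KeyError, TypeError):
            self.send_error(400, "expected a JSON object with a 'prompt' field")
            return
        payload = json.dumps(run_model(prompt)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the logs quiet in this sketch


def make_server(host: str = "0.0.0.0", port: int = 8000) -> ThreadingHTTPServer:
    """Bind to all interfaces so the port can be published from the container."""
    return ThreadingHTTPServer((host, port), GenerateHandler)


# Usage, as the container's start command:
# make_server().serve_forever()
```

Binding to 0.0.0.0 rather than localhost matters here, because otherwise the service is not reachable from outside the container even when the port is published.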

7. Staying Updated

Staying engaged with the AI community through subreddits, arXiv.org, GitHub, and Hugging Face has become invaluable for keeping up with the latest advancements. Although I still feel like I have much to learn and am aware of my limitations, I’m enjoying witnessing the field’s growth, reminiscent of how I saw the web start and evolve. Starting a side project has brought me closer to launching. My initial ignorance has led me to this point, and I’m thoroughly enjoying the journey, which I find important. I plan to share some reference blog posts here, mainly to remind myself of what’s available. There are many great YouTube tutorials as well, which are useful for visualizing what you want to do before coding it or for understanding a certain workflow. However, sifting through YouTube content can be overwhelming when a blog post might suffice. Having a written reference is beneficial for revisiting and building upon ideas. I believe this might also be useful for others, so I’ll share it here.

Conclusion

I never imagined this field would captivate me so much, but I’m glad to have found a new hobby where I can spend time learning about what I believe will be this decade’s memorable tech breakthrough. Who knows!