Using Dreambooth fine-tuning to create isometric views
- Stable Diffusion v1.5
- Dreambooth (JoePenna Repo)
- Automatic1111 WebUI
- Google Earth
- Google Earth Focal Lengths
Early on I was researching methods for generating custom isometric views of cityscapes using Stable Diffusion. One idea was to fine-tune a custom model on aerial photography, though it was tricky to find decent source images. I stumbled on a tutorial for capturing perfect isometric views from Google Earth using some free fixed focal length camera presets.
I used the technique and lens presets mentioned in the video to capture shots of dense Asian cities with a variety of urban architectural features, representing the aspects of modern cityscapes that I wanted the model to reproduce. The images were cropped to 512x512 and curated into a dataset of 8 images.
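For reference, the crop-and-resize step is simple to script. Here is a minimal sketch using Pillow; the folder names are hypothetical placeholders for wherever the Google Earth captures live:

```python
from pathlib import Path
from PIL import Image

SRC = Path("captures")   # raw Google Earth screenshots (hypothetical folder)
DST = Path("dataset")    # 512x512 training images
DST.mkdir(exist_ok=True)

for path in sorted(SRC.glob("*.png")):
    img = Image.open(path).convert("RGB")
    # Centre-crop to a square, then downscale to the 512x512
    # resolution expected by Stable Diffusion v1.x.
    side = min(img.size)
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img = img.resize((512, 512), Image.LANCZOS)
    img.save(DST / path.name)
```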
The dataset was used with JoePenna's Dreambooth repo running locally on an RTX 3090 Ti, since it requires 24GB of VRAM to train. I chose Stable Diffusion v1.5 as the base model and trained the dataset as a style, following my usual approach of using the artstyle class and 1500 artstyle regularisation images, since I wanted the aesthetic composition to be learned rather than the subject content. Training ran for 3000 steps with the default learning rate of 1e-06.
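The training invocation looked roughly like the sketch below, here wrapped in a Python subprocess call. The flag names follow one 2022 snapshot of JoePenna's repo and have changed between versions, and the paths and token are placeholders, so treat this as illustrative rather than exact:

```python
import subprocess

# Sketch of a JoePenna Dreambooth-Stable-Diffusion training run.
# Flag names differ across versions of the repo; check the README
# of your checkout. Paths and the token are placeholders.
subprocess.run([
    "python", "main.py",
    "--base", "configs/stable-diffusion/v1-finetune_unfrozen.yaml",
    "-t",
    "--actual_resume", "sd-v1-5.ckpt",     # Stable Diffusion v1.5 base model
    "--data_root", "dataset",              # the 8 cropped training images
    "--reg_data_root", "regularization",   # the 1500 artstyle reg images
    "--class_word", "artstyle",
    "--token", "mytoken",                  # hypothetical rare-token name
    "--max_training_steps", "3000",
    "--no-test",
], check=True)
```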
The first outputs generated from the model using the prompt:
A photo of <token artstyle>
already proved the concept, showing a variety of random cityscape locations with the isometric view featuring some of the architectural features but without repeating anything from the dataset.
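I generated these through the Automatic1111 WebUI, but for a scriptable equivalent the trained checkpoint can be loaded with the diffusers library. A minimal sketch, assuming a recent diffusers version; the checkpoint filename and token are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the Dreambooth-trained checkpoint (filename is a placeholder).
pipe = StableDiffusionPipeline.from_single_file(
    "dreambooth-isometric.ckpt", torch_dtype=torch.float16
).to("cuda")

# "mytoken" stands in for the rare token chosen at training time.
image = pipe("A photo of mytoken artstyle", num_inference_steps=30).images[0]
image.save("isometric_city.png")
```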
The quality of the images isn't great and the model has trouble generating clean lines, particularly for roads and road markings (most likely due to the source images), but the concept is well captured. It was also surprisingly good at producing novel locations not featured in the dataset, like prompting for medieval castles:
Medieval castle in <token artstyle>
The style was also flexible enough (with some prompt weighting) to generate entirely novel scene content whilst still retaining the appearance of aerial isometric images. Here are some examples generating images of Mars base locations and Steampunk cityscapes.
((Space colony buildings on mars)) in <token artstyle>
((Steampunk city concept art)) in <token artstyle>
One unwanted element it had learned from the dataset was a slightly muted color palette present in all generated outputs. Though this can be fixed in post, for another training attempt I would edit the source images to correct it beforehand.
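That pre-processing could be as simple as a contrast and saturation boost over the source images before training. A minimal sketch with Pillow; the enhancement factors are guesses to be tuned by eye, and the folder names are placeholders:

```python
from pathlib import Path
from PIL import Image, ImageEnhance

OUT = Path("dataset_adjusted")
OUT.mkdir(exist_ok=True)

for path in Path("dataset").glob("*.png"):
    img = Image.open(path).convert("RGB")
    img = ImageEnhance.Color(img).enhance(1.3)      # saturation boost, tune by eye
    img = ImageEnhance.Contrast(img).enhance(1.15)  # mild contrast lift
    img.save(OUT / path.name)
```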
I was struggling to get painterly outputs from the model until I started playing with the Prompt Editing feature of the Automatic1111 WebUI. By starting the generation in the trained style for the first few steps and then switching to oil painting or similar art modifiers, I got a much wider variety of output styles which still retained the aerial isometric look.
futuristic city street, metropolis, dystopian, in [<token artstyle>:vibrant oil painting:0.25]
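The `[A:B:0.25]` syntax tells the WebUI to sample with prompt A for the first 25% of steps and then switch to prompt B. Outside the WebUI, a rough equivalent can be sketched with a diffusers step-end callback that swaps the prompt embeddings partway through sampling. This assumes a recent diffusers version, and the checkpoint, token, and prompts are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_single_file(
    "dreambooth-isometric.ckpt", torch_dtype=torch.float16
).to("cuda")

def embed(text):
    # CLIP text embedding, padded to the model's max length.
    tokens = pipe.tokenizer(
        text, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]

steps = 30
switch_at = int(steps * 0.25)  # swap prompts after 25% of steps
base = "futuristic city street, metropolis, dystopian, in mytoken artstyle"
target = "futuristic city street, metropolis, dystopian, vibrant oil painting"

# With classifier-free guidance the denoising loop runs on
# [negative, positive] embeddings concatenated along the batch dim.
swapped = torch.cat([embed(""), embed(target)])

def swap(pipe, step, timestep, kwargs):
    if step >= switch_at:
        kwargs["prompt_embeds"] = swapped.to(kwargs["prompt_embeds"].dtype)
    return kwargs

image = pipe(
    base, num_inference_steps=steps,
    callback_on_step_end=swap,
    callback_on_step_end_tensor_inputs=["prompt_embeds"],
).images[0]
image.save("prompt_edited.png")
```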
I am quite happy with the results, which matched the training goal, and they showcase the power of fine-tuning custom models. I definitely feel there are areas for improvement in future iterations:
- Higher quality dataset images
- Pre-processing images to increase contrast and saturation
- More variety of dataset examples (other than cities)