My version of SD is about a year old now, so who knows how many advances have been made since then, but it takes quite a while to get anything close to passable without a reference image. Well, unless you're lucky, that is.
That's because its text encoder is relatively simple and crude, and I won't say Midjourney is any better in this regard. The only model that made a huge step forward in understanding complex prompts is DALL-E (its third version).
Most of SD's unique capabilities come from its free and open-source nature, so there are a shitload of ControlNet plugins, LoRAs, and custom models.