There are different ways to answer that question, so I'll just pick one and start from there: "I'm just coming from a programmer perspective where you have some API into graphics and audio."
The kind of APIs you are used to often refer to themselves as "frameworks" to make a distinction from engines like Godot or Unity.
So what's the difference between frameworks and engines and what does that have to do with having to use objects?
The difference is that frameworks facilitate typical workflows like creating a window, loading a mesh or texture from various file formats, playing audio, etc. That helps a lot in getting assets on screen, but it doesn't do much to improve performance, because it basically just wraps the calls you would be making if you were using DirectX or other lower-level APIs.
Engines, on the other hand, improve performance by making assumptions about the kind of game you want to create. The more specific these assumptions are, the less viable an engine becomes for different kinds of games, but the easier it is to add performance-enhancing algorithms that rely on being able to expect a certain kind of scene.
That was a bit theoretical, so what does this mean for Godot?
Pretty much any mainstream engine today uses one such assumption, and that is that your game is made of "objects" (sometimes called GameObjects, Entities, Nodes, etc.).
That means that every element of your game is expected to have a position and a size. This is the most important assumption you can make, because it allows the engine developers to implement so-called "culling" mechanisms. Culling means that objects that aren't visible / can't be heard / can't affect physics / won't interest other players on the network are removed from the output stream that is sent to the graphics card / audio device / physics engine / network server.
So if everything is an object and everything has a position and a size, it's easy to detect whether or not the bounding box of an object intersects with the area that is visible from the camera. Anything that doesn't intersect is not relevant for the graphical output and is not sent to the graphics card.
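To make the bounding box test concrete, here is a minimal 2D sketch in plain Python. It is not Godot code: `Rect`, `GameObject` and `visible_objects` are names I made up for illustration, and a real engine tests against a 3D view frustum and uses spatial data structures instead of a flat list.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned rectangle: top-left corner (x, y) plus width and height."""
    x: float
    y: float
    w: float
    h: float

    def intersects(self, other: "Rect") -> bool:
        # Two rectangles overlap unless one lies entirely to one side of the other.
        return (
            self.x < other.x + other.w
            and other.x < self.x + self.w
            and self.y < other.y + other.h
            and other.y < self.y + self.h
        )


@dataclass
class GameObject:
    """Hypothetical game object: all the culling code needs is its bounds."""
    name: str
    bounds: Rect


def visible_objects(objects, camera_view):
    """Return only the objects whose bounds overlap the camera's view."""
    return [obj for obj in objects if obj.bounds.intersects(camera_view)]


camera = Rect(0, 0, 800, 600)
world = [
    GameObject("player", Rect(100, 100, 32, 32)),
    GameObject("far_away_tree", Rect(5000, 100, 64, 128)),
    GameObject("half_on_screen_rock", Rect(780, 300, 50, 50)),
]

# Only these would be handed to the graphics card.
drawn = [obj.name for obj in visible_objects(world, camera)]
print(drawn)  # ['player', 'half_on_screen_rock']
```

The point is that the check only works because every object exposes the same position and size information, whatever else it does.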
So that explains "objects", but why "nodes"?
Let's say in your game there is a ship. On the ship there are sailors. In your game the sailors can't leave the ship, so as long as you can't see the ship you don't need to check whether you can see any of the sailors. You can just assume that all of them are invisible too. That's how scene graphs work. Objects that are expected to move together are structured in a tree, so that the tree can be culled beginning with the root node and ending with the leaf nodes. If the root node of a branch is invisible (inaudible, not relevant for physics etc.), none of its child nodes need to be considered. (Disclaimer: it's a little more complicated than that, but the details won't help you understand the basic idea.)
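The ship example can be sketched like this, again in plain Python and not with Godot's API. The `Node` class, the one-dimensional bounds and the rule that a parent's bounds enclose all of its children are simplifying assumptions of mine; as the disclaimer says, real engines are more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node with bounds on a single axis (min_x..max_x).

    Assumption: a parent's bounds enclose the bounds of all its children.
    """
    name: str
    min_x: float
    max_x: float
    children: list = field(default_factory=list)


def collect_visible(node, view_min, view_max, visible, checked):
    """Walk the tree from the root, skipping whole branches that are out of view."""
    checked.append(node.name)  # record every node we had to test
    if node.max_x < view_min or node.min_x > view_max:
        return  # branch root is out of view, so its children are never tested
    visible.append(node.name)
    for child in node.children:
        collect_visible(child, view_min, view_max, visible, checked)


ship = Node("ship", 1000, 1100, [
    Node("sailor_1", 1010, 1012),
    Node("sailor_2", 1050, 1052),
])
island = Node("island", 50, 900, [
    Node("palm_tree", 60, 70),
    Node("hut", 850, 880),
])
world = Node("world", 0, 2000, [ship, island])

visible, checked = [], []
collect_visible(world, 0, 800, visible, checked)  # camera sees x from 0 to 800

print(visible)  # ['world', 'island', 'palm_tree']
print(checked)  # ['world', 'ship', 'island', 'palm_tree', 'hut']
```

Note that the sailors never show up in `checked`: once the ship fails the test, nobody looks at them at all.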
That explains "nodes", but why make a distinction between "nodes" and "scenes"?
The most intuitive way to write a program is to use the paradigm of "procedural programming". A procedure is a sequence of commands that alter the state of its environment, like a cooking recipe. A cooking recipe won't tell you to get a stove first, it will just assume that there is one available. It's not written in a way that makes sure that you can perform multiple recipes at the same time in the same kitchen. It just gives you a series of tasks that will result in there being a cake or something in your kitchen when you are done.
This is easy to understand, but it becomes problematic once you have multiple things going on. You can't always rely on the environment being in the state you need, because you can't know who touched a resource last and what state they left it in. That's why the concept of object orientation came up.
In object-oriented code you define objects (e.g. a car) and have your commands only access resources that are part of that object (e.g. the combustion process only uses the engine that is part of the car, fuel that is taken from the car's fuel tank etc.). That way you can easily make sure that two objects can perform their methods without interfering with each other.
Godot is an object-oriented engine and scenes are classes. The root node is the base class of the scene and the immediate children are member variables (I am leaving out scripts for now, which can add members as well).
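Here is a tiny sketch of that difference in plain Python. The names `environment`, `drive_procedural` and `Car` are made up for illustration; the first half shares one fuel tank between all callers, the second half gives every car its own.

```python
# Procedural style: the procedure changes a shared environment and simply
# assumes the fuel in it belongs to whoever is calling.
environment = {"fuel": 10}


def drive_procedural(distance):
    environment["fuel"] -= distance


drive_procedural(4)  # meant for "car A"
drive_procedural(3)  # meant for "car B", but it drains the same tank
print(environment["fuel"])  # 3


# Object-oriented style: each car only touches its own fuel tank.
class Car:
    def __init__(self, fuel):
        self.fuel = fuel

    def drive(self, distance):
        self.fuel -= distance


car_a = Car(10)
car_b = Car(10)
car_a.drive(4)
car_b.drive(3)
print(car_a.fuel, car_b.fuel)  # 6 7
```

In the procedural half neither "car" can know how much fuel it has left without knowing what the other one did. In the second half that question doesn't even come up.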
You can tell that scenes are classes because they
- can be instantiated
- can be exchanged as long as they have the necessary base class (e.g. you can exchange any sprite with any scene that has a sprite as a root node, i.e. polymorphism)
- can inherit from each other
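As an analogy only, here is how those three properties look with ordinary Python classes. None of this is Godot's API: `Sprite`, `Sailor`, `Captain` and `render` are hypothetical stand-ins for a built-in node type, two scenes and the code that uses them.

```python
class Sprite:
    """Stand-in for a built-in node type (not Godot's real Sprite class)."""

    def __init__(self, texture):
        self.texture = texture

    def draw(self):
        return f"draw {self.texture}"


class Sailor(Sprite):
    """A 'scene' whose root node is a Sprite; child nodes become members."""

    def __init__(self):
        super().__init__("sailor.png")
        self.hat = Sprite("hat.png")  # an immediate child of the root node

    def draw(self):
        return f"{super().draw()} + {self.hat.draw()}"


class Captain(Sailor):
    """A scene that inherits from another scene and swaps one child."""

    def __init__(self):
        super().__init__()
        self.hat = Sprite("captain_hat.png")


def render(sprites):
    # Polymorphism: anything with a Sprite at its root can be used here.
    return [s.draw() for s in sprites]


# Instantiation: the same scene can be created as often as needed.
crew = [Sprite("barrel.png"), Sailor(), Sailor(), Captain()]
frames = render(crew)
for line in frames:
    print(line)
```

`render` never needs to know whether it got a bare sprite or a whole scene, as long as the root is a `Sprite`.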
So they are called scenes because they are more than just nodes and more than just classes. They are classes that inherit from Node, so they combine the idea of encapsulation with that of common culling criteria.
If you think about it like that, the name "scene" does make sense, because it's a very unspecific term that describes any group of things that are adjacent and somehow belong to each other.