Spatial structure of image data:
This refers to the way that the parts of an image are arranged and related to each other. For example, in a picture of a cat, the ears are at the top, the eyes are below the ears, the nose is below the eyes, and so forth. That’s a spatial structure. The relationship of these features to each other, and their arrangement in space, is crucial to recognizing that the image is of a cat.
How CNNs use this: CNNs are designed to exploit this spatial structure. Each convolutional layer slides small filters over the image, looking at one small patch at a time (for instance, a square of 3×3 pixels), and the same filter is reused at every position, so a pattern can be detected wherever it appears. Stacking such layers lets the network start by identifying simple patterns, like lines and edges, then combine these to recognize more complex shapes (like an eye), and then combine those shapes to recognize the cat.
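The "small pieces at a time" idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the 6×6 toy image, the hand-picked edge filter, and the helper name `conv2d_valid` are all made up for this example, and a real CNN learns its filter values during training rather than having them written by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Look only at one small patch, using the same kernel at every position.
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy 6x6 image (assumption): dark left half (0), bright right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked vertical-edge filter (assumption; a trained CNN would learn this).
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

response = conv2d_valid(image, edge_kernel)
print(response)
```

The response map is nonzero only in the columns where a patch straddles the dark-to-bright boundary, which is the sense in which a filter "finds an edge" at a particular location.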
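How FCNs differ: A Fully Connected Network (FCN), on the other hand, has no built-in notion of where things are in the image. (The abbreviation FCN is also widely used for “fully convolutional network”; here it means fully connected.) It flattens the image into one long list of pixel values and gives every pixel its own separate weight into every neuron, so nothing in the architecture says that two pixels are neighbors. If you were to shuffle the pixels of every image with the same fixed permutation before training, a CNN would lose the local patterns it depends on, whereas an FCN could learn just as well as before, because to it the shuffled dataset is effectively the “same” problem. (An FCN that has already been trained would still give different outputs on shuffled inputs.) As such, FCNs are usually not as good as CNNs at recognizing spatial patterns in image data. They’re more suited to tasks where the location of features in the input data doesn’t matter as much.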
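The shuffling argument can be checked on a single fully connected layer. The sketch below uses random numbers in place of a real image and real trained weights (both are assumptions for illustration): shuffling the pixels changes nothing as long as the layer's input weights are shuffled the same way, which is why such a layer has no preferred pixel order.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(16)            # a flattened 4x4 "image" (random stand-in)
W = rng.normal(size=(5, 16))  # weights of one dense layer with 5 units
b = rng.normal(size=5)        # biases of that layer

perm = rng.permutation(16)    # one fixed shuffle of the pixel positions
x_shuffled = x[perm]
W_shuffled = W[:, perm]       # the same shuffle applied to the input weights

out_original = W @ x + b
out_matched = W_shuffled @ x_shuffled + b  # weights reordered to match the shuffle
out_unmatched = W @ x_shuffled + b         # old weights applied to shuffled pixels

same_with_matched_weights = np.allclose(out_original, out_matched)
same_with_old_weights = np.allclose(out_original, out_unmatched)
print(same_with_matched_weights, same_with_old_weights)
```

Training on consistently shuffled images can therefore reach an equivalent solution simply by learning the reordered weights. A convolutional filter has no such escape, since it only ever sees pixels that are adjacent in the grid it is given.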
In essence, the advantage of CNNs for image processing tasks lies in their ability to understand and utilize the layout and structure of the image, something that FCNs are not designed to do.