The Problem
Create a program that will analyse a document image and will determine the document's physical layout and in some extent the logical layout of the document.
The Test Document Images



The Solution
My solution is a simple and quite restrictive bottom-up document physical layout analyzer.
Bottom-up layout method depends on local information to analyze and build the parts of the document, like black or ON pixels and connected components as opposed to top-down methods that makes use of black/white stripes/spaces/gaps that is use to strip the document into smaller parts for analysis. Many variation have been proposed for each of the two methods and hybrid methods are also available but there is none (based from the literatures that I have read) that will work on any type of document layout.
The solution presented in this blog makes use of three level/pass of simple connected components (CCs) analysis which is described below :
1. For the first level the images where converted to binary image using a a naive threshold method that makes use of a predefined gray threshold value of 220.
2. After binarization, connected components (CCs) where extracted from the binary image and where placed on a list.
3.After producing the CCs, the components where sorted with respect to their Y and X coordinate in the binary image, having greater priority for the Y value than the X, making the component with both the least X and Y value (the topmost-leftmost component) the first on the list. Small components like the dot of the letter "i", period , comma, hypens, and the like plus some unfiltered noise where placed near the end of the list so that they will not hinder later processing.
4. After sorting, the euclidian distance of adjacent pair components on the list where computed, grouped based from the range of the Y - value of the components and then stored in a group-distance list.
5. Each group-distance where sorted and for each group the 20th percentile was computed. This value will be the maximum allowable distance (MAD) of adjacent connected components to determine which will be linked as one bigger component.
6. The building up of a bigger component will stat by analyzing the list of sorted CCs again. Distance of adjacent pair components where measured again and compared to the computed MAD. If the distance of the pair is acceptable, the component with the lesser X and Y value will be linked to the bigger component being formed. Once the distance of a chosen pair goes beyond the MAD value, the component with the lesser X and Y value will be linked to the currently being formed component and the building up this bigger component will be stopped and a new bigger component will be built using the other half of the pair. The process is done repeatedly until no pair of component can be produced. The process of linking components is achieved by drawing a solid rectangular binary box that covers all the linked component in the produced binary image in step 1. This linking is shown as transparent rectangular box drawn in the color image. (presented in the Sample Results section below)
7. After linking all the smaller components, the second level processing follows, which is just a repetition of Step 2 - 6 changing only the percentile value to 80 in step 5.
8.The third level processing follows, which is also a repetition of Step 2 - 6 changing only the percentile value to 10 in step 5.
Sample Results
First Pass ( Forming of Words)
Second Pass ( Forming of Lines/Sentences)
Third Pass ( Forming of Paragraph)
Final Remarks
The solution presented is far from perfect. As seen from the images above, there are components that were either fail to be linked or has been linked incorrectly. Also, no preprocessing was made to the image, like skew correction, noise removal, and background removal.














































