One of our providers give us some data ad PDF and I have to produce a JSON object for further elaborations.
For the textual information non problem: I used pdftotext
to extract the text.
$content = shell_exec('pdftotext -enc UTF-8 -layout input.pdf -');
Then I used regular expressions to extract the data
$anagrafica=array();
if(preg_match('/^Denominazione\W*(.*)/m', $content, $aDenominazione)) {
$anagrafica['denominazione']=$aDenominazione[1];
}
How to extract the data of the semaphores that are images without labels?
I used the linux command pdftohtml
$rawImages = shell_exec('pdftohtml -enc UTF-8 -noframes -stdout -xml "'.$this->filePath.'" - | grep image');
$tok = strtok($rawImages,"\r\n");
while ($tok !== false) {
$oImage = simplexml_load_string($tok);
$images[]=$oImage;
$tok = strtok("\r\n");
}
The output of pdftohtml
in a xml document for each text box or image.
$rawImages
is an array of the xml elements of the images ans I put them as SimpleXmlObjects in $images
array.
Than I searched trough the array the images with 77 pixel of width and sort the by the vertical position.
The images are saved in the current directory of the script.
I queried the color of a pixel in a specific position of the image with convert
command of ImageMagick library and saved the data in the JSON object.
$color = shell_exec('convert "'.$imagePath.'" -format \'%[pixel:p{100,50}]\' info:- ');
switch ($color) {
case 'srgb(253,78,83)':
$anagrafica[$this::chekcs[$pos]]='red';
break;
case 'srgb(123,196,78)':
$anagrafica[$this::chekcs[$pos]]='green';
break;
case 'srgb(254,211,80)':
$anagrafica[$this::chekcs[$pos]]='yellow';
break;
};
At this point: is there an easy way to do the trick?
Top comments (0)