LISA adapted to SamGIS
Image segmentation is a crucial task in computer vision, where the goal is to extract the instance segmentation mask for a desired object within an image. I've already worked on a project, SamGIS, that focuses on this application. A logical next step is adding the ability to recognize objects through text prompts. This apparently simple task is actually different from what Segment Anything (the ML backend used by SamGIS) does: "SAM" does not output descriptions or categories for its input images. Starting from a written prompt, on the other hand, requires understanding which classes of objects exist in the image under analysis. A visual language model (VLM) that performs well at this task is LISA. LISA's authors built their work on top of Segment Anything and LLaVA, a large language model with multimodal capabilities (it can process both text prompts and images). By leveraging LISA's "reasoning segmentation" abilities, SamGIS can now conduct "zero-shot" analyses, meaning it can operate without specific or specialized prior training in geological, geomorphological, or photogrammetric fields.
Some input text prompts with their geojson outputs
I can't show this part on dev.to, so please see my blog page.
Duration of segmentation tasks
At the moment, a prompt that also asks for an explanation of the segmentation task greatly slows down the analysis. The same prompt on the same image, without "descriptive" or "explanatory" questions, finishes much faster. Tests with explanatory text take more than 60 seconds, while tests without it take between 3 and 8 seconds, using the HuggingFace hardware profile "Nvidia T4 Small" with 4 vCPU, 15 GB RAM and 16 GB VRAM.
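To compare durations like these, a small timing helper is enough. The following is only a minimal sketch: `timed` and `fake_inference` are hypothetical names, and `fake_inference` is a stand-in that does no real work, so it says nothing about actual LISA timings.

```python
import time
from typing import Callable, Tuple


def timed(fn: Callable[[str], object], prompt: str) -> Tuple[object, float]:
    """Run fn(prompt) and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(prompt)
    return result, time.perf_counter() - start


def fake_inference(prompt: str) -> str:
    # Hypothetical stand-in for the real model call; replace it with the actual wrapper.
    return f"mask for: {prompt}"


if __name__ == "__main__":
    for prompt in ("segment the river", "segment the river and explain why"):
        _, seconds = timed(fake_inference, prompt)
        print(f"{prompt!r}: {seconds:.3f} s")
```

Swapping `fake_inference` for the real inference function would let you time a plain prompt and an explanatory one on the same image.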
Software architecture
Architecturally, the demo consists of a frontend page similar to the SamGIS demo. Instead of the drawing toolbar, there is a text prompt for natural language requests, with some selectable examples displayed at the top of the page. The backend is a FastAPI-based API that calls a custom LISA function wrapper.
Unfortunately, I have to pause my demo due to GPU costs, but I am requesting the use of a free GPU from HuggingFace. Please feel free to reach out to me on LinkedIn for a live demonstration, more information, or further clarification.