What I learnt from development on LISA with SamGIS (So far)
Read publications related to the projects I work on
To improve my understanding of my machine learning project, I decided to read the papers that LISA and Segment Anything are based on. Beyond some theoretical background on LLMs, I noticed that the modular architecture of "SAM" makes it possible to save and re-use image embeddings. Since SamGIS didn't initially work this way, I formulated a hypothesis about it.
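To make the idea concrete, here is a minimal sketch using the official segment_anything package (the model variant, checkpoint path and dummy image are placeholders, not what SamGIS actually uses). The heavy image encoder runs once inside set_image; each predict call afterwards only runs the lightweight prompt encoder and mask decoder:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint; the variant and path here are placeholders.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# A stand-in for a rendered mosaic of map tiles (HxWx3 uint8, RGB).
image = np.zeros((768, 1024, 3), dtype=np.uint8)

# The expensive step: the image encoder runs once here and the resulting
# embedding is kept inside the predictor.
predictor.set_image(image)

# Each prompt below re-uses that embedding; only the lightweight prompt
# encoder and mask decoder run, so these calls are cheap.
masks_a, scores_a, _ = predictor.predict(
    point_coords=np.array([[500, 375]]), point_labels=np.array([1])
)
masks_b, scores_b, _ = predictor.predict(
    point_coords=np.array([[620, 410]]), point_labels=np.array([1])
)
```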
Debugging, measurements and optimization: the image embedding hypothesis
At this point I continued debugging by measuring the duration of the individual steps executed by the SamGIS functions. Creating an image embedding is quite an expensive operation, so it is advantageous to save and re-use it (I verified that implementing my hypothesis would improve the software's performance). Using the HuggingFace "Nvidia T4 Small" hardware profile (4 vCPU, 15 GB RAM and 16 GB VRAM), it's possible to save almost 1 second on every inference after the first on the same image (as long as neither the geographical area nor the tile provider changes).
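Below is a minimal sketch of how such a cache could look, with a timer around the step being measured. This is not SamGIS's actual implementation: it keys the cache on a hash of the pixels (keying on tile provider plus bounding box would be an alternative) and it restores SamPredictor's internal attributes (features, original_size, input_size, is_image_set), which are implementation details of the segment_anything package and may break on other SAM variants.

```python
import hashlib
import time

import numpy as np
from segment_anything import SamPredictor

_embedding_cache = {}  # key -> (features, original_size, input_size)


def _cache_key(image: np.ndarray) -> str:
    # Hash the raw pixels; in a GIS setting one could instead key on the
    # tile provider plus the requested bounding box.
    return hashlib.sha256(image.tobytes()).hexdigest()


def set_image_cached(predictor: SamPredictor, image: np.ndarray) -> float:
    """Set the predictor's image, re-using a cached embedding when possible.

    Returns the elapsed time in seconds, so this step can be measured on its
    own. Relies on SamPredictor internals (features, original_size,
    input_size, is_image_set), which is a hack rather than a public API.
    """
    start = time.perf_counter()
    key = _cache_key(image)
    if key in _embedding_cache:
        # Cache hit: skip the expensive image encoder entirely.
        predictor.features, predictor.original_size, predictor.input_size = (
            _embedding_cache[key]
        )
        predictor.is_image_set = True
    else:
        # Cache miss: run the image encoder once, then store the result.
        predictor.set_image(image)
        _embedding_cache[key] = (
            predictor.features,
            predictor.original_size,
            predictor.input_size,
        )
    return time.perf_counter() - start
```

With a cache like this, only the first request for a given image pays the embedding cost; every later request on the same image skips the encoder, which is where the roughly 1-second saving measured above comes from.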
The role of LLMs with prompts of different characteristics
LISA inherits the language generation capabilities of multi-modal LLMs such as LLaVA. These models excel at handling complex reasoning, world knowledge, explanatory answers and multi-turn conversations; they are powerful tools for bridging the gap between text and visual understanding.
LISA can perform fairly complex reasoning during image segmentation (e.g. "identify the houses near the trees..." vs "identify the houses...") without any particular performance degradation. By contrast, requests that also ask the model to explain why the segmentation was done a certain way ("explain why") have much higher execution times, on the order of minutes.
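A small timing helper makes this easy to observe. The inference call is represented here by a generic callable, since the exact entry point depends on how LISA is deployed (the prompts and the `my_lisa_call` name in the usage comment are illustrative only):

```python
import time
from typing import Callable, Iterable


def time_prompts(infer: Callable[[str], object], prompts: Iterable[str]) -> None:
    """Run one inference per prompt and print how long each took.

    `infer` is whatever callable wraps the deployed LISA model; its real
    signature is deployment-specific.
    """
    for prompt in prompts:
        start = time.perf_counter()
        infer(prompt)
        print(f"{time.perf_counter() - start:7.1f}s | {prompt!r}")


# Plain and "reasoning" segmentation prompts take comparable time, while the
# "explain why" variant triggers long text generation:
# time_prompts(my_lisa_call, [
#     "identify the houses",
#     "identify the houses near the trees",
#     "identify the houses and explain why you segmented them this way",
# ])
```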
You can find more details here about these improvements following the changes described above, and about the performance differences across these cases when using SamGIS with LISA.