I tell people that I am a software developer, but that’s not what I really do. I’m a problem solver that sometimes has to use code to solve problems I am faced with. When I was only a few years into my career, my boss once referred to me as a “hacker”, and I found a great truth in the term. My job wasn’t to package and sell applications to users, it was to hack away and eventually solve complex problems involving spatial data and network routing. As I matured as a problem solver, I came to identify three different levels of abstraction.
For context, consider a request for some processing in a spatial database on data representing a collection of points and roads. To join along or understand further, reference the MTFCC Classifications and an SQL script tailored for DuckDB.
- How many points are within 100m of a primary road?
- How many points are within 100m of a secondary road?
Level 1: Solves one known problem
In the first level, abstraction is nearly non-existent. We write two queries, one to answer each request. Our solution doesn’t provide the ability to answer any questions beyond the identified requests. These types of solutions lack reusability and are more likely to be redundant with other solutions created in the past and future. They only require knowledge of the task at hand.
select count(distinct p.id) Count
from points p
join roads r on st_dwithin(p.geom,r.geom,100)=true
where r.mtfcc='S1100';
| Count |
|---|
| 26 |
select count(distinct p.id) Count
from points p
join roads r on st_dwithin(p.geom,r.geom,100)=true
where r.mtfcc='S1200';
| Count |
|---|
| 107 |
Level 2: Solves multiple known problems
Instead of answering each request independently, we create a solution that can answer both requests. As a bonus, we reveal additional information as we can now see the distribution of nearest road types for any point within 100m. This depended on the fact that we had knowledge of more than one request. When working in teams or organizations, effective communication becomes increasingly important at this level. It’s easier to create solutions for known problems than unknown ones.
select mtfcc Classification
,count(distinct p.id) Count
from points p
join roads r on st_dwithin(p.geom,r.geom,100)=true
group by 1
order by 1;
| Classification | Count |
|---|---|
| S1100 | 26 |
| S1200 | 107 |
| S1400 | 1182 |
| S1500 | 2 |
| S1630 | 3 |
| S1640 | 2 |
| S1710 | 1 |
| S1750 | 8 |
Level 3: Solves multiple known and unknown problems
At this level, we are predicting and anticipating the future. We ask ourselves what requests will come next and what other questions we can answer. We don’t just answer the things we know have been asked of us, we explore further. The solution we create is capable of answering multiple known and unknown questions. We elect to create a new data set that provides the closest road_id for every point_id regardless of distance.
drop table if exists points_all;
create table points_all as
select distinct on (p.id)
p.id point_id
,r.tlid road_id
,r.mtfcc classification
,st_distance(p.geom,r.geom) distancem
from points p,roads r
order by 1,3
| point_id | road_id | classification | distancem |
|---|---|---|---|
| 1 | 647157989 | S1400 | 826.0732468995094 |
| 2 | 2828553 | S1400 | 420.80429160397244 |
| 3 | 2822803 | S1400 | 138.23585861539044 |
| 4 | 2825668 | S1400 | 207.97482230400536 |
| 5 | 2840003 | S1400 | 70.72361199647597 |
Consider how much more we can now answer with our derivative table points_all showing every relationship, regardless of road classification or distance.
- How many points are within 200m of any road? Further than 1km?
- Statistics like average, percentiles, and more can be determined.
- We can bin the results into distance bands (0-10m, 10-25m, 25-50m, etc.) for all roads or by classification.
Simple example, but with big implications
The example provided is simple. The implications are large. What if our data set was hundreds of millions of points? If we aproach with the mindset of level 1, each query may run for hours before returning a result. If subsequent requests are made to determine the count of points for different distances or classifications, the solution will, again, be hours away. With the knowledge of more information we get in level 2 or level 3, we may choose to ask different questions.
The provided example uses a spatial data analysis task, but the same principles can be applied to software development. The lessons are clear: Work with a more expansive understanding of the problems you are solving while focusing on building tools and capabilities rather than just generating one off answers.