While a
market model is developed using the recent arms-length sales, the real
objective is to generate efficient values for the unsold population, which is
generally 95% of the entire population from which a sales sample is derived.
Therefore, it is critical to scientifically test the representativeness of the
sample.

While most
AVMs are built around Multiple Regression Analysis (MRA) and Time Series, an
efficient AVM process should also include Sampling considering its inherent
power and strong association with the MRA. A properly derived sample is almost
as good as the population, so sampling is extremely useful when the population
is large.

The power of
sampling makes the population more manageable, helps optimize allocation of
resources, and points more scientifically to the inefficient areas of modeling
or work-flow.

Also, there
is no such thing as a generic 5% sample either. A 5% sample of a homogeneous
(residential) population could be an excessive sample, while it could be an
inadequate sample for a very heterogeneous population like the commercial
properties, ranging from multifamily to automotive to daycare to dining to
entertainment to office to retail to industrial to warehouse, etc.

A sample, by
definition, is representative of the population, but there is no such thing as
a perfect sample. A perfect sample is the population. While testing the
representativeness of the sample, selecting variables from the different strata
of variables – quantitative variables, qualitative variables, general location variables,
etc. – is of utmost importance.

The sales
sample has to pass the “representative” test not only at the overall level, but
also pass a series of stratified tests, e.g., geography, size, age, price,
exterior characteristics, etc. Once the sales sample is established, it must be
split it up between a modeling sample (70-80%) and a holdout sample (30-20%).

Since the model
is tested on the holdout sample before being applied on to the population, the
holdout sample must have very similar attributes as the modeling sample. Of
course, both must also represent the parent sales sample. If the sales sample
is large, the split could be larger: modeling sample 65% and holdout sample 35%.

Moreover, the
Percentile distribution (“Pctl”) curve – at least from the 25

^{th}Percentile to the 75^{th}Percentile – is appropriate to test the representativeness of the quantitative variables, while the frequency percentile distribution is fitting for the qualitative and general location variables. Tests pertaining to the quantitative variables are more important than their qualitative counterparts.
The sample
representative test for the quantitative variables must be performed in three sequential
steps: (1) Median-based tests, (2) Body of the Curve (25

^{th}to 75^{th}Percentile distribution) Tests, and (3) Expanded Curve (5^{th}to 95^{th }Percentile distribution) Tests.

**Median-based Tests**

Though the above
sample is 6.90% of the population – admittedly, quite high for a homogeneous
sample – it will be reduced later when it gets split up between modeling and
holdout samples. Again, even the most representative samples will marginally
differ from their parent populations in almost all categories. Therefore, in
evaluating the representativeness of the quantitative variables, a 10%
differential could be accepted as the rule of thumb. In Table-1 above, all of
the variables, except Bldg Age, have passed the 10% test. Bldg Age – derived as
Current Year–Year Built – is used in place of the Year Built in order to ease into
the modeling, thus allowing direct introduction as an independent variable into
the MRA equation.

The median
Bldg Age in the sales sample is 40% lower than the population’s, clearly pointing
to the fact that the recent buyers preferred younger homes without sacrificing
other attributes. When the Gross Bldg Area (GBA) variable is available in the
database, it must also be evaluated in order to make the comparison more apples-to-apples,
meaning the additional non-living improved areas must also be compared as they
contribute to the overall value of the home. Of course, the Living Area variable
must always be preferred as an independent variable to the GBA in the actual
MRA. Some county databases may contain Heated Area instead of or in addition to
the Living Area. Heated Area could be as good. While Bath (BA) counts are important
comparison metrics, Bedroom (BR) count is not.

Given the fact
that only a limited number of truly meaningful quantitative variables are generally
available, any significant divergence – as in Bldg Age here – must be subjected
to more in-depth examination and, more often than not, they could thus be saved
and utilized. When the median shows a significant divergence, despite a
reasonably large sample size, the entire percentile curve – from the 5

^{th}to the 95^{th}– must be closely analyzed. The curve often points to a cluster in the short or long end (of the curve), forcing the sample median to be skewed and consequently diverge from the population median.
The Chart-1
above shows a significant cluster at the short end (through the 25

^{th}) of the sample curve which is pulling the sample median down, although the long end of the curve (75^{th}to 95^{th}) shows meaningful convergence. This phenomenon often occurs in the marketplace when builders offer additional incentives (free upgrades, low interest financing, interest buy down, etc.) to push the excess inventory out. As the excess inventory gets absorbed, the local market returns to normalcy, resulting in the future sales samples to mimic the population more closely. This tends to be a very short-term phenomenon, perhaps an aberration, and as such gets smoothed out over time. The Bldg Age variable here is therefore perfectly usable in actual modeling.
If more and
more millennials continue to move into the area and follow suit, a structural
shift (a new trend) in the market would occur. While such a shift would be
quickly captured in the sales sample, it would take much longer to reflect in
the population stats. Meanwhile, in order to make the sales sample usable in market
modeling, it would require lot more liquidity, e.g., instead of most recent
twelve months worth of sales, eighteen to twenty-four months worth of sales
could be experimented with, thus significantly reducing or perhaps ironing out
the recent aberration (assuming, of course, it’s statistically non-structural).

The Chart-2
above shows the two Living Area percentile curves are moving in tandem, without
any divergence in short, mid or long ends, proving that no such structural
change in the market has been taking place. Nonetheless, this test is critical
when one important quantitative variable shows any misalignment, thereby instilling
the necessary statistical confidence into the modeling spectrum.

Though many
practitioners do not go past the median-based test, it is necessary but not
sufficient. Models developed off of the median-based test samples are prone to
hidden errors on both ends of the curve as they are totally overlooked at the
point of sampling. If one is forced to accept the median-based solution, one
must additionally compare the relative Standard Deviations (SD) and Coefficient
of Variations (COV). In real estate economics, COV is normalized by the median,
rather than the mean, as the median reduces the incidence of outliers in the
data series.

**Body of the Curve (25**

^{th}to 75^{th})

Even the body
of the curve comparison (Table-4) shows very consistent (within 10%) results,
except for the previously-identified Bldg Age variable. The only other
out-of-range node happens to be the 75

^{th}Percentile of the Land Area, which is common in suburban counties.
In urban
areas, the vast majority of inner city residential lots are zero-lot-line lots,
while the remainders are generally smaller and are fairly homogeneous, thus
compressing the deviation. While the GBA tends to become erratic on outer end
of the curve, it is nevertheless quite consistent here. In analyzing and
comparing the body of the curve, Q-Range (75

^{th}– 25^{th}) is a good measure of variability.
The body of
the curve methodology would be perfectly fine when the intended model is of
lesser complexity. In other words, if the model does not demand ultra precision
and accuracy, one could stop right at this step, to avoid having to drill down
to the following more complex step.

**Expanded Curve (5**

^{th}to 95^{th})
When a highly
accurate model is required, this expanded approach is recommended considering
it depicts both ends of the curve as well. Predictably, the outer end (90

^{th}– 95^{th}) of the Land Area is exponentially diverging. Therefore, if the Land Area variable becomes significant in the model, model values pertaining to those parcels (Land Area >=90^{th}) must be subjected to additional tests and scrutiny. Conversely, the Bldg Age is fast converging on the outer end of the curve.
Since the COV
is a normalized measure, it is a better measure of variability than the SD. For
example, though the GBA’s SD (Table 5) is significantly higher than Living
Area’s, the two COVs are very close to each other. Similarly, despite the Bldg
Age’s lower SD (Table 5) than the Population’s (Table 6), the COVs are
transposed.

**Testing Qualitative Variables**

In both
instances (Tables 8-9) the sample adequately represents the population, proving
that the qualitative variables are statistically well-aligned from the modeling
point of view. Since this dataset has been extracted from a Southern coastal
state, the Waterfront variable is obviously meaningful. Alternatively, for a NE
state, the Style variable would make sense in view of the diversity of home
styles there.

Table-10
demonstrates that the homebuyers had an inverse love affair with the two most
liquid towns, i.e., TOWN 7 and TOWN 8, albeit both towns have successfully met
the 10% threshold. Homebuyers were however more steady in approaching TOWN 5,
the third most liquid town on the lineup. In any event, the town-wise
distribution amply confirms the representativeness of the market sample a well.

__Holdout Sample__
Now that the primary
sales sample has been established, it needs to be split up into two parts:
Modeling Sample and Holdout Sample. The latter ensures the accuracy and consistency
of the model before being applied on to the population. When the final model, developed
off of a modeling sample, shows a COV of 12, the holdout application should
show a very similar COV (e.g., between 11.50 and 12.50), after removal of the outliers
involving the same range. If the holdout application produces a COV of 14.00,
the model must be re-examined. Often, this interim step helps identify the
areas of model failures and weaknesses. Often, based on the holdout results,
the model is recalibrated, refined or fine-tuned. Absent this step, the direct
model application on to the population would be unscientific, at least
statistically insignificant.

Again, while
the primary sales sample is tested against the population as a whole, the
modeling and hold-out sub-samples are tested against the primary sales sample they
are derived from. Needless to say, both sub-samples must retain the attributes
of the primary. Unlike the extensive and multi-tier tests the primary sales sample
is subjected to, the median-based tests for the sub-samples would suffice for
the sub-samples, considering this is a review step, rather than a modeling
step.

Table-11 shows how the primary sales
sample has been split up into modeling (70%) and holdout (30%) sub-samples
while retaining the original statistical properties of all of the seven
analysis variables. Post split, the modeling sample has been reduced to 4.83%
(6,632/137,012). Depending on the size of the sales sample, the modeling and
holdout splits could be manipulated. In this case, the sample was large to
justify a 70%-30%. Had the sample been smaller, a 75%-25% (even 80%-20%) would
have made sense. Since the model would be developed off of the modeling sample,
its liquidity is more important.

__Splitting up a Sample in Excel__

- In splitting up the sales sample, Excel’s Random Number Generator was used.
- A new column B
was created to store the resulting random numbers.
__Note: In order to access the Random Number Generator, Excel’s Data Analysis module needs to be activated.__ - The sales sample was then sorted by the random number column.
- The top 30% of the sales sample contributed to the holdout sample, while the bottom 70% was retained as the modeling sample. Note: It’s a manual step. No other manipulation is needed to generate the sub-samples.

Excel’s

*were utilized to derive median and percentile distributions of the quantitative variables.*__Median and Percentile (statistical) functions__
Excel’s

*was utilized to derive the frequency distributions of the qualitative variables.*__Frequency function__
_________________________________________

Reprinted from the forthcoming book 'A Fully Illustrated Guide to Automated Valuation Modeling (AVM) . 100% Excel-based (No need to learn a complicated Stat package'