Friday, April 21, 2017

Testing the Representativeness of a Sales Sample

While a market model is developed using the recent arms-length sales, the real objective is to generate efficient values for the unsold population, which is generally 95% of the entire population from which a sales sample is derived. Therefore, it is critical to scientifically test the representativeness of the sample.

While most AVMs are built around Multiple Regression Analysis (MRA) and Time Series, an efficient AVM process should also include Sampling considering its inherent power and strong association with the MRA. A properly derived sample is almost as good as the population, so sampling is extremely useful when the population is large.

The power of sampling makes the population more manageable, helps optimize allocation of resources, and points more scientifically to the inefficient areas of modeling or work-flow.

Also, there is no such thing as a generic 5% sample either. A 5% sample of a homogeneous (residential) population could be an excessive sample, while it could be an inadequate sample for a very heterogeneous population like the commercial properties, ranging from multifamily to automotive to daycare to dining to entertainment to office to retail to industrial to warehouse, etc.

A sample, by definition, is representative of the population, but there is no such thing as a perfect sample. A perfect sample is the population. While testing the representativeness of the sample, selecting variables from the different strata of variables – quantitative variables, qualitative variables, general location variables, etc. – is of utmost importance.

The sales sample has to pass the “representative” test not only at the overall level, but also pass a series of stratified tests, e.g., geography, size, age, price, exterior characteristics, etc. Once the sales sample is established, it must be split it up between a modeling sample (70-80%) and a holdout sample (30-20%).

Since the model is tested on the holdout sample before being applied on to the population, the holdout sample must have very similar attributes as the modeling sample. Of course, both must also represent the parent sales sample. If the sales sample is large, the split could be larger: modeling sample 65% and holdout sample 35%.

Moreover, the Percentile distribution (“Pctl”) curve – at least from the 25th Percentile to the 75th Percentile – is appropriate to test the representativeness of the quantitative variables, while the frequency percentile distribution is fitting for the qualitative and general location variables. Tests pertaining to the quantitative variables are more important than their qualitative counterparts.

The sample representative test for the quantitative variables must be performed in three sequential steps: (1) Median-based tests, (2) Body of the Curve (25th to 75th Percentile distribution) Tests, and (3) Expanded Curve (5th to 95th Percentile distribution) Tests.
Median-based Tests

Though the above sample is 6.90% of the population – admittedly, quite high for a homogeneous sample – it will be reduced later when it gets split up between modeling and holdout samples. Again, even the most representative samples will marginally differ from their parent populations in almost all categories. Therefore, in evaluating the representativeness of the quantitative variables, a 10% differential could be accepted as the rule of thumb. In Table-1 above, all of the variables, except Bldg Age, have passed the 10% test. Bldg Age – derived as Current Year–Year Built – is used in place of the Year Built in order to ease into the modeling, thus allowing direct introduction as an independent variable into the MRA equation. 

The median Bldg Age in the sales sample is 40% lower than the population’s, clearly pointing to the fact that the recent buyers preferred younger homes without sacrificing other attributes. When the Gross Bldg Area (GBA) variable is available in the database, it must also be evaluated in order to make the comparison more apples-to-apples, meaning the additional non-living improved areas must also be compared as they contribute to the overall value of the home. Of course, the Living Area variable must always be preferred as an independent variable to the GBA in the actual MRA. Some county databases may contain Heated Area instead of or in addition to the Living Area. Heated Area could be as good. While Bath (BA) counts are important comparison metrics, Bedroom (BR) count is not.

Given the fact that only a limited number of truly meaningful quantitative variables are generally available, any significant divergence – as in Bldg Age here – must be subjected to more in-depth examination and, more often than not, they could thus be saved and utilized. When the median shows a significant divergence, despite a reasonably large sample size, the entire percentile curve – from the 5th to the 95th – must be closely analyzed. The curve often points to a cluster in the short or long end (of the curve), forcing the sample median to be skewed and consequently diverge from the population median.

The Chart-1 above shows a significant cluster at the short end (through the 25th) of the sample curve which is pulling the sample median down, although the long end of the curve (75th to 95th) shows meaningful convergence. This phenomenon often occurs in the marketplace when builders offer additional incentives (free upgrades, low interest financing, interest buy down, etc.) to push the excess inventory out. As the excess inventory gets absorbed, the local market returns to normalcy, resulting in the future sales samples to mimic the population more closely. This tends to be a very short-term phenomenon, perhaps an aberration, and as such gets smoothed out over time. The Bldg Age variable here is therefore perfectly usable in actual modeling.  

Of course, when one of the basic quantitative variables exhibits such short-term aberration, the Living Area variable must also be co-examined alongside, to avoid having to unintentionally dismiss or gloss over any emerging or ongoing structural shift in the market. A structural shift could be in the offing if both variables move in tandem, in clear divergence from the population. For example, due to the vibrant local economy driven by the IT industry, if the millennial population starts to invest in smaller but younger homes, both sample medians would contract, diverging from the population medians.

If more and more millennials continue to move into the area and follow suit, a structural shift (a new trend) in the market would occur. While such a shift would be quickly captured in the sales sample, it would take much longer to reflect in the population stats. Meanwhile, in order to make the sales sample usable in market modeling, it would require lot more liquidity, e.g., instead of most recent twelve months worth of sales, eighteen to twenty-four months worth of sales could be experimented with, thus significantly reducing or perhaps ironing out the recent aberration (assuming, of course, it’s statistically non-structural).       

The Chart-2 above shows the two Living Area percentile curves are moving in tandem, without any divergence in short, mid or long ends, proving that no such structural change in the market has been taking place. Nonetheless, this test is critical when one important quantitative variable shows any misalignment, thereby instilling the necessary statistical confidence into the modeling spectrum.

Though many practitioners do not go past the median-based test, it is necessary but not sufficient. Models developed off of the median-based test samples are prone to hidden errors on both ends of the curve as they are totally overlooked at the point of sampling. If one is forced to accept the median-based solution, one must additionally compare the relative Standard Deviations (SD) and Coefficient of Variations (COV). In real estate economics, COV is normalized by the median, rather than the mean, as the median reduces the incidence of outliers in the data series.       
Body of the Curve (25th to 75th)

Even the body of the curve comparison (Table-4) shows very consistent (within 10%) results, except for the previously-identified Bldg Age variable. The only other out-of-range node happens to be the 75th Percentile of the Land Area, which is common in suburban counties.

In urban areas, the vast majority of inner city residential lots are zero-lot-line lots, while the remainders are generally smaller and are fairly homogeneous, thus compressing the deviation. While the GBA tends to become erratic on outer end of the curve, it is nevertheless quite consistent here. In analyzing and comparing the body of the curve, Q-Range (75th – 25th) is a good measure of variability.

The body of the curve methodology would be perfectly fine when the intended model is of lesser complexity. In other words, if the model does not demand ultra precision and accuracy, one could stop right at this step, to avoid having to drill down to the following more complex step.

      Expanded Curve (5th to 95th

When a highly accurate model is required, this expanded approach is recommended considering it depicts both ends of the curve as well. Predictably, the outer end (90th – 95th) of the Land Area is exponentially diverging. Therefore, if the Land Area variable becomes significant in the model, model values pertaining to those parcels (Land Area >=90th) must be subjected to additional tests and scrutiny. Conversely, the Bldg Age is fast converging on the outer end of the curve.

Since the COV is a normalized measure, it is a better measure of variability than the SD. For example, though the GBA’s SD (Table 5) is significantly higher than Living Area’s, the two COVs are very close to each other. Similarly, despite the Bldg Age’s lower SD (Table 5) than the Population’s (Table 6), the COVs are transposed.

        Testing Qualitative Variables

In both instances (Tables 8-9) the sample adequately represents the population, proving that the qualitative variables are statistically well-aligned from the modeling point of view. Since this dataset has been extracted from a Southern coastal state, the Waterfront variable is obviously meaningful. Alternatively, for a NE state, the Style variable would make sense in view of the diversity of home styles there.     

Table-10 demonstrates that the homebuyers had an inverse love affair with the two most liquid towns, i.e., TOWN 7 and TOWN 8, albeit both towns have successfully met the 10% threshold. Homebuyers were however more steady in approaching TOWN 5, the third most liquid town on the lineup. In any event, the town-wise distribution amply confirms the representativeness of the market sample a well.

Holdout Sample
Now that the primary sales sample has been established, it needs to be split up into two parts: Modeling Sample and Holdout Sample. The latter ensures the accuracy and consistency of the model before being applied on to the population. When the final model, developed off of a modeling sample, shows a COV of 12, the holdout application should show a very similar COV (e.g., between 11.50 and 12.50), after removal of the outliers involving the same range. If the holdout application produces a COV of 14.00, the model must be re-examined. Often, this interim step helps identify the areas of model failures and weaknesses. Often, based on the holdout results, the model is recalibrated, refined or fine-tuned. Absent this step, the direct model application on to the population would be unscientific, at least statistically insignificant.

Again, while the primary sales sample is tested against the population as a whole, the modeling and hold-out sub-samples are tested against the primary sales sample they are derived from. Needless to say, both sub-samples must retain the attributes of the primary. Unlike the extensive and multi-tier tests the primary sales sample is subjected to, the median-based tests for the sub-samples would suffice for the sub-samples, considering this is a review step, rather than a modeling step.

          Table-11 shows how the primary sales sample has been split up into modeling (70%) and holdout (30%) sub-samples while retaining the original statistical properties of all of the seven analysis variables. Post split, the modeling sample has been reduced to 4.83% (6,632/137,012). Depending on the size of the sales sample, the modeling and holdout splits could be manipulated. In this case, the sample was large to justify a 70%-30%. Had the sample been smaller, a 75%-25% (even 80%-20%) would have made sense. Since the model would be developed off of the modeling sample, its liquidity is more important.
Splitting up a Sample in Excel

  1. In splitting up the sales sample, Excel’s Random Number Generator was used.
  2. A new column B was created to store the resulting random numbers. Note: In order to access the Random Number Generator, Excel’s Data Analysis module needs to be activated.
  3.   The sales sample was then sorted by the random number column.
  4. The top 30% of the sales sample contributed to the holdout sample, while the bottom 70% was retained as the modeling sample. Note: It’s a manual step. No other manipulation is needed to generate the sub-samples.

Excel’s Median and Percentile (statistical) functions were utilized to derive median and percentile distributions of the quantitative variables.

Excel’s Frequency function was utilized to derive the frequency distributions of the qualitative variables.

Reprinted from the forthcoming book 'A Fully Illustrated Guide to Automated Valuation Modeling (AVM) . 100% Excel-based (No need to learn a complicated Stat package'

No comments:

Post a Comment