SmallNetBuilder Roaming Testbed
How To Fix Wi-Fi Roaming tried to explain the whys and wherefores of Wi-Fi roaming. The key intended takeaway was that the wireless device, not the AP/router/mesh point, is the biggest factor in how a device moves between APs.
But Wi-Fi marketeers continue to promise "seamless" roaming and frustrated Wi-Fi users continue to believe them, especially when vendors promise their products use special techniques to bend Wi-Fi devices to their will.
So since I always say that data should do the talking, I'm going to show, in detail, how different devices can have entirely different roaming behavior with the same Wi-Fi system.
I decided to use a NETGEAR AC2200 Orbi "mini" (RBK40) for this experiment because I had it on hand and because it supports the 802.11k and v roaming assistance standards. So if devices also support either or both of these standards, Orbi should be able to help them on their roaming journey. The Router and Satellite backhaul was via Ethernet.
NETGEAR RBK40 Orbi ("mini") AC2200 Wi-Fi System
For devices, I chose octoScope's Pal-245 partner device, a Windows 10 computer with embedded Wi-Fi adapter and an Android tablet.
With most devices, roaming behavior is a closely guarded secret; not so with the Pal. The Pal-245 provides a STA with known and controllable roaming behavior. Virtually every aspect of its Wi-Fi functions can be controlled, including number of streams, MCS rate and 802.11 mode (a/b/g/n/ac).
For roaming, threshold RSSI (when to roam), band preference and target threshold RSSI can all be controlled (they are outlined in red below). The only thing I couldn't do was disable its 802.11v support (it doesn't support 11k or r).
octoScope Pal-245 controls
The Lenovo M600 computer, with an Intel Wireless-AC 8260 internal adapter and running Windows 10 Pro 64-bit with the latest 188.8.131.52 driver for the adapter, was chosen because Microsoft says Windows 10 supports 802.11k, v and r (depending on device driver support, of course) and Intel says the AC-8260 supports 802.11k, v and r.
Intel adapter 802.11k,v,r support (partial list)
Finally, the Samsung Galaxy Tab A 8" (2017 version - SM-T380) was chosen because it was found to support 802.11k and v by packet inspection, even though neither is specified (of course). In fact, while the tablet is dual-band, it supports only 802.11n (and only 1x1, at that).
I used the octoScope based Wi-Fi System Revision 1 testbed for the tests, configured as shown below. Note the Pal-24 and Pal-5 were not used as STAs. Instead, each was set to its packet capture mode to grab control and management frames only, i.e. no data frames.
Each Pal was set to a channel used by Orbi and the capture files were merged before analysis. The Pals and test computer were all synchronized via NTP to ensure proper sequencing of the merged capture files. As noted earlier, the Orbi router and satellite used Ethernet for backhaul.
Roaming test setup
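Merging captures only produces a trustworthy sequence if both capture clocks agree, which is why the NTP sync matters. As a rough illustration (the article doesn't name the merge tool used; Wireshark's mergecap utility does this job for real pcap files), the sketch below interleaves two already time-sorted frame lists by timestamp. The frame tuples are simplified stand-ins, and the 125.7 and 125.8 timestamps are made up for the example.

```python
import heapq

def merge_captures(*captures):
    """Interleave time-sorted lists of (timestamp_sec, description) by timestamp."""
    return list(heapq.merge(*captures, key=lambda frame: frame[0]))

# Hypothetical, simplified frames: one list per capture radio.
ch6_frames = [
    (101.1, "ch 6: BSS Transition Management Request"),
    (125.6, "ch 6: Disassociation"),
]
ch40_frames = [
    (125.7, "ch 40: Authentication"),        # made-up timestamp
    (125.8, "ch 40: Association Request"),   # made-up timestamp
]

merged = merge_captures(ch6_frames, ch40_frames)
for ts, desc in merged:
    print(f"{ts:7.1f}  {desc}")
```

If one capture clock were off by even a few hundred milliseconds, the authentication on Channel 40 could appear to precede the disassociation on Channel 6, which would change the story the capture tells.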
All Pals and the Windows STA connected via cable directly to the RF splitter. But the Samsung tablet had to be placed in an octoBox RF chamber and connected over the air (OTA) via antenna. So its signal levels were generally lower.
The roaming test itself was controlled by a Python script to run this sequence:
- Set Attenuator 1 to 0 dB, Attenuator 2 to a larger value. This ensures STA will connect to the Router.
- Start packet capture
- Associate STA to Orbi Router
- Start traffic between AP and STA (optional)
- Roam device to Orbi Satellite by increasing Atten 1 and decreasing Atten 2
- Continue to monitor connection for a while after the roam to check for post-roam band-steering or other changes
- Reverse roam from Satellite to Router by decreasing Atten 1 and increasing Atten 2
- Monitor post roam as above
- Stop traffic
- Stop packet capture and save
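The attenuator part of the sequence above can be sketched as a simple schedule generator. This is not the actual test script, which isn't shown. The function name and the 0 to 60 dB range are assumptions for illustration; only the 2 dB step size comes from the test description.

```python
def roam_schedule(start_db=0, stop_db=60, step_db=2):
    """Return (atten1, atten2) pairs that walk the STA from Router to Satellite.

    Attenuator 1 (Router path) ramps up while Attenuator 2 (Satellite path)
    ramps down, so the Router's signal fades as the Satellite's grows.
    The 0-60 dB range is an assumed value, not taken from the article.
    """
    steps = range(start_db, stop_db + 1, step_db)
    return [(a1, stop_db - (a1 - start_db)) for a1 in steps]

forward = roam_schedule()              # Router -> Satellite
reverse = list(reversed(forward))      # Satellite -> Router

print(forward[:3], "...", forward[-1])
print(reverse[:3], "...", reverse[-1])
```

A real script would send each pair to the attenuators, wait for the step interval, then poll the STA before moving to the next pair.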
As noted, running traffic during roaming was not done in each test run. When traffic was present, it came from running a single TCP/IP iperf3 stream from AP to STA, rate-limited to 5 Mbps. The script attempted to restart iperf3 between the post-roam and monitor periods and before the reverse roam start. However, it was not always successful.
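For reference, here is one way such a stream could be launched. It is a sketch under assumptions: the iperf3 client is taken to run on the STA with the server on the wired side, so reverse mode gives the AP-to-STA direction. The server address and duration are placeholders, and the original script's actual command line isn't given.

```python
def iperf3_client_cmd(server_ip, seconds=300, rate="5M"):
    """Build an iperf3 client command for a rate-limited TCP stream.

    -c  connect to the iperf3 server at server_ip
    -b  cap the sending bitrate (5 Mbps here)
    -t  test duration in seconds
    -R  reverse mode: the server sends, the client receives
    """
    return ["iperf3", "-c", server_ip, "-b", rate, "-t", str(seconds), "-R"]

# Placeholder address; pass the list to subprocess.run() to actually start it.
cmd = iperf3_client_cmd("192.168.1.10")
print(" ".join(cmd))
```

Since iperf3 exits when the connection breaks, a wrapper would need to detect that and relaunch the command, which is roughly what the restart attempts described above amount to.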
In addition to the pcap capture file, a CSV log file was also created. At each attenuator step, the following was logged for Windows or Android:
- STA status (associated, scanning, etc.)
- IP address
- Associated BSSID
- Channel or frequency
- RSSI (dBm) for Android / Signal % for Windows
- Link Rate
- Atten 1 value
- Atten 2 value
Windows data came from connecting via SSH, running netsh commands and parsing the results. Android data was obtained via a custom app I had developed, which pulled information from the WifiInfo class in the android.net.wifi package. The data was then downloaded in JSON format via HTTP. Both the Windows and Android status data retrievals were made over the device's Wi-Fi connection. Since the connection could temporarily drop during the roam, the script was designed to poll the device until it retrieved data.
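To give a feel for the Windows side, the sketch below parses the kind of text `netsh wlan show interfaces` prints on an English-language system. The sample output is abbreviated and its values are illustrative, not taken from the test runs. The percent-to-dBm conversion uses the linear mapping Microsoft documents for WLAN signal quality (0% is -100 dBm, 100% is -50 dBm); treating netsh's Signal figure the same way is an assumption, so read the result as approximate.

```python
SAMPLE_OUTPUT = """
There is 1 interface on the system:

    Name                   : Wi-Fi
    State                  : connected
    SSID                   : ORBI
    BSSID                  : 08:02:8e:9f:39:c8
    Channel                : 40
    Receive rate (Mbps)    : 866.7
    Signal                 : 85%
"""

def parse_netsh_interfaces(text):
    """Turn 'Key : Value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

def signal_percent_to_dbm(percent):
    """Approximate RSSI: linear map from 0% -> -100 dBm to 100% -> -50 dBm."""
    return percent / 2 - 100

info = parse_netsh_interfaces(SAMPLE_OUTPUT)
signal_pct = int(info["Signal"].rstrip("%"))
print(info["State"], info["BSSID"], info["Channel"], signal_percent_to_dbm(signal_pct))
```

Splitting on only the first colon matters here because the BSSID value itself contains colons.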
Pal's log file captured more data:
- IP address
- STA status (associated, scanning, etc.)
- Associated BSSID
- Channel or frequency
- Tx Link Rate
- Rx Link Rate
- Roam status
- Roam scan channel
- Channel congestion
- Atten 1 value
- Atten 2 value
This data was retrieved in JSON format via Pal's Ethernet control connection.
People have different needs and expectations from Wi-Fi roaming. The toughest application is VoWiFi, i.e. voice callers using some form of VoIP over Wi-Fi. The rule of thumb for these users is that a roam must be completed in 50 - 100 ms to avoid call interruption.
Most of us, however, would be happy if our devices would move at all to a better connection instead of stubbornly hanging onto the first AP they see. And, oh yeah, it would be nice if when the device moved, it didn't drop our video call or make us have to restart the video we were watching.
Since my script only polls the connected device after each attenuator step, the CSV roam log isn't going to capture sub-second events. But since packet capture runs continuously, the data is there if we need it.
The attenuators were stepped every second, using 2 dB increments to move things along. But due to overhead of the different data capture methods, log entries were about every 1.5 seconds for Android, 2.5 seconds for Pal and 3 seconds for Windows. From what I saw, even these relatively rough capture resolutions were sufficient to catch most devices in the act.
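To put those log intervals in perspective against the VoWiFi rule of thumb, here is a quick calculation. It uses only numbers quoted above and takes the looser 100 ms end of the roam budget.

```python
ROAM_BUDGET_MS = 100  # upper end of the 50 - 100 ms VoWiFi rule of thumb

# Approximate log intervals quoted in the text, in milliseconds.
LOG_INTERVAL_MS = {"Android": 1500, "Pal": 2500, "Windows": 3000}

def budgets_per_log_entry(log_interval_ms, roam_budget_ms=ROAM_BUDGET_MS):
    """How many roam budgets fit between two consecutive CSV log entries."""
    return log_interval_ms / roam_budget_ms

for device, interval_ms in LOG_INTERVAL_MS.items():
    ratio = budgets_per_log_entry(interval_ms)
    print(f"{device}: one log entry per {ratio:.0f} VoWiFi roam budgets")
```

So the CSV log can show that a roam happened and roughly when, but judging whether a roam was fast enough for voice requires the packet capture timestamps.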
For the first example, the Pal's Roam Threshold was set to -70 dBm, Band Preference to Auto and Roam Target Threshold to -60 dBm. You generally want the target threshold set for a higher signal level than the roam threshold. For example, Apple says macOS uses a -75 dBm roam threshold and +12 dB target difference, i.e. -63 dBm.
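The relationship between the two thresholds can be sketched as a simple decision rule. This is a simplified model for illustration, not the Pal's (or Apple's) actual algorithm, and the function names are made up.

```python
def should_roam(current_rssi, candidate_rssi,
                roam_threshold=-70, target_threshold=-60):
    """Simplified roam rule (all values in dBm).

    Consider roaming only once the current signal has dropped to the roam
    threshold, and move only to a candidate at or above the target threshold.
    """
    return current_rssi <= roam_threshold and candidate_rssi >= target_threshold

def target_from_difference(roam_threshold, difference_db):
    """Target threshold expressed as an offset above the roam threshold."""
    return roam_threshold + difference_db

print(should_roam(-72, -55))            # weak link, strong candidate
print(should_roam(-65, -40))            # link still above the roam threshold
print(should_roam(-72, -65))            # candidate not strong enough
print(target_from_difference(-75, 12))  # the macOS figures quoted above
```

Keeping the target above the roam threshold gives the rule some hysteresis, so the STA doesn't hop to an AP that is only marginally better and then immediately want to roam again.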
Setting the band preference to Auto means Pal will select the strongest signal, which will naturally come from Orbi's 2.4 GHz radio. Neither Orbi—nor any other consumer Wi-Fi system I know of—automatically adjusts transmit power level to optimize roaming.
The plot below shows a round-trip roam session, with callouts marking roam points. This BSSID decoder will come in handy for decoding the following Pal logs, RSSI plots and packet captures.
|Device|2.4 GHz (Channel 6)|5 GHz (Channel 40)|
|---|---|---|
|Orbi Router (AP1)|0e:02:8e:9f:39:c5|08:02:8e:9f:39:c8|
|Orbi Satellite (AP2)|0e:02:8e:9f:3a:f6|08:02:8e:9f:3a:f9|
|Pal-245|04:F0:21:28:C6:80 / CompexPt_28:c6:80| |
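If you want to follow along in a script rather than by eye, the decoder translates directly into a lookup table. The helper names below are made up for this sketch. It labels a move between radios on the same AP a "band steer" and a move between APs a "roam", matching how the terms are used in this article.

```python
# BSSID -> (AP name, band, channel), copied from the decoder table.
BSSIDS = {
    "0e:02:8e:9f:39:c5": ("Orbi Router (AP1)", "2.4 GHz", 6),
    "08:02:8e:9f:39:c8": ("Orbi Router (AP1)", "5 GHz", 40),
    "0e:02:8e:9f:3a:f6": ("Orbi Satellite (AP2)", "2.4 GHz", 6),
    "08:02:8e:9f:3a:f9": ("Orbi Satellite (AP2)", "5 GHz", 40),
}

def decode(bssid):
    """Look up a BSSID, ignoring case and surrounding whitespace."""
    return BSSIDS.get(bssid.strip().lower())

def classify_move(old_bssid, new_bssid):
    """Label a connection change as a band steer (same AP) or a roam (new AP)."""
    old, new = decode(old_bssid), decode(new_bssid)
    if old is None or new is None:
        return "unknown"
    if old == new:
        return "no change"
    return "band steer" if old[0] == new[0] else "roam"

print(decode("0E:02:8E:9F:3A:F6"))
print(classify_move("0e:02:8e:9f:3a:f6", "08:02:8e:9f:3a:f9"))
print(classify_move("0e:02:8e:9f:39:c5", "0e:02:8e:9f:3a:f6"))
```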
The first roam was quick, occurring within one plot tick (~2.5 sec). The gaps in the plot (where there is no connecting line) show discontinuities. The first break (~105 sec) is between the active roam and after-roam monitor period. The second break (~130 sec) marks where the first monitor period ends and the return roam starts. The third break at the 210 second mark is where the second after-roam monitor period starts.
Roaming test - Pal w/ -70 threshold
Note that band-steering occurs right at the end of the 10 second after-roam monitor period. The Pal log file below shows RSSI jumping around due to Pal scanning activity. Note that Pal remains associated while it's scanning for a new connection, which should happen so that the connection is not dropped. But since Pal (and any other STA) has only one radio, it can't transmit or receive data during the scan time. Wi-Fi product designers must carefully manage scan time so that data handling isn't significantly affected.
Pal log - roam scan
So what made Pal change bands? The answer to that lies in the packet capture. I use a Wireshark filter to quickly home in on key roaming events.
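The exact filter used isn't given, so the following is only a guess at what such a filter could look like. It builds a Wireshark display filter from the 802.11 management frame subtypes that typically mark roaming events, using Wireshark's `wlan.fc.type_subtype` field. Action frames (subtype 0x0d) are included because 802.11v requests travel in them, but that subtype also matches other action frames.

```python
# 802.11 management frame subtypes that usually matter for roaming analysis.
ROAM_SUBTYPES = {
    0x00: "Association Request",
    0x01: "Association Response",
    0x02: "Reassociation Request",
    0x03: "Reassociation Response",
    0x0A: "Disassociation",
    0x0B: "Authentication",
    0x0C: "Deauthentication",
    0x0D: "Action (carries 802.11v BSS Transition Management frames, among others)",
}

def roam_display_filter(subtypes=ROAM_SUBTYPES):
    """Join the subtypes into one Wireshark display filter expression."""
    return " || ".join(
        f"wlan.fc.type_subtype == 0x{code:02x}" for code in sorted(subtypes)
    )

roam_filter = roam_display_filter()
print(roam_filter)
```

Probe requests and responses (subtypes 0x04 and 0x05) are left out on purpose; they are useful for studying scan behavior but quickly swamp the view.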
The fun actually starts back at the highlighted time 101.1, with a BSS Transition Management Request. This is an 802.11v feature that allows an AP to suggest or force a STA to roam, along with a suggested BSSID to roam to. (This 7Signal whitepaper is an excellent resource for understanding 802.11k,v and r features and modern roaming mechanics in general!)
Packet capture - 11v influenced roam
In this case, the currently connected AP—0e:02:8e:9f:3a:f6—is suggesting a move to 08:02:8e:9f:3a:f9, which turns out to belong to the same AP's 5 GHz radio set to Channel 40. Note, however, the roam doesn't happen until around 25 seconds later, when Pal—shown as CompexPt_28:c6:80 in the capture—finally says bye-bye by issuing a disassociation.
What happens in those 25 seconds? The good news is that the STA stays connected. But not shown in the filtered capture are the oh-so-many probes the Pal makes to the radio it's currently connected to, i.e. 0e:02:8e:9f:3a:f6. Pal first spends around 10 seconds probing the BSSID it's currently connected to on Channel 6, then 2 seconds issuing probes to the same BSSID on Channel 40, then goes back to Channel 6 for a few more seconds. It then pauses around 6 seconds before disassociating at time 125.6 and finally completes authentication and association to the BSSID suggested by the BSS Transition Management Request.
Note the last move back to AP1, Channel 6 (0e:02:8e:9f:39:c5) is done without an 11v BSS Transition request or disassociation. But the AP immediately disassociates Pal in the last frame shown, at time 183.2, most likely trying to get it to band steer to 5 GHz. Not shown in the above capture is that the AP follows up around 8 seconds later at time 191.3 and again at time 207.3 with BSS Transition Management Requests with 08:02:8e:9f:39:c8 as the candidate AP, which is AP1's 5 GHz radio. The capture ends before that roam is made.
Now let's see what happens with a "sticky" STA. To emulate one, I set the Pal's roaming and target thresholds to -95 dBm. As noted earlier, I couldn't disable Pal's 11v feature.
This time, the roam plot shows Pal doesn't move from its initial Channel 6 connection, which is the expected behavior since the RSSI doesn't fall to the -95 dBm setting. But why does a band-steer—on the same AP—occur toward the end of the run?
Roaming test - Pal w/ -95 threshold
The packet capture shows Orbi made many unsuccessful attempts to get our sticky Pal to move. It first disassociated Pal (time 20.2), which promptly connected right back. Orbi next tried to band steer Pal by issuing an 11v BSS Transition Management Request, suggesting a move to the same AP's 5 GHz radio (08:02:8e:9f:39:c8) @ 27.2 and again at 48.3. Then, for some reason, it tried to disassociate Pal from the same radio it just tried to steer it to! Orbi then tried one more 11v request @ 64.3 before giving up for a while.
Packet capture - sticky STA
The action picks up again @ 153.2 with another 11v BSS Transition suggestion, once again to 08:02:8e:9f:39:c8, followed by a disassociation @ 184.3. This last attempt finally succeeds in making Pal move @ 199.5. I would not count this as a successful roam!
If you're wondering about that Probe Request @ 114.7, so am I. The source MAC address (16:02:8e:9f:3a:f6) is the same as the Orbi Satellite's 2.4 GHz radio (0e:02:8e:9f:3a:f6), except for the first octet. I suspect it has something to do with Orbi's mesh management.
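The relationship between those two addresses is easy to check in code. The sketch below (helper names are made up) finds which octets differ and tests the locally administered bit, which is the 0x02 bit of the first octet of a MAC address. It doesn't explain why Orbi uses the address; it just confirms the observation.

```python
def octets(mac):
    """Split a colon-separated MAC address into a list of integers."""
    return [int(part, 16) for part in mac.split(":")]

def differing_octets(mac_a, mac_b):
    """Return the indexes (0-5) of the octets where two MACs differ."""
    pairs = zip(octets(mac_a), octets(mac_b))
    return [i for i, (a, b) in enumerate(pairs) if a != b]

def is_locally_administered(mac):
    """True if the locally administered bit (0x02 of the first octet) is set."""
    return bool(octets(mac)[0] & 0x02)

probe_source = "16:02:8e:9f:3a:f6"    # source of the mystery Probe Request
satellite_24 = "0e:02:8e:9f:3a:f6"    # Orbi Satellite 2.4 GHz BSSID
satellite_5 = "08:02:8e:9f:3a:f9"     # Orbi Satellite 5 GHz BSSID

print(differing_octets(probe_source, satellite_24))
print(is_locally_administered(probe_source), is_locally_administered(satellite_24))
print(is_locally_administered(satellite_5))
```

Both the probe source and the 2.4 GHz BSSID have the locally administered bit set, while the 5 GHz BSSID does not. That fits with the device deriving extra addresses from a base address rather than the probe coming from some unrelated device, although it doesn't prove it.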
Another thing that can affect roaming is whether the STA is busy, i.e. running traffic. This next test changes the Pal roam settings back to -70 threshold and -60 target, but this time starts a TCP/IP iperf3 stream, limited to 5 Mbps, right after the STA associates. The stream runs until the connection is broken, which happens when Pal roams and gets a new IP address. This really shouldn't happen and appears to be due to some interaction between Pal and Orbi that hasn't yet been tracked down.
Roaming test - Pal w/ -70 threshold w/ traffic
This roam actually appears smoother than the first example, which was done with the same Pal settings, but without traffic. The partial Pal log below shows roam scanning starts right when Pal hits -70 RSSI. Traffic ("Throughput" column), however, stops when the IP address changes (I think the lone 11 Mbps log entry is a glitch). The capture log shows a pattern of disassociations and 11v BSS Transition Management Requests similar to those shown in the first roam example.
Pal log - roam w/ traffic
So far, we've set Pal's band preference to Auto, which causes it to connect to the strongest signal when roaming. Our last experiment sets Pal's roaming band preference to 5 GHz, while leaving roam threshold at -70 dBm and roam target at -60 dBm. The roam plot would seem to indicate band preference doesn't ensure Pal always connects to 5 GHz.
Roaming test - Pal w/ -70 threshold, 5 GHz band preference, no traffic
The first part of the Pal log shows connection to the Orbi router's 5 GHz radio. Then a roam scan starting right when the roam threshold of -70 is hit results in a move to the 2.4 GHz radio on the same AP. But even though the move to 2.4 GHz puts RSSI above the roam threshold, Pal almost immediately starts scanning again @ 83.6. It finally moves to the Orbi satellite 5 GHz radio (08:02:8e:9f:3a:f9) 93.2 seconds into the run.
Pal log - roam with 5 GHz band preference
The packet capture shows Orbi and Pal battling over where each one thinks Pal belongs. Orbi first kicks Pal from the router's 5 GHz radio (Netgear_9f:39:c8), but it connects right back. No disassociation, deauth or 11v BSS Transition Requests are involved when Pal moves to Orbi router's 2.4 GHz radio at the 75.1 second mark. But Orbi tries to kick it right off with a disassociation @ 75.5. Pal finally makes its own decision to move, disassociating from Orbi router's 2.4 GHz radio @ 91.7 and immediately connecting to Orbi satellite 5 GHz (08:02:8e:9f:3a:f9).
Packet capture - Pal 5 GHz band preference
I don't know why Orbi satellite immediately tries to kick Pal off @ 92.1. But Pal reconnects immediately, staying put for 90 seconds before moving back to Orbi router's 2.4 GHz radio (0e:02:8e:9f:39:c5) @ 184.3. That radio immediately disassociates Pal, which immediately reconnects in response.
Packet capture - Pal 5 GHz band preference - more
Note the first time Orbi attempts to band steer Pal via 11v is 192 seconds into the run, recommending a move back to Orbi router's 5 GHz radio. The takeaway from this experiment is that a band "preference" is only that, not a guarantee.
So given the same conditions, is a device's roaming behavior always the same? Well, in real life, it would be impossible to duplicate the exact conditions from roam to roam. Device signal levels, RF environment including neighboring networks and traffic levels are constantly changing, all of which can affect roaming behavior.
Even under conditions as controlled as I can make them, roaming behavior can vary widely from run to run. The plot below (sorry for the eye chart) combines three runs, all with Pal roaming threshold set to -70 dBm, roaming target set to -60 dBm and traffic running at the start of each run. Roam 2 takes the prize for most device transitions, six in all. Actually, there were seven connection changes during the run if you count the first one, a band steer from Channel 6 to 40 that is not shown. This is why the run starts around 20 seconds later than the other two.
Roaming test - Pal w/ -70 threshold w/ traffic - three runs
I know I said we'd also look at roams using Windows and Android STAs, but that will have to wait for next time!