ChromeOS issue: Chromebooks pulling multiple IP addresses from DHCP server
Issue description

ChromeOS version: 63.0.3239.140
ChromeOS device model: Lenovo N23
Case#: 14725362
Description: Devices are constantly requesting to renew their IP address, and sometimes they request a new one
Server info: Server 2008 R2, Server 2012 DHCP servers
Steps to reproduce: This issue is not reproducible and it happens randomly
Current Behavior / Reproduction: Chromebooks are looking to renew their IP constantly or ask for a new one
Expected Behavior: Chromebooks join a network and keep their IP address for as long as the network allows
- Timeframe when issue started: As per customer, sometime between December 22 and January 4th.

Additional info:
- Drive link to logs: https://drive.google.com/open?id=1gGjBPFlZXO4ReDlx1kNOp15galbBORUF
- Customer info: https://drive.google.com/open?id=1KcUxbwzt73erD9izNTQ21t-v6ERrib-BoFFccctBZDo
- The similar error log with chrbug.com/3377990: https://drive.google.com/open?id=1XxvuKt_XOP8ba5PqIRx-zwNqZKdI46-qSVUXdvu_UVc
- DHCP info: https://drive.google.com/open?id=1aopDMTkAMqK_P1u9ZcxO_0LeC599RcSC9-X-6Cf8oQI
- DHCP log: https://drive.google.com/open?id=1RFdQ2widyRaTOos8XDReiCGSQncRUPdCQgPf9ew29lA
- Other additional comment from the customer: https://drive.google.com/open?id=1b98Q5q-0Sg20lRwCF3i0v5zRViNSLDG7pCwsMZP-Ofw

Troubleshooting steps taken:
- Disabled play store access for the entire domain
- Moved devices to different OUs
- Wiped devices
- Updated devices to the latest OS version
- Is this reproducible in Beta/Dev? No devices available in this version
- Existing Workaround: None
- Device version info: https://drive.google.com/open?id=1SJBC5KZPbq2MAEYx0wmSMCsM2jYP_mcq
- Device extension info: https://drive.google.com/open?id=1fMcQ-8AbgSy1_eVV4lVlcmnaVsddpiY0
- Device Policy info: https://drive.google.com/open?id=176UneAFwpF2j5y6AAHbOYU3o68LUGG86
,
Feb 8 2018
,
Feb 8 2018
First off - I should probably start by saying that I am the customer who is seeing this behavior and reported it. We have had ChromeOS devices for several years and had never seen this behavior until recently.

The behavior caused our entire DHCP scope to fill up. We have almost 12000 addresses allocated to this network segment and every single one filled, showing BAD_ADDRESS on the DHCP server. We observed it happening at a rate of 1000-1500 / day from a batch of 1500-2000 Chromebooks. We have always had our DHCP leases set to unlimited for device tracking purposes.

We saw this behavior once before on a VLAN dedicated to Cisco networking equipment - specifically Access Points. When you software updated them and rebooted, you would end up with many instances of "BAD_ADDRESS". Cisco has since programmed a workaround into their devices to stop them from causing this problem and I have not seen the issue since. I can potentially pull more details from that case if it would help engineering to fix this.

I know of a few others who are seeing this now as well and I have asked them to post on this ticket to give this issue further visibility. We know that it has to be something that changed with recent code updates to ChromeOS. We did not see this problem at all prior to the end of December.

We have pulled DHCP logs from our server and see the client requesting DHCP leases multiple times within a short period of time. It seems that when the client does this more than about 3 times, the server decides there is a conflict and puts the address into a BAD_ADDRESS state. We also saw some logs in our DHCP server from clients running older version of code - 58, etc... - that never reported this problem. The DHCP server specifically shows a different dhcpd version reported by the client - so maybe it is a bug in a newer dhcpd version.

I did manage to find a workaround for this issue, but it is not something that is sustainable in the long run. I have created 3000+ IP reservations on our DHCP server using PowerShell for all known Chrome devices that might connect to our network. Since doing this, I have only had 106 BAD_ADDRESS entries show up instead of 1000s. I believe that the ones showing up now are from devices we do not even own (since non-district devices end up pulling down network settings just like our managed devices). Thank you.
,
Feb 9 2018
Update my system profile<LG Staylo 2> jg2225179@gmail.com
,
Feb 9 2018
We are also seeing this issue on our network. Our Chromebooks are requesting multiple addresses, causing our DHCP server to fill up with a ton of BAD_ADDRESS entries. This issue is only happening on our Chromebook VLAN. We are not seeing any other issues on our other SSIDs or VLANs. I have also contacted both Cisco, who handles our wireless environment, and Juniper, who handles the switching environment, and they did not find any problems with their equipment that would be causing this problem.
,
Mar 1 2018
bhthompson@, could you please triage this case?
,
Mar 1 2018
Kevin, is this in your jurisdiction?
,
Mar 1 2018
,
Mar 5 2018
This may possibly be due to Chrome OS MAC Address randomization (see crbug.com/579598)
- Enterprise managed / enrolled Chrome devices should not randomize their MAC address. Are these devices enrolled?
- Setting unlimited (or effectively unlimited) lease times for DHCP is not recommended as a general IT best practice in my experience as you will eventually burn through the pool and it makes issues complicated to fix if device never renews.
- Recommend setting shorter lease times (less than 1 week, potentially just 24 hours) to increase IP availability.

If you are still seeing this issue please clarify:
1) are devices enrolled against a domain or not?
2) how long is your DHCP lease time?
,
Mar 6 2018
> Other additional comment from the customer:
> https://drive.google.com/open?id=1b98Q5q-0Sg20lRwCF3i0v5zRViNSLDG7pCwsMZP-Ofw

This shows two different IPs but the same MAC address. I don't think MAC address randomization is supposed to have any effect once the link comes up, for reasons like this. Guessing it is not a factor in this bug.

> -Drive link to logs:
> https://drive.google.com/open?id=1gGjBPFlZXO4ReDlx1kNOp15galbBORUF

The only suspicious thing I see in this net.log is:

2018-02-05T10:48:28.334141-06:00 INFO dhcpcd[1735]: status changed to Reboot
2018-02-05T10:48:28.340176-06:00 INFO dhcpcd[1735]: wlan0: ARP probing 10.66.1.1 (1 of 3), next in 1.2 seconds
2018-02-05T10:48:28.340243-06:00 INFO dhcpcd[1735]: wlan0: rebinding lease of 10.66.127.215
2018-02-05T10:48:28.340375-06:00 INFO dhcpcd[1735]: wlan0: sending REQUEST (xid 0xf2fc6d80), next in 3.4 seconds
2018-02-05T10:48:28.343781-06:00 INFO dhcpcd[1735]: wlan0: received NAK with xid 0xf2fc6d80
2018-02-05T10:48:28.343808-06:00 WARNING dhcpcd[1735]: wlan0: NAK (deferred): from 1.1.1.2
2018-02-05T10:48:28.343821-06:00 INFO dhcpcd[1735]: status changed to NakDefer
2018-02-05T10:48:28.346722-06:00 INFO dhcpcd[1735]: event GATEWAY-ARP on interface wlan0
2018-02-05T10:48:28.347547-06:00 INFO shill[1253]: [INFO:dhcpv4_config.cc(123)] Event reason: GATEWAY-ARP
2018-02-05T10:48:28.348465-06:00 INFO shill[1253]: [INFO:connection.cc(276)] UpdateFromIPConfig: Installing with parameters: local=10.66.127.215 broadcast=10.66.255.255 peer=<unknown> gateway=10.66.1.1
2018-02-05T10:48:28.349233-06:00 INFO shill[1253]: [INFO:service.cc(400)] Service 1: state Configuring -> Connected
2018-02-05T10:48:28.359706-06:00 INFO shill[1253]: [INFO:manager.cc(1455)] Service 1 updated; state: Connected failure Unknown
2018-02-05T10:48:28.368705-06:00 INFO shill[1253]: [INFO:wifi.cc(2372)] Enabling high bitrates.
2018-02-05T10:48:28.369973-06:00 INFO shill[1253]: [INFO:service.cc(400)] Service 1: state Connected -> Online
2018-02-05T10:48:28.370039-06:00 INFO shill[1253]: [INFO:manager.cc(1455)] Service 1 updated; state: Online failure Unknown
2018-02-05T10:48:28.386534-06:00 INFO shill[1253]: [INFO:manager.cc(1703)] Default physical service: 1 (connected)
2018-02-05T10:48:28.827588-06:00 INFO ModemManager[1519]: <info> Couldn't check support for device '/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0': not supported by any plugin
2018-02-05T10:48:30.819582-06:00 INFO dhcpcd[1735]: wlan0: sending REQUEST (xid 0xf2fc6d80), next in 3.4 seconds
2018-02-05T10:48:31.455113-06:00 INFO dhcpcd[1735]: wlan0: Handling deferred NAK
2018-02-05T10:48:31.458589-06:00 INFO dhcpcd[1735]: status changed to Release
2018-02-05T10:48:31.458785-06:00 INFO dhcpcd[1735]: status changed to Discover
2018-02-05T10:48:31.458908-06:00 INFO dhcpcd[1735]: wlan0: soliciting a DHCP lease
2018-02-05T10:48:31.458989-06:00 INFO dhcpcd[1735]: wlan0: sending DISCOVER (xid 0x1fd2188d), next in 3.6 seconds
2018-02-05T10:48:31.998858-06:00 INFO dhcpcd[1735]: wlan0: received OFFER with xid 0x1fd2188d
2018-02-05T10:48:31.998891-06:00 INFO dhcpcd[1735]: wlan0: offered 10.66.91.50 from 1.1.1.2
2018-02-05T10:48:31.998908-06:00 INFO dhcpcd[1735]: wlan0: requesting lease of 10.66.91.50
2018-02-05T10:48:31.998922-06:00 INFO dhcpcd[1735]: status changed to Request
2018-02-05T10:48:31.999161-06:00 INFO dhcpcd[1735]: wlan0: sending REQUEST (xid 0x1fd2188d), next in 4.6 seconds
2018-02-05T10:48:32.004382-06:00 INFO dhcpcd[1735]: wlan0: received ACK with xid 0x1fd2188d
2018-02-05T10:48:32.004429-06:00 INFO dhcpcd[1735]: wlan0: acknowledged 10.66.91.50 from 1.1.1.2
2018-02-05T10:48:32.004448-06:00 INFO dhcpcd[1735]: wlan0: leased 10.66.91.50 for infinity
2018-02-05T10:48:32.004591-06:00 INFO dhcpcd[1735]: event BOUND on interface wlan0
2018-02-05T10:48:32.004818-06:00 INFO dhcpcd[1735]: status changed to Bound
2018-02-05T10:48:32.005856-06:00 INFO shill[1253]: [INFO:dhcpv4_config.cc(123)] Event reason: BOUND
2018-02-05T10:48:32.006075-06:00 INFO shill[1253]: [INFO:connection.cc(271)] UpdateFromIPConfig: Flushing old addresses and routes.
2018-02-05T10:48:32.007603-06:00 INFO shill[1253]: [INFO:connection.cc(276)] UpdateFromIPConfig: Installing with parameters: local=10.66.91.50 broadcast=10.66.255.255 peer=<unknown> gateway=10.66.1.1

There are a bunch of other DHCP / UpdateFromIPConfig events in there, but they all used the .91.50 IP.

Is this the event we are trying to troubleshoot? If so, it would be useful to look at a packet trace + DHCP server logs to figure out what else is going on. (FWIW there was a fresh reboot at 10:48:22. So it's not a suspend/resume.)

The deferred NAK logic may be a Chrome OS specific addition to dhcpcd: https://groups.google.com/a/chromium.org/forum/#!topic/chromium-os-reviews/ZFuyROCDdcc

Bug 384897 suggests that this can involve situations where there are two DHCP servers answering requests for the same client.
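For anyone working through a longer net.log, the dhcpcd state changes and the addresses involved can be pulled out mechanically. What follows is only a minimal sketch in Python. The regular expressions are assumptions based solely on the line shapes quoted in this comment, and `summarize` is a hypothetical helper name, not part of any Chrome OS tooling.

```python
import re

# Patterns are assumptions derived from the dhcpcd lines quoted above.
STATUS_RE = re.compile(r"dhcpcd\[\d+\]: status changed to (\w+)")
REBIND_RE = re.compile(r"dhcpcd\[\d+\]: (\w+): rebinding lease of (\S+)")
LEASE_RE = re.compile(r"dhcpcd\[\d+\]: (\w+): leased (\S+) for")


def summarize(lines):
    """Return (states, addresses) seen in dhcpcd log lines, in order."""
    states = []
    addresses = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            states.append(match.group(1))
            continue
        match = REBIND_RE.search(line) or LEASE_RE.search(line)
        # Record an address only when it differs from the previous one.
        if match and (not addresses or addresses[-1] != match.group(2)):
            addresses.append(match.group(2))
    return states, addresses


# A subset of the lines quoted in this comment.
SAMPLE = [
    "2018-02-05T10:48:28.334141-06:00 INFO dhcpcd[1735]: status changed to Reboot",
    "2018-02-05T10:48:28.340243-06:00 INFO dhcpcd[1735]: wlan0: rebinding lease of 10.66.127.215",
    "2018-02-05T10:48:28.343821-06:00 INFO dhcpcd[1735]: status changed to NakDefer",
    "2018-02-05T10:48:31.458589-06:00 INFO dhcpcd[1735]: status changed to Release",
    "2018-02-05T10:48:31.458785-06:00 INFO dhcpcd[1735]: status changed to Discover",
    "2018-02-05T10:48:31.998922-06:00 INFO dhcpcd[1735]: status changed to Request",
    "2018-02-05T10:48:32.004448-06:00 INFO dhcpcd[1735]: wlan0: leased 10.66.91.50 for infinity",
    "2018-02-05T10:48:32.004818-06:00 INFO dhcpcd[1735]: status changed to Bound",
]

states, addresses = summarize(SAMPLE)
print(" -> ".join(states))
print(addresses)
```

A log where `addresses` has more than one entry is a candidate for the address change being discussed here.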
,
Mar 6 2018
BTW, if you do post logs, could you please make sure that they are all from the same network + time period? That would help me correlate specific events from net.log with packets on the wire and DHCP server events. It is less helpful to see logs from different sides of the "conversation" that aren't capturing the same exact events.
,
Mar 16 2018
Sorry for the lack of response on this - too many other things going on recently. This is still happening frequently.

You mentioned unlimited leases in one of your responses. I can elaborate on why we chose to do this initially. Before we had the ability to get a username from the Chrome device on our Internet filter (Palo Alto), the only way we could ensure valid log continuity was to do it by IP address. We needed to be able to search an IP in the log and know that we were looking at the same machine. We have since enabled SAML and hope to move to the fixed 802.1x policies soon, when they finally can pass username/password properly. That note aside, it was never an issue until this past December or so, when the BAD_ADDRESS problem suddenly started happening constantly. We also have 12000+ addresses available, and for 3000 systems this should last quite a long time.

All devices are domain enrolled, although we cannot always ensure that the only devices that hit this network are domain enrolled. There are also personal Chromebooks that end up hitting this network because Google does not let us restrict network policies by device, only by user. A student’s personal BYOD Chromebook should not be connecting to this network but there is no way to stop it right now.

Pulling logs is very difficult in this case. I had a system exhibit this behavior on my bench a while back, but I could not readily reproduce it. I basically let the computer sit for several hours and suddenly, out of nowhere, the system had a new IP address.

The only way we’ve been able to stop this is to create static DHCP reservations for the Chromebooks, but that is a gigantic pain to manage as we have to grab MAC addresses off any new batches of machines and we can’t account for those systems I mentioned above that are not ours.

We have 2 other VLANs where the leases are around 10 hours and I have never seen this issue there. Granted, there also aren’t Chromebooks on those VLANs because of what I mentioned above with the forced network policies on non-enrolled devices.

Tell us what you need from us in order to help (probably besides specific logs off a machine, because they are virtually impossible to pull). Thank you.
,
Mar 19 2018
> We also saw some logs in our DHCP server from clients running older version of code - 58, etc... - that never reported this problem. The DHCP server specifically shows a different dhcpd version reported by the client - so maybe it is a bug in a newer dhcpd version.

Hmm, what are the good and bad version numbers? I did a quick check of the net-misc/dhcpcd ebuild and it doesn't seem to have changed since Apr 2016. net-misc/dhcp is newer, but that doesn't supply our client.

If logs are unavailable, a packet trace could still help. Maybe set up a sniffer recording all DHCP traffic all day long, and then locate one instance of the same MAC address getting two different IPs?
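Once the DHCP ACKs from such a capture have been reduced to (MAC, IP) pairs, finding the interesting clients is a small grouping job. This is a rough Python sketch under stated assumptions: the extraction from the capture is not shown, `macs_with_multiple_ips` is a hypothetical helper, and the MAC addresses in the sample are made up (taken from the documentation range), not from any real device on this network.

```python
from collections import defaultdict


def macs_with_multiple_ips(acks):
    """Map each MAC that was ACKed more than one distinct IP to those IPs.

    `acks` is an iterable of (mac, ip) pairs in capture order.
    """
    seen = defaultdict(list)
    for mac, ip in acks:
        mac = mac.lower()  # captures may mix upper and lower case
        if ip not in seen[mac]:
            seen[mac].append(ip)
    return {mac: ips for mac, ips in seen.items() if len(ips) > 1}


# Made-up sample: one client changes address, one keeps its address.
SAMPLE_ACKS = [
    ("00:00:5E:00:53:01", "10.66.127.215"),
    ("00:00:5e:00:53:02", "10.66.12.7"),
    ("00:00:5e:00:53:01", "10.66.91.50"),
    ("00:00:5e:00:53:02", "10.66.12.7"),
]

suspects = macs_with_multiple_ips(SAMPLE_ACKS)
for mac, ips in suspects.items():
    print(mac, ips)
```

Any MAC that shows up in the result is worth isolating in the full trace to see what happened between the two ACKs.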
,
Mar 19 2018
I don't have specifics on versions that seem to be impacted, I just know that I've never seen any of the older versions of dhcpd in the log.

I did get a .pcap from our Windows DHCP server today and isolated the capture to a specific client. Basically - I see
DHCP Request
DHCP ACK
DHCP Decline
DHCP Discover
DHCP Offer
DHCP Request
DHCP ACK
DHCP Decline
DHCP Discover
DHCP Offer
DHCP Request
DHCP ACK

How do I share? Should I upload to Google Drive and share it with everyone assigned to this case?
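The telling pattern in that sequence is a Decline arriving right after an ACK. A minimal Python sketch for counting it in a per-client list of DHCP message types is below. The list is assumed to have been transcribed from the capture by hand or by another tool, and `count_ack_then_decline` is a hypothetical name.

```python
def count_ack_then_decline(messages):
    """Count how often a DECLINE immediately follows an ACK."""
    normalized = [m.strip().upper() for m in messages]
    return sum(
        1
        for prev, cur in zip(normalized, normalized[1:])
        if prev == "ACK" and cur == "DECLINE"
    )


# The sequence reported above for a single client.
OBSERVED = [
    "Request", "ACK", "Decline",
    "Discover", "Offer", "Request", "ACK", "Decline",
    "Discover", "Offer", "Request", "ACK",
]

declines_after_ack = count_ack_then_decline(OBSERVED)
print(declines_after_ack)
```

If the server marks every declined address as BAD_ADDRESS, each such ACK/Decline pair would account for one lost address.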
,
Mar 19 2018
Sure you can just share it with me and I'll take a look.
,
Mar 19 2018
You should get a sharing request for the file I just uploaded. That file will have the data just for that one particular host. I have a less filtered file I can pull also if you need to compare to regular behavior. This particular Chromebook seems to be one that isn't district owned (but insists on connecting to this network because of how policies are distributed). I searched our enrolled Chromebooks and couldn't find a match. All I know is it has an Intel NIC. It might be an Acer.
,
Mar 19 2018
Seeing all traffic to/from the affected device, within 30 seconds of packet #8, would be helpful. Looks like there are a couple of different ways that dhcp_decline() can be reached in the code. Some paths leave messages in the log, others don't.
,
Mar 19 2018
One possibility is that the client is performing the "arpgw" check mentioned here, and it is failing: https://bugs.chromium.org/p/chromium/issues/detail?id=377990#c2 If you're seeing an unusually large number of arpgw failures, that might point to another issue (which is something that quiche alluded to in the comments on that bug). But if you just see the occasional arpgw failure, and that causes a DHCP Decline that results in the IP address being blacklisted forever, you may want to tell your DHCP server to release declined IPs instead of putting them into BAD_ADDRESS state.
,
Mar 22 2018
Before we made DHCP reservations for the clients, it was happening very frequently. Enough that it caused 12000 addresses to be consumed by 3000 devices. Regardless of what is causing this - this is a new issue that began towards the end of last year / start of this year. It doesn't happen on any other device types (even on the same VLAN). The only change we made on the VLAN was to switch it from FlexConnect to Controlled Switching. We did that awhile back when we enabled SAML SSO on the Chromebooks because SAML requires an ACL on the Wireless Network to control redirection to the SAML page and that can't be done locally on the APs - has to be at the Controller level. We hope to switch away from that again when Google finally releases the fix for Bug 377990 - 802.1x passing of passwords (although I was just testing that today and it isn't working). Back to what I had said before - we would prefer that clients keep the same IPs indefinitely. I just had a situation yesterday to prove why we like that. We had a system that has been missing since January, apparently. I didn't know who last used the device, but I knew the IP address that it typically used. So, I was able to dig back into firewall and wireless system logs to determine exactly where the computer was last located. If the IP changes, the logs are not consistent. I would still think there are things that can be done at the client to mitigate this sort of problem.
,
Mar 22 2018
Sorry - my reference above should be Bug 386606 instead of 377990.
,
Mar 22 2018
If the Chromebook is sending DHCP Decline for an improper reason, we can investigate and fix that. Logs + packet traces would be helpful in tracking down these issues. If the Chromebook is sending DHCP Decline for a legitimate reason (including transient failures), then the networking infrastructure at the site is expected to be able to recycle the declined IPs. AIUI this network is configured as a /16 with 3,000 devices on the same broadcast domain? You might want to experiment with splitting it up. Maybe some of the DHCP/ARP traffic is getting dropped and making things unreliable.
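To get a feel for what splitting the network would look like, the standard-library `ipaddress` module can enumerate candidate subnets. This is only a sketch: the 10.66.0.0/16 range is inferred from the broadcast=10.66.255.255 value in the net.log quoted earlier, and the /20 target size is an arbitrary illustration, not a recommendation.

```python
import ipaddress

# Assumption: the Chromebook network is 10.66.0.0/16, inferred from
# broadcast=10.66.255.255 in the net.log excerpt in this thread.
network = ipaddress.ip_network("10.66.0.0/16")

# Illustrative split into /20 blocks; the prefix length is arbitrary.
subnets = list(network.subnets(new_prefix=20))

# Usable hosts exclude the network and broadcast addresses.
usable_per_subnet = subnets[0].num_addresses - 2

print(len(subnets), "subnets,", usable_per_subnet, "usable hosts each")
for subnet in subnets[:3]:
    print(subnet)
```

Smaller broadcast domains mean fewer hosts answering or dropping each ARP probe, which is the reliability concern raised above.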
,
Mar 23 2018
From the log on the DHCP server, we're seeing an Ack on the request and then suddenly it denies it because it for some reason requests the IP again. That is the odd behavior.

On the topic of 3000 devices on a broadcast domain - we have other VLANs across the district that have almost that many devices as well without issue. Those VLANs primarily have Microsoft devices on them, however. Also, I believe the user behind Comment #5 only has around 1000 devices on the VLAN and they're seeing this same behavior. I would say that, on average, there are probably 1000-1500 of those devices at most ever active at the same time.

I can attempt to simulate the problem on a test Chromebook again by removing the static DHCP assignment, as I did successfully get a client to have the IP shift randomly during the day when no static lease was in place. Even that DHCP packet capture from the other day was a challenge by itself, and it took nearly 5 hours of watching to finally see a client exhibit the problem. Do I simply need to go under net-internals and start the capture?

I'm still theorizing that maybe this has something to do with the Android subsystem. On both my network and the network of the individual on Comment #5, we first started seeing this issue around December and had never seen it before in 4+ years. I know there were other instances of people seeing this in the past, but in both of our cases, we had never seen it before. Too bad it's not possible to roll an entire domain back to an older version temporarily. You mentioned that maybe a newer version at some point might just fix this - but until then - we sit with broken networks.
,
Mar 23 2018
> From the log on the DHCP server, we're seeing an Ack on the request and then suddenly it denies it because it for some reason requests the IP again. That is the odd behavior.

There is code in dhcpcd which does send DHCP Decline after Ack under some circumstances. I think we would need to get a packet capture (e.g. via tcpdump) to see exactly what was happening. I do not think the net-internals event log will have the info we need.

IMO your most practical path forward is to configure the DHCP server so that it can gracefully handle DHCP Declines by recycling or reissuing the rejected IPs. We can track down the root cause for the declines, and if there's something going wrong, possibly eliminate a bunch of them. But there is no way to guarantee that they will never happen.
,
Mar 26 2018
I updated a client to the 66 build to test the new 802.1x username/password variables - which work fine with enrolled devices, but are broken for non-enrolled devices. A non-enrolled device signed in with a managed user ends up hammering our authentication server with constant authentication requests.

That issue aside - I let it sit all weekend and the IP address didn't change. I left it on the chrome://net-internals screen with the Events tab up and the "Include the actual bytes sent/received" setting enabled and it didn't have the issue. I still saw traffic to the UDP_SOCKET at 100.115.92.1. Oddly, on this older version, it shows UDP_SOCKET to 224.0.0.251:5353 instead. I'm going to let this system sit overnight on version 64 to see if the issue occurs. Maybe this is something that's fixed in the new builds and I can stop harping on you all for a fix.

Our desire is for the DHCP server to never have to recycle an address. On our scope for Windows devices, we have never seen this issue and there are over 2800 addresses issued. This is the exact same DHCP server for both scopes too - the only difference is the client device. I'll report back with the status of the test on this v64 build.
,
Apr 17 2018
Hello, I have the same issue in my school. We have 200 reservations on a Windows Server 2008 R2 DHCP server and sometimes BAD_ADDRESS entries appear in our list, only for Chromebooks. We do not have problems with the other devices, including a lot of Android devices. It seems that for some reason Chrome OS does not work well with reservations on Windows DHCP.
,
May 8 2018
I'm still seeing this issue frequently in our environment. I would imagine that many devices are now running the latest code - which means it isn't fixed in 64+. I just purged these the other day and now we have another 92 BAD_ADDRESS showing up in the DHCP console. What else can we do to help you isolate the issue? I've attempted to pull logs, but it is very difficult to recreate the issue consistently. We are using reservations for all systems right now, so the 92 that are in the console right now are likely from devices that do not have reservations created. Any update would be appreciated.
,
May 8 2018
We have updated all our Chromebooks to 64+ and the problem remains. We noticed that the problem tends to happen when a student or teacher loses the AP signal. I think we need more stars on this issue.
,
May 8 2018
All our Chromebooks are also running version 64+. We still continue to see multiple issues with BAD_ADDRESS in our DHCP server. This is only happening on the WLAN for the Chromebooks, all other devices are fine.
,
May 21 2018
I noticed that the problem occurs when the Chromebook loses the AP signal. When it tries to connect again, the conversation between the Chromebook and the Windows DHCP server does not work properly. Do you see the same in your environment?
,
May 21 2018
That very well might be happening in our environment, but I at one point had a system on my bench that randomly grabbed a new IP and it had been sitting in the same spot the entire day. Can an engineer assigned to this request please respond and let us know some next steps? I really do not want to have to create reservations for every new batch of Chromebooks that we buy in the future. If it was totally up to me, we wouldn't even be buying Chromebooks because of all this stupid network stuff that continually happens. No other devices have this many network problems constantly.
,
Dec 7
Hello! This bug is receiving this notice because there has been no acknowledgment of its existence in quite a bit of time.
- If you are currently working on this bug, please provide an update.
- If you are currently affected by this bug, please update with your current symptoms and relevant logs.

If no updates have been provided by EOD Wednesday, 12/12/18 (5pm EST), this bug will be archived and can be re-opened at any time deemed necessary. Thank you!
,
Dec 7
This issue seems to have gone away on its own (or was fixed without anyone realizing it). We had created static DHCP entries when this was bad last spring, but we got some new Chromebooks over the summer and I didn't make static DHCP entries for any of them, and I don't think any of them have had this problem. I just checked our DHCP server and there were only 12 of the BAD_ADDRESS notices. I'm assuming that the other people who were having this problem would also concur that it has been resolved in their environments, but they can definitely chime in if it hasn't. Now, if we could get the systems to actually publish a hostname when they do DHCP requests, we would be in business.
,
Dec 8
The problem remains in my environment: a Windows Server 2008 R2 DHCP server with reserved IPs and a Ubiquiti wireless system. We get BAD_ADDRESS entries in our list every day, only for Chromebooks. We have 200+ devices and only Chromebooks show this behaviour. If you need details, I can send them.
,
Dec 13
Changing priority due to lack of response from Eng. cernekee@ if this is something you're still working on would you be able to assist with confirming what information would still be helpful to investigate this issue?
,
Dec 13
cernekee@ no longer works on Chrome OS, so you're probably going to have to find someone else to look at this issue.
,
Dec 13
Abhishek, do you know who might own this sort of bug these days?
,
Dec 19
Assigning to @benchan for routing.
,
Dec 19
It was suggested to assign this to benchan@. Benchan, is this something you can assist with?
,
Jan 7
For those that are still experiencing this issue (NOTE: If this is affecting your domain please respond from your Domain email address and NOT your Gmail account):

It appears due to time this thread has become difficult to follow for further analysis. Please submit the following information:

ChromeOS version:
ChromeOS device model:
Case#:
Description:
Server info:
Steps to reproduce:
Current Behavior / Reproduction:
Expected Behavior:
- Timeframe when issue started:

Additional info:
- Drive link to logs:
- Customer info:
- DHCP info:
- DHCP log:
- Other additional comment from the customer:

Troubleshooting steps taken:
- Is this reproducible in Beta/Dev?
- Existing Workaround:
- Device version info:
- Device extension info:
- Device Policy info:
,
Jan 17
Due to lack of action this bug has been Archived. If work is still being done on this issue or you are still experiencing this issue please feel free to re-open with the appropriate information. Before re-opening, please provide the information requested in comment #39.
Comment 1 by ryutas@chromium.org
, Feb 8 2018